Extreme scale general-purpose processor

After discussing the limitations of current technology and why it cannot be scaled up, I propose an alternative technology for an extreme scale general purpose processor with a target of 60 double precision TFLOPS.

Introduction
############
############


Performance \(P\) is the product of the performance of each core and the number of cores \(N\) effectively used by the software \[ P = N \, P_\mathrm{core} \] The performance of each core is, in turn, the product of the performance-per-cycle PPC --very often called IPC-- and the working frequency \(F\) \[ P_\mathrm{core} = \mathrm{PPC} \times F \] A current x86 datacenter CPU provides a peak of 691.2 double precision GFLOPS at a TDP of 200W; our goal is to increase that performance by two orders of magnitude. A priori, we can increase any of the factors \(N\), \(\mathrm{PPC}\), or \(F\) to increase the total performance, but there are limits. Classic scaling laws for silicon ceased to work about a decade ago; further node shrinks no longer halve the operating voltage and, as a consequence, frequencies stopped increasing. This is commonly named the "frequency wall", and it is clearly visible when we plot the evolution of frequencies over the years

IMG

So we cannot increase the frequency of current cores to 270GHz to get two orders of magnitude higher performance. We could then try to increase the performance-per-cycle. Current datacenter x86 CPUs use superscalar out-of-order cores. As we discussed (previously), the complexity of this kind of core increases nonlinearly with the PPC; an increase by two orders of magnitude would require new cores about 10000 times more complex than current x86 cores, which is simply impossible from a technical viewpoint; a single one of those hypothetical ultra-high-performance cores would be 100 times bigger and more complex than the largest commercial die today!
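
As a back-of-the-envelope check of these figures (a minimal sketch in C; the 32 cores, 8 DP FLOPS per cycle and 2.7GHz below are the baseline factorization implied by the numbers used in this section, not the datasheet of any specific CPU):

/* Peak performance model P = N * PPC * F, and what reaching 100x would
   require if only one factor were scaled.  Baseline values are inferred. */
#include <stdio.h>

int main(void) {
    double N   = 32.0;    /* cores (inferred)                             */
    double PPC =  8.0;    /* double precision FLOPS per cycle (inferred)  */
    double F   =  2.7e9;  /* working frequency in Hz (inferred)           */

    printf("baseline peak : %.1f GFLOPS\n", N * PPC * F / 1e9);  /* 691.2 */
    printf("frequency only: %.0f GHz\n", 100.0 * F / 1e9);       /* 270   */
    printf("PPC only      : %.0f per cycle, ~10000x complexity\n",
           100.0 * PPC);                                          /* 800   */
    printf("cores only    : %.0f cores\n", 100.0 * N);            /* 3200  */
    return 0;
}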

So the only remaining option to achieve the target performance seems to be adding more cores to the existing x86 CPUs. This is also impossible: even with the increased transistor density provided by future nodes, we cannot make a 3200 core CPU using current x86 cores, because those cores are too big and power inefficient (as we will see below). A combination of improvements on all the factors at once, for instance 48% higher frequency, 20% higher PPC, and 50 times more cores (a combined factor of \(1.48 \times 1.2 \times 50 \approx 89\)), is still technically impossible. We have to return to the drawing board and design something completely new.



Extreme scale general purpose processor
#######################################
#######################################


In the first place, we have to abandon wide superscalar out-of-order x86 cores, because they are very inefficient in both power and area terms. Indeed, a modern superscalar out-of-order x86 core can execute up to six RISC-like instructions (uops) per cycle; however, when executing the SPEC suite of benchmarks the average is less than 2 instructions per cycle. With ten execution ports, this means about 80% of the back-end of the core is idle on average

IMG

One of the main reasons for such low average utilization of the execution resources is the "memory wall". The memory wall refers to the huge discrepancy between current CPU and memory speeds. On modern x86 cores a cache miss can take as many as 300 execution cycles. To avoid stalling, the x86 core would have to speculatively fetch and decode several thousand instructions beyond the one causing the miss, and execute all the instructions that do not depend on the miss; however, current x86 cores can only keep 192 instructions in flight, so most of the time they are waiting for main memory.
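
To make the mismatch concrete, hiding a 300-cycle miss on a core issuing 6 uops per cycle would require keeping \[ 6 \times 300 = 1800 \] instructions in flight, almost ten times the 192 the core can actually track.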

Even if we somehow solve the memory wall problem, there is not enough Instruction Level Parallelism (ILP) in generic code to keep all execution units busy every cycle. One often reads studies claiming otherwise: several dozen or even hundreds of instructions of ILP available in the code and ready to be exploited! But those studies are based on idealized conditions such as infinite instruction windows and register files, perfect branch predictors and caches, an unlimited number of execution units, and so on. When one relaxes those conditions to realistic --still very optimistic-- assumptions, the amount of parallelism drops drastically to an ILP of about 3--6.

This combination of hardware and software limits is the reason why superscalar out-of-order x86 cores are very inefficient. A powerful datacenter-class x86 core has a ratio of about 2.5 GFLOPS/mm² on a 14nm node. On the other hand, a highly efficient 4-wide VLIW core can produce 2 GFLOPS while occupying only 0.15 mm² on the same node. This means we can pack about 5 times more performance into the same area.
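
From those numbers, the area efficiency of the small core is \[ \frac{2\ \mathrm{GFLOPS}}{0.15\ \mathrm{mm}^2} \approx 13.3\ \mathrm{GFLOPS/mm^2} \] versus 2.5 GFLOPS/mm² for the big x86 core, a ratio of slightly more than 5.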

So we could fill our die with thousands of 4-wide VLIW cores, and it would work fine for massively parallel workloads (with a parallelization degree of 95% or more), but it would fail for the rest of the code, per Amdahl's law. Strong single thread performance is still a requirement, not only for workloads that are not parallel in nature, but also for workloads that are parallelized by some kind of master-slave scheme, because the master thread is bound to the performance of the fastest available core.
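
Amdahl's law makes the limit explicit. For a parallel fraction \(p = 0.95\), the speedup over a single core is \[ S(N) = \frac{1}{(1-p) + p/N}, \qquad S(1024) = \frac{1}{0.05 + 0.95/1024} \approx 19.6, \] and no amount of additional small cores can push it past \(1/(1-p) = 20\).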

We could try to fill our die with two kinds of cores at once: lots of small and efficient cores for throughput (TC) accompanied by a handful of big and power hungry cores (LC) with high single thread performance for latency-bound code. We could run the operating system, master threads, and legacy code on the big cores and the rest on the small cores. A heterogeneous design like the next,

IMG

but with 64 LC and 1024 TC, for instance. A problem with this kind of design is that workloads have different requirements for latency and throughput. A massively parallel workload would require more TCs and fewer LCs, whereas another workload would require just the contrary. Splitting the die into a fixed number of LCs and a fixed number of TCs is not the optimal configuration for specific workloads.

Moreover, we still have the performance scaling problems for the LC cores. It would be very difficult to build an LC core that can sustain an ILP higher than two, bottlenecking the whole processor. We need a radically new approach.

I propose to use lots of small and efficient cores to produce the highest throughput possible by exploiting ordinary thread-level parallelism and then, dynamically, combine two or more of those cores to accelerate a single thread when needed. This cooperation between cores can be achieved in two main ways.

One possibility consists of executing the thread on one core while two or more cores execute helper threads, whose only purpose is to accelerate the main thread. This acceleration is achieved by pre-executing instructions ahead of the main thread to help resolve difficult-to-predict branches and by prefetching data. Prefetching works by triggering cache misses before they happen in the core executing the main thread. Suppose an instruction (LOAD X) in the main thread will generate a cache miss, forcing the core to wait for the data to return from main memory. A helper thread could execute a copy of this instruction 300 cycles ahead of the main thread, generating the cache miss before the main thread needs the data, so that the data is already available in the cache when the main thread reaches the load, accelerating its execution.

IMG
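
A rough software analogue of the idea, as a minimal sketch using POSIX threads and the GCC/Clang __builtin_prefetch intrinsic (compile with -pthread; in the proposed design the helper thread would be generated automatically and run on a neighbouring core, not written by hand):

/* Main thread: irregular, cache-unfriendly reduction.                     */
/* Helper thread: stripped-down copy of the same loop that only computes   */
/* addresses and touches them early, playing the role of the prefetcher.   */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 22)            /* working set size (arbitrary)             */

static double *data;
static int    *idx;

static void *helper(void *arg) {
    (void)arg;
    for (long i = 0; i < N; i++)
        __builtin_prefetch(&data[idx[i]], 0, 0);  /* generate the miss early */
    return NULL;
}

int main(void) {
    data = malloc(N * sizeof *data);
    idx  = malloc(N * sizeof *idx);
    for (long i = 0; i < N; i++) { data[i] = 1.0; idx[i] = rand() % N; }

    pthread_t h;
    pthread_create(&h, NULL, helper, NULL);  /* runs ahead: it does less work */

    double sum = 0.0;                        /* main thread                   */
    for (long i = 0; i < N; i++)
        sum += data[idx[i]];

    pthread_join(h, NULL);
    printf("sum = %.0f\n", sum);
    free(data); free(idx);
    return 0;
}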

A second possibility consists of breaking the thread that has to be accelerated into two or more smaller threads using speculative techniques. A discussion of speculative thread-level parallelization was given here.

The helper threads approach can speed up the main thread by decent amounts, such as 40%, depending on the kind of code and the quality of the helper threads. For instance, if the main thread is memory-bound, then a helper thread generating accurate data prefetches will produce a higher acceleration than if the main thread is compute-bound and there are few cache misses. Ignoring implementation issues, the main defect of the helper threads approach is its limited speedup scaling. E.g. once you have an accurate helper thread prefetching data and eliminating cache misses from the main thread, the main thread cannot be further accelerated by prefetching. You could launch a second helper thread doing branch prediction and eliminating mispredictions from the main thread, but once that is done no further acceleration is available from that source either. You can only have one, two, maybe a handful more helper threads, which limits the speedup of the main thread.

Speculative parallelization is different. A priori, you can always split the thread that has to be accelerated into a large number of smaller threads. Instead of splitting a thread into two threads and running them on two cores, split it into four threads and run them on four cores, or split it into eight threads... you get the idea. This offers a huge advantage over helper threads, because one can always, in principle, add more cores to reduce the execution time. Speedups of one or two orders of magnitude are possible.

Performance for ordinary thread-level parallelization grows as \(O(N)\), with \(N\) the number of threads. Therefore a workload that needs a time \(T\) for completion on a single core is reduced to \(T/N\) when it is split into \(N\) threads and executed on \(N\) cores. We are considering only the part of the workload that is parallelized, so the limits from Amdahl's law do not apply here. Performance for speculative thread-level parallelization grows only as \(O(\sqrt{N})\). Performance grows less than linearly due to speculation failures and the control mechanisms needed to verify the speculations.
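
With those two scaling laws, the contrast for \(N = 64\) cores is \[ S_\mathrm{TLP}(64) = 64, \qquad S_\mathrm{spec}(64) = \sqrt{64} = 8, \] which is precisely the trade-off discussed next.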

It could seem that devoting 64 cores to produce a speedup of only eight is a waste of resources. But keep in mind this is exactly the same kind of nonlinear scaling that happens for instruction-level parallelism, where a speedup of two requires about four times more transistors. Now we could ask ourselves: if speculative thread-level parallelization has the same scaling law as instruction-level parallelism, would it not suffer the same technical scaling issues, limiting the practical speedup? The answer is no. Instruction-level parallelism relies on centralized structures --a single instruction cache, a single branch predictor, a single register file, a single reorder buffer, and so on-- for doing speculation, whereas speculative thread-level parallelization relies on distributed structures: \(N\) instruction caches, \(N\) branch predictors, \(N\) register files,... Scaling up centralized structures increases latencies, and this hurts performance by reducing the working clock of those structures. Said another way, you cannot design a giant core with a register file 10 times bigger than current cores and keep the same 3GHz working frequency. However, distributed structures can be scaled up without affecting latencies; we can have 10 cores working at 3GHz instead of a single core working at 3GHz.

============================================================
Research in progress.
============================================================

Target:
5nm
>60 TFLOPS
Direct graphics rendering (no API)

Core pipeline:
        ________
       |        |
       |        V
[IF] [BT] [BO] [ID] [EX] [MEM] [WB]
|__|           |__________________|
RISC                  VLIW

BT = Binary Translator
BO = Binary Optimizer

Dies configuration:
|CCCC||MMMM||CCCC|
|CCCC||MMMM||CCCC|
|CCCC||MMMM||CCCC|
|CCCC||MMMM||CCCC|

M = memory
C = compute

Memory:
NVRAM···CHIP···NVRAM

Software pre-scheduling (optional)

In the original approach the BT+BO stages extract ILP from an ordinary RISC-like sequence of instructions. Additionally, we consider moving part of the hardware work to the compiler. In this alternative design the compiler extracts the ILP and produces abstract pseudo-VLIW bundles of an arbitrary number of instructions, with the bundle boundaries explicitly marked by the compiler with NOPs, as in the next example
add $r13 = $r3, $r0
sub $r16 = $r6, 3
;;
shl $r13 = $r13, 3
shr $r15 = $r15, 9
ld.w $r14 = 0 [$r4]
;;


;; = NOP
Then a simplified core detects the NOPs and builds the VLIW bundles corresponding to the machine execution model. This simplified core basically needs only a BT stage and moves most of the BO stage to the compiler, simplifying the hardware even more.
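
A minimal sketch in C of what this bundle-building step does with the example stream above, assuming a 4-wide machine model and keeping the instructions as plain text for illustration:

/* Build fixed-width machine bundles from a compiler-produced pseudo-VLIW
   stream whose bundle boundaries are marked with the ";;" pseudo-NOP.    */
#include <stdio.h>
#include <string.h>

#define SLOTS 4                       /* machine bundle width (assumed)   */

static void emit_bundle(const char *slot[], int used) {
    for (int s = 0; s < SLOTS; s++)   /* pad unused slots with real NOPs  */
        printf("[%s] ", s < used ? slot[s] : "nop");
    printf("\n");
}

int main(void) {
    const char *stream[] = {          /* the example bundles from above   */
        "add $r13 = $r3, $r0", "sub $r16 = $r6, 3", ";;",
        "shl $r13 = $r13, 3", "shr $r15 = $r15, 9",
        "ld.w $r14 = 0 [$r4]", ";;",
    };
    const char *slot[SLOTS];
    int used = 0;

    for (size_t i = 0; i < sizeof stream / sizeof *stream; i++) {
        if (strcmp(stream[i], ";;") == 0) {   /* boundary marker found    */
            emit_bundle(slot, used);
            used = 0;
        } else if (used < SLOTS) {
            slot[used++] = stream[i];         /* fill next execution slot */
        }
    }
    return 0;
}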

Latency:

The BO stage provides tolerance to instruction execution latencies. Runahead mode is used to tolerate cache miss latencies. Runahead with reuse is being evaluated. The goal is to provide OoO performance with nearly in-order (IO) hardware complexity. During runahead mode the BO stage is shut down and the pipeline works with the bypass [BT] --> [ID].

Branches:

Multipath execution for hard-to-predict branches.

Execution engine:

Two VLIW configurations are being evaluated: 4-wide and 8-wide.

4-wide: [branch] [int] [mem] [float]

8-wide: [branch] [int] [int] [int] [mem] [mem] [float] [float]



Comparison with other approaches:
             RISC --> VLIW         CISC --> VLIW         VLIW --> VLIW
-----------------------------------------------------------------------------
Transmeta                          Static (software)     Dynamic (software)
Denver       Dynamic (hardware)                          Dynamic (software)
This         Dynamic (hardware)                          Dynamic (hardware)


Comparison with superscalar:

This approach has two advantages over superscalar: efficiency and modularity.

The VLIW part of the pipeline is much simpler than a superscalar pipeline of the same width. We are talking here about one half or one third of the complexity of the superscalar approach; even the decode stage of a VLIW is simpler than the decode stage of a superscalar. The fetch stages are similar, and the binary translation stage in the new design is rather simple; all the complexity is hidden in the binary optimizer stage, but the new design allows a modular approach.

The ILP and OoO logic on a superscalar core work on uops, whereas the binary optimizer in the new design works on the target ISA instructions. This means that the optimizer has a synergy with the compiler; it is possible to move optimizations from the compiler to the core and back, finding the optimal hardware/software split, unlike in a superscalar approach where the compiler and the superscalar logic are decoupled.
Configuration A:   Base code --> Optimization 1 --> Optimization 2 --> Optimization 3  --> Executing code
                                 |____________|       |_________________________________________________|
                                    Compiler                             Hardware

Configuration B:   Base code --> Optimization 1 --> Optimization 2 --> Optimization 3  --> Executing code
                                 |_______________________________|     |________________________________|
                                              Compiler                              Hardware

Configuration C:   Base code --> Optimization 1 --> Optimization 2 --> Optimization 3  --> Executing code
                                 |__________________________________________________|      |____________|
                                                       Compiler                               Hardware

It is also possible to segment the hardware optimizations and apply them in a modular way depending on different factors such as latency limits, power consumption, complexity of the code, and so on. This can be understood as a hardware version of the -On flags of a compiler. A basic modularization is shown in the pipeline above, where the whole BO stage is bypassed after a cache miss, but more complex bypasses can be envisioned
        _____________________
       |     ________________|
       |    |      __________|
       |    |     |      ____|
       |    |     |     |    |
       |    |     |     |    V
[IF] [BT] [BO1] [BO2] [BO3] [ID] [EX] [MEM] [WB]
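
A minimal sketch in C of this "hardware -On" idea (the pass names and trigger conditions are made up for illustration; in hardware these would be stage enables and bypass muxes, not function calls):

/* Modular binary-optimizer passes, enabled much like -O1/-O2/-O3 levels.  */
#include <stdio.h>

typedef struct { int level; int in_runahead; } bo_config;

static void bo1_schedule(void) { printf("BO1: basic scheduling\n"); }
static void bo2_unroll(void)   { printf("BO2: unrolling/if-conversion\n"); }
static void bo3_global(void)   { printf("BO3: global optimizations\n"); }

static void optimize(const bo_config *cfg) {
    if (cfg->in_runahead) {          /* bypass: [BT] --> [ID]              */
        printf("BO bypassed (runahead)\n");
        return;
    }
    if (cfg->level >= 1) bo1_schedule();
    if (cfg->level >= 2) bo2_unroll();
    if (cfg->level >= 3) bo3_global();
}

int main(void) {
    bo_config hot_code  = { 3, 0 };  /* spend effort on hot, complex code  */
    bo_config miss_path = { 0, 1 };  /* after a cache miss: skip the BO    */
    optimize(&hot_code);
    optimize(&miss_path);
    return 0;
}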


This core can be thought of as a hybrid between superscalar and VLIW. By modifying the design point or the runtime parameters, the core can behave more like a superscalar or more like a VLIW. E.g. if we remove most of the BO stage and execute the alternative "Software pre-scheduling (optional)" code described above, then the core just works as a compressed VLIW. My goal is to place the design point closer to VLIW than to superscalar.
superscalar <··················[·]········> VLIW