Extreme scale general-purpose processor

Research in progress.

Target:
5nm
>60 TFLOPS
Direct graphics rendering (no API)

Core pipeline:
        ________
       |        |
       |        V
[IF] [BT] [BO] [ID] [EX] [MEM] [WB]
|__|           |__________________|
RISC                  VLIW

BT = Binary Translator
BO = Binary Optimizer

Die configuration:
|CCCC||MMMM||CCCC|
|CCCC||MMMM||CCCC|
|CCCC||MMMM||CCCC|
|CCCC||MMMM||CCCC|

M = memory
C = compute

Memory:
NVRAM···CHIP···NVRAM
Software pre-scheduling (optional)

In the original approach the BT+BO stages extract ILP from an ordinary RISC-like instruction sequence. Additionally, we consider moving part of the hardware work to the compiler. In this alternative design the compiler extracts the ILP and produces abstract pseudo-VLIW bundles of arbitrary instructions, with the bundle boundaries explicitly marked by the compiler with NOPs, as in the following example:
add $r13 = $r3, $r0
sub $r16 = $r6, 3
;;
shl $r13 = $r13, 3
shr $r15 = $r15, 9
ld.w $r14 = 0 [$r4]
;;

;; = NOP marking the end of a bundle
Then a simplified core detects the NOPs and builds the VLIW bundle corresponding to the machine execution model. This simplified core needs essentially only a BT stage; most of the BO stage moves to the compiler, simplifying the hardware even further.
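The bundle-building step of this simplified BT stage can be sketched as follows. This is a minimal illustration, not the actual hardware logic; the function name and the string-based instruction representation are assumptions for the example.

```python
# Sketch: a simplified BT stage grouping a ';;'-delimited,
# compiler-scheduled instruction stream into VLIW bundles.
# Instructions are plain strings here for illustration only.

def build_bundles(stream):
    """Split the stream on the ';;' markers emitted by the compiler."""
    bundles, current = [], []
    for insn in stream:
        if insn == ";;":        # bundle boundary marked by the compiler
            bundles.append(current)
            current = []
        else:
            current.append(insn)
    if current:                 # trailing bundle without a final marker
        bundles.append(current)
    return bundles

# The example stream from the text above:
stream = [
    "add $r13 = $r3, $r0",
    "sub $r16 = $r6, 3",
    ";;",
    "shl $r13 = $r13, 3",
    "shr $r15 = $r15, 9",
    "ld.w $r14 = 0 [$r4]",
    ";;",
]
bundles = build_bundles(stream)
# first bundle: 2 instructions; second bundle: 3 instructions
```

The core then maps each recovered bundle onto the execution slots of the machine model.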

Latency:

The BO stage provides tolerance to instruction execution latencies. Runahead mode is used to tolerate cache-miss latencies; runahead with reuse is being evaluated. The goal is to provide OoO performance with nearly in-order (IO) hardware complexity. During runahead mode the BO stage is shut down and the pipeline works with the bypass [BT] --> [ID]

Branches:

Multipath execution for hard-to-predict branches.
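The idea can be sketched as executing both sides of a low-confidence branch on checkpointed state and discarding the wrong one at resolution. All names here are illustrative; real multipath hardware would checkpoint physical register state, not Python dicts.

```python
# Sketch: multipath execution for a hard-to-predict branch.
# Both paths run speculatively on checkpointed register state;
# the wrong path's state is discarded once the branch resolves.

def multipath(regs, taken_path, fallthrough_path, resolve_taken):
    taken_state = dict(regs)          # checkpoint for each path
    fall_state = dict(regs)
    taken_path(taken_state)           # execute both paths speculatively
    fallthrough_path(fall_state)
    # keep only the architecturally correct state
    return taken_state if resolve_taken else fall_state

regs = {"r1": 5}
result = multipath(
    regs,
    lambda s: s.update(r1=s["r1"] + 1),   # taken path:      r1 += 1
    lambda s: s.update(r1=s["r1"] * 2),   # fallthrough path: r1 *= 2
    resolve_taken=True,
)
# result["r1"] is 6: the taken path's state survives
```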

Execution engine:

Two VLIW configurations are being evaluated: 4-wide and 8-wide.

4-wide: [branch] [int] [mem] [float]

8-wide: [branch] [int] [int] [int] [mem] [mem] [float] [float]
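A quick way to compare the two configurations is to check whether a bundle's instruction mix fits the available slots. The greedy slot-matching below is a simplification (it ignores slot ordering and issue rules); slot and instruction class names are taken from the configurations above.

```python
# Sketch: checking whether a bundle's instruction classes fit
# a VLIW slot configuration. Greedy matching, for illustration.

FOUR_WIDE  = ["branch", "int", "mem", "float"]
EIGHT_WIDE = ["branch", "int", "int", "int",
              "mem", "mem", "float", "float"]

def fits(bundle_classes, slots):
    """Each instruction consumes one free slot of its class."""
    free = list(slots)
    for cls in bundle_classes:
        if cls in free:
            free.remove(cls)
        else:
            return False
    return True

# A bundle with two int ops fits the 8-wide but not the 4-wide config.
```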



Comparison with other approaches:
             RISC --> VLIW         CISC --> VLIW         VLIW --> VLIW
-----------------------------------------------------------------------------
Transmeta                          Static (software)     Dynamic (software)
Denver       Dynamic (hardware)                          Dynamic (software)
This         Dynamic (hardware)                          Dynamic (hardware)


Comparison with superscalar:

This approach has two advantages over superscalar: efficiency and modularity.

The VLIW part of the pipeline is much simpler than a superscalar pipeline of the same width; roughly one half to one third of the complexity of the superscalar approach. Even the decode stage of a VLIW is simpler than the decode stage of a superscalar. The fetch stages are similar, and the binary translation stage in the new design is rather simple; all the complexity is concentrated in the binary optimizer stage, but the new design allows a modular approach.

The ILP and OoO logic of a superscalar core work on uops, whereas the binary optimizer in the new design works on the target ISA instructions. This means that the optimizer has a synergy with the compiler: it is possible to move optimizations from the compiler to the core and back, finding the optimal hardware/software split, unlike in a superscalar approach, where the compiler and the superscalar logic are decoupled.
Configuration A:   Base code --> Optimization 1 --> Optimization 2 --> Optimization 3  --> Executing code
                                 |____________|       |_________________________________________________|
                                    Compiler                             Hardware

Configuration B:   Base code --> Optimization 1 --> Optimization 2 --> Optimization 3  --> Executing code
                                 |_______________________________|     |________________________________|
                                              Compiler                              Hardware

Configuration C:   Base code --> Optimization 1 --> Optimization 2 --> Optimization 3  --> Executing code
                                 |__________________________________________________|      |____________|
                                                       Compiler                               Hardware

It is also possible to segment the hardware optimizations and apply them in a modular way depending on different factors such as latency limits, power consumption, code complexity, and so on. This can be understood as a hardware version of the -On flags of a compiler. A basic modularization is shown in the pipeline above, where the whole BO stage is bypassed after a cache miss, but more complex bypasses can be envisioned
        _____________________
       |     ________________|
       |    |      __________|
       |    |     |      ____|
       |    |     |     |    |
       |    |     |     |    V
[IF] [BT] [BO1] [BO2] [BO3] [ID] [EX] [MEM] [WB]
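The modular bypass logic can be sketched as selecting which BO sub-stages stay active, analogous to choosing a compiler -On level at runtime. The condition names and the stage-selection function are illustrative assumptions, not the actual control policy.

```python
# Sketch: selecting active BO sub-stages, a hardware analogue
# of compiler -On flags. Conditions and names are illustrative.

def active_stages(latency_critical, low_power, complex_code):
    stages = ["IF", "BT"]
    if not latency_critical:
        stages.append("BO1")            # cheapest optimizations first
        if not low_power:
            stages.append("BO2")
            if complex_code:
                stages.append("BO3")    # most aggressive, most expensive
    stages += ["ID", "EX", "MEM", "WB"]
    return stages

# Latency-critical code takes the full bypass, like the runahead case:
# active_stages(True, False, False) skips BO1..BO3 entirely.
```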


This core can be thought of as a hybrid between a superscalar and a VLIW. By modifying the design point or the runtime parameters, the core can behave more like a superscalar or more like a VLIW. E.g., if we remove most of the BO stage and execute the alternative code described above under "Software pre-scheduling (optional)", the core works simply as a compressed VLIW. My goal is to place the design point closer to VLIW than to superscalar.
superscalar <··················[·]········> VLIW