Next article draft

* It seems you use the compiler to extract ILP beyond basic blocks (trace level? hyperblocks?) and generate variable-length bundles. I assume you use stop bits to mark the end of each VLIW bundle. How does this differ from a compressed VLIW?

* What approach do you use to handle hard-to-predict branches?

* Do bundles correspond to predefined VLIW slots for a fixed machine, i.e. [Branch] [Compare] [MADD] [MADD] [LOAD] ···, or are they abstract bundles of independent microops, i.e. [uop1] [uop2] [uop3] [uop4] [uop5] ···?

* "Sustained up to 8 microops" seems to be the maximum ILP. What is the sustained ILP for general code such as the SPEC suite: ILP ~ 1.72?

* The core seems to have a non-stalling in-order pipeline. The mention of poison bits seems to indicate some kind of runahead mode. Can you confirm that you use runahead during cache stalls? Are all instructions executed in runahead mode re-executed in normal mode, or only the poisoned ones?

* One slide claims a combined L2+L3 size of 32MB, but another slide claims 256KB+512KB per core. So I think 32MB is only the L3 size. Are you using an exclusive cache policy?

* What does "very short wires mitigating the slow wires problem" mean? Reducing the number of long wires?

* You claim that Itanium was stalled 50% of the time and that Prodigy stalls less than 20% of the time. However, one slide shows Prodigy unstalled 69% of the time, which means it is stalled 31% of the time. You also claim that a normal OoO core stalls 15% of the time. I find this value hard to accept: modern state-of-the-art OoO cores are stalled most of the time.

* Do the cores have boost policies, or do they run at a fixed 4GHz?
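
To make the stop-bit question concrete, here is what I mean by variable-length bundles delimited by stop bits. This is only a sketch of the generic technique, not a claim about the actual Prodigy encoding: the `(mnemonic, stop_bit)` tuple format, the function name `split_bundles`, and the sample stream are all my own assumptions.

```python
from typing import List, Tuple

# Hypothetical encoding: each operation carries a mnemonic and a stop bit.
# A set stop bit means "this operation is the last one in its bundle".
Op = Tuple[str, bool]


def split_bundles(stream: List[Op]) -> List[List[str]]:
    """Group a flat operation stream into variable-length bundles."""
    bundles: List[List[str]] = []
    current: List[str] = []
    for mnemonic, stop in stream:
        current.append(mnemonic)
        if stop:
            # The stop bit closes the bundle; the next op starts a new one.
            bundles.append(current)
            current = []
    if current:
        # Trailing ops without a final stop bit form an unterminated bundle.
        bundles.append(current)
    return bundles


# Made-up stream: a 3-wide bundle, a 1-wide bundle, then a 2-wide bundle.
stream = [
    ("load", False), ("madd", False), ("cmp", True),
    ("branch", True),
    ("madd", False), ("madd", True),
]
bundles = split_bundles(stream)
```

With this kind of encoding no slot is wasted on NOPs, which is why I ask how it differs from a compressed VLIW, where a fixed-slot bundle is stored without its NOPs and expanded again at fetch or decode.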