AMD RyZen predictions

It is that epoch of the year where I compare my predictions with reality. It worked well in the past with typical errors in the single digit percent. Let us check if the trend continues for the predictions I did for RyZen, despite this is a brand new microarchitecture instead an evolution of an existent microarchitecture. Note that most of my predictions are from years 2104 and 2015.

You can find lots of false claims about my predictions in the Internet; you can find even a funny guy that pretends that I had said that Zen could not pass the 2GHz. I will do some remarks about those claims below, meanwhile you can find a summary of my predictions in this post (older forum has gone). A 2016 update on my prediction of 8-core die size can be found in this other post (older forum has gone).

I predicted that Ryzen --then known as Zen-- would be made in 14LPP process at Globalfoundries. I predicted Zen was going to be a small core with a size of "~5mm²" (without L2) on the 14LPP process. This was confirmed by AMD at ISSCC talk this month. Zen measures 5.5mm². It is worth to mention that the symbol "~" means "around" or "approximately equal to". Effectively, Zen is a small core because, as mentioned by Fottemberg in this post (older forum has gone): "Usually, a Small Core is a core about 5-10 mm2 or smaller, and a Big Core is a core about 20 mm2 o bigger". The prediction of ~205mm² for the die size was also rather accurate, within a 4% of error, because the die measures 212.97mm² according to Hiroshige Goto.

My SMT2 prediction was also confirmed. Zen has 2-way SMT for multithreading. I also predicted that SMT yields on Zen would be bigger than for Intel because the Zen microarchitecture is more distributed: separate integer and FP clusters, more execution pipes, and non-unified scheduling. This is not about having a better SMT implementation than Intel, but about Intel microarchitecture extracting more ILP from a single thread and, thus, leaving less empty execution slots in the core ready for a second thread. I even offered the next example; it is oversimplified but enough to get my point. Imagine two cores A and B, both are 6-wide and both have the same SMT implementation; however, core A has deeper SS/OoO logic and can sustain an ILP of 4, whereas core B only can sustain an ILP of 3.

When executing a single thread the core A has more throughput --four vs three for core B--. because it can execute more instructions per cycle. But when executing a second thread, both cores have the same throughout of six instructions per cycle because a second thread can fill the unused execution resources on both cores.

Other predictions for the core like being 6-wide and 2x128 FMAC units were confirmed. However, I predicted (3 ALU + 3 AGU) for the integer part and Zen is (4 ALU + 2 AGU). As mentioned once by David Kanter in RWT my choice was better:
3 AGU + 3 ALU is a much better mix. Remember that x86 is load+op, so generally you want to sustain nearly a 1:1 ratio of memory to ALU operations.
The prediction of 8-core dies for the CPUs of the AM4 socket, with higher core count in servers obtained with combinations of the base die is also confirmed. So far like I know I am the only one that predicted a four-die configuration --8 core x 4 = 32 core-- for the top Naples chip. As you can check in the above links, I even offered the next illustration

And this is a recent shot of the Naples CPU, now renamed to EPYC,

I also predicted a separate die for the AM4 APUs with four-cores and iGPU, and mentioned that AMD could be offering two different lines of four-core CPUs for this socket, one line from the 8-core die with half the cores disabled and another line from the APUs with the iGPU disabled; something similar to what AMD does now with the FX-4000 series and the Athlon series.

I also predicted Zen would come in four-core clusters. This is what AMD names the CCX. My proposal was based in a hypothesis about AMD wanting to increase the minimal core count to four plus a hypothesis about AMD reusing the cluster for the APUs and for the semicustom division to reduce the design costs with a modular approach. My prediction was that Zen would come in groups of CCX with SMT disabled or not. For instance 4c/4t for the lowest Ryzen CPU and then 4c/8t and 8c/8t for the intermediate models, and 8c/16t for the flagship AM4 socket model. With servers coming in combinations of 8c/16t and 16c/32t for the dual die socket (SP4), and 16c/32t and 32c/64t for the quad die socket (SP3).

With this hypothesis in mind, I predicted a six-core would not exist. The R5 model would be 8c/8t, similar to how Intel i5s are 4c/4t in the desktop. Even some new sites reported early rumors about six-core Ryzen not existing, but this turned to be a error. Six core Ryzen exist! This is a bit weird. Why do Ryzen is designed about four-core clusters if you are going to disable cores within each CCX individually? We know now that the RyZen performance varies depending on what cores are disabled and what cores are active, because the weird design introduces a last level cache partitioned in blocks with different access latencies. For instance, a 4+0 chip --that is one CCX disabled in the die-- is not the same that a 3+1 chip or a 2+2 chip --two cores disabled per CCX--. Finally AMd took the decision of selling all the CPUs with symmetric configurations: 3+3 for the six-core chips and 2+2 for the quad-core chips. This even adds more weirdness to the CCX design choice, because independently of what cores are damaged in the die, AMD always has to pair of cores in both CCX in the die.

My prediction of IPC was "~50% IPC over Piledriver on scalar code. ~80% IPC on SIMD code". This is easy to understand. I predicted 2x128 FMA units for Ryzen; Piledriver has a 2x128 FMA unit per module, which accounts to 128bit per core. Thus Piledriver is a 8FLOP/core design, whereas Zen is 16FLOP/core. This is the maximum throughput. Average performance is less than the maximum, because not all the resources are duplicated. Recall that floating point codes tend to stress much more the memory subsystem, up to the point that many supercomputers are able to hit high peaks on HPC applications but their sustained performance is much less. This is why I predicted the floating point IPC in Zen would be about 80% better despite having twice more FP peak performance. About integer, Piledriver has (2 ALU + 2 AGU) per core. Recall that my hypothesis was that Ryzen would be (3 ALU + 3 AGU), which accounts for 50% more integer+memory resources than Piledriver. This was the basis for my "~50% IPC" claim where, evidently, I assumed that rest of resources (front-end, caches,...) would scale up conveniently to feed the extra ALUs and AGUs.

We know now that Ryzen is (4 ALU + 2 AGU), this means that the peak integer performance is better because Ryzen can execute up to four integer operations per cycle instead the three operations I had envisioned; but, on the opposite side, Ryzen has one AGU less than I expected, which means cannot feed the ALUs so well. The higher peak performance is compensated by the lower sustained performance and the net result is that the performance of Ryzen is very close to what I expected: 50% and 80%. Indeed, AMD has pretty much confirmed my IPC predictions, with official claims that Zen is 52% faster than Piledriver on SPECint and 76% faster than Piledriver on Cinebench, both clock-for-clock.

My estimations suggested strongly, as I put it dozens of times in many forums that
Zen IPC ~ Sandy Bridge
Zen IPC+SMT ~ Haswell
Reviews have found that on average Ryzen IPC = 1.05 Sandy Bridge and IPC+SMT = 0.90 Haswell. Effectively, clock for clock, Broadwell is about 10% ahead of Ryzen on applications

and about 20% ahead on games

Let us check now clocks. I finally predicted 3.0GHz base and 3.5GHz turbo for a 95W 8-core Ryzen CPU. And 3.4GHz base and 3.9GHz turbo for a 65W 4-core CPU. Of course, the predictions are for the faster models that fit within the thermal envelope. The more astute readers can check I was assuming quadratic dependence in my computations. Let me remark that so early as 19/08/2014 I was mentioning "3.0--3.5GHz" for Zen, which makes even more ridiculous some claims that one can find in the Internet. The truth is that the top 65W 4-core model is the R5 1500X, which has 3.5GHz and 3.9GHz frequencies. The prediction was very accurate with errors of 3% and 0% respectively.

The things are completely different for the 95W 8-core chips. The top model R7-1800X has 3.6GHz and 4.1GHz frequencies, which is a noticeable departure from the expected values. How can the same prediction be accurate for the 4-core chips but fail for 8-core chips, when both use the same die and the same process node? We finally have an answer. If we take the R5-1500X as baseline, we can obtain the next approximated estimation of frequencies for a 8-core 95W chip

sqrt[ 95W / (2 x 65W) ] * 3.5GHz = 2.99GHz

sqrt[ 95W / (2 x 65W) ] * 3.9GHz = 3.33GHz

The only possibility to get higher clocks is if we increase the TPD beyond the marketing value of 95W. Indeed, if we invert the above formulas and use as input the frequencies of the R7-1800X, we can estimate the real TDP of this 8-core chip. The result is 141W as an average of base and turbo estimations. Effectively, Ryzen reviews have demonstrated that the 95W is a marketing label and that the 1800X can dissipate even more power than that 125W-rated FX-8350 Piledriver chip; for instance, HFR measured 125W for the FX-8350 and 128.9W for the R7-1800X, which can be rounded to 129W. The 12W discrepancy between the measured 129W and the 141W estimated above can be easily explained by the fact that the number of active transistors in the R5-1500X is not strictly one-half those in the R7-1800X, by binning (the R5 models use defective dies), by small departures from a strict quadratic dependence between frequency and power.

Reviews have also found that the 65W rating for the R7 1700 is another marketing label; As CanardPC noticed "Le 1700 tire 90W en réalité. AMD bullshit son TDP." Luckily for us, AMD finally reported the real TDPs for the 8-core chips:
What are the TDPs, within the meaning of the consumption limit and therefore the maximum number of watts to be dissipated, of the Ryzen? AMD also communicates this value, but less markedly: 128 watts for the 1800X / 1700X, and 90 watts for the 1700. These are the values that are most comparable with the TDP communicated by Intel.

This explains very well the discrepancies on the predictions for the 8-core chips. Effectively the prediction of 3.0GHz base and 3.5GHz turbo for a 95W 8-core Ryzen CPU fits nicely with the clocks of the CPU with a real TDP of 90W: the R7-1700 model. Adjusting stock frequencies for 95W we obtain a prediction error of 6%, which is excellent.

Finally, let us return the fact some people is reporting false claims about my predictions in the Internet. For instance, they pretend that I said that the IPC of Ryzen would be exactly Sandy Bridge, ignoring that I always used the symbol "~". Contrary to what you would think in a first moment, their pretension does not follow from raw ignorance about the meaning of the symbol, because I explained them what the symbol means. The real reason is that those pretended 'experts' use ad hominem tactics to try to hide how wrong were their own predictions. Indeed, if you research a bit you will find those pretended 'experts' predicted CMT, big core (~10mm²), quad-channel for AM4, 2x256 bit or 3x256 bit FMA units, a range of 4.0--4.5GHz base clock 8-core 95W models and a superenthusiast 5GHz 8-core stock model, TSMC 16nm process, DDR3 optional, 12-cores or more for AM4, 2016 launch, a separate 16-core die for servers, 8-core 3.7GHz/4.1GHz 65W models, that Zen was not a SoC, IPC well above BDW and on pair with Kabylake, optional L3, less than 150mm² size for the 8-core die, and several dozens more of failed predictions.

Their failed predictions are not reduced to Zen. Those are the same 'experts' that predicted 3.2GHz or higher base clocks for the new Sony console on 14nm, HBM on Carrizo APU, 16CUs on the Bristol Ridge APU, that Nintendo Switch sales would be a flop, that no one would use the ARM ISA for servers/HPC, and so on and so on. I am still awaiting for them to admit a single error; instead they continue attacking people that did it right. It is also worth to mention that this same people has a long record of being permanently banned from lots of tech sites.