AMD RyZen predictions

It is that time of the year when I compare my predictions with reality. This has worked well in the past, with typical errors in the single-digit percent range. Let us check whether the trend continues for the predictions I made for RyZen, even though it is a brand-new microarchitecture rather than an evolution of an existing one. Note that most of my predictions date from 2014 and 2015.

I predicted that RyZen (then known as Zen) would be made on the 14LPP process at Globalfoundries. I predicted Zen was going to be a small core with a size of "~5mm²" (without L2) on the 14LPP process. AMD confirmed this in its ISSCC talk this month: Zen measures 5.5mm². It is worth mentioning that the symbol "~" means "around" or "approximately equal to".

My SMT2 prediction was also confirmed. Zen has 2-way SMT for multithreading. I also predicted that the SMT gains on Zen would be larger than on Intel because the Zen microarchitecture is more distributed: separate integer and FP clusters, more execution pipes, and non-unified scheduling. This is not about having a better SMT implementation than Intel, but about the Intel microarchitecture extracting more ILP from a single thread and, thus, leaving fewer empty slots in the core for a second thread. I even offered the following example; it is oversimplified but enough to make my point. Imagine two cores A and B. Both are 6-wide and both have the same SMT implementation; however, core A has deeper SS/OoO logic and can sustain an ILP of 4, whereas core B can only sustain an ILP of 3.

When executing a single thread, core A has higher throughput (four instructions per cycle versus three for core B). But when executing a second thread, both cores reach the same throughput of six instructions per cycle, because the second thread can fill the unused execution resources on both cores. The SMT gain is therefore larger on core B (from three to six) than on core A (from four to six).
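The toy example above can be sketched in a few lines of Python. This is only an illustration of the oversimplified model, not of how a real core schedules threads: the function name `throughput` and the rule "sum of per-thread ILP, capped by the issue width" are assumptions made for this sketch.

```python
def throughput(width, ilp_per_thread, threads):
    """Instructions per cycle for a toy core: each thread contributes up to
    its sustainable ILP, and the total is capped by the issue width."""
    return min(width, ilp_per_thread * threads)

WIDTH = 6                  # both hypothetical cores are 6-wide
cores = {"A": 4, "B": 3}   # sustainable single-thread ILP of each core

smt_gain = {}
for name, ilp in cores.items():
    one = throughput(WIDTH, ilp, 1)
    two = throughput(WIDTH, ilp, 2)
    smt_gain[name] = (two - one) / one  # relative gain from the second thread
    print(f"core {name}: 1 thread = {one}, 2 threads = {two}, "
          f"SMT gain = {smt_gain[name]:.0%}")
```

Under these assumptions core A gains 50% from the second thread and core B gains 100%, even though both use the same SMT implementation.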

Other predictions for the core, such as the 6-wide design and the 2x128 FMAC units, were confirmed. However, I predicted (3 ALU + 3 AGU) for the integer part and Zen is (4 ALU + 2 AGU). As David Kanter once mentioned, my choice was better:
"3 AGU + 3 ALU is a much better mix. Remember that x86 is load+op, so generally you want to sustain nearly a 1:1 ratio of memory to ALU operations."

8-core dies for the CPUs of the AM4 socket, with higher core counts in servers obtained by combining the base die, are also confirmed. As far as I know, I am the only one who predicted a four-die configuration (8 cores x 4 = 32 cores) for the top Naples chip. I even offered an illustration.

I predicted a separate die for the AM4 APUs, with four cores and an iGPU. I also mentioned that AMD could offer two different lines of four-core CPUs for this socket: one line from the 8-core die with half the cores disabled, and another line from the APUs with the iGPU disabled; something similar to what AMD does now with the FX-4000 series and the Athlon series.

I also predicted Zen would come in four-core clusters. This is what AMD names the CCX. My proposal was based on a hypothesis about AMD wanting to increase the minimum core count to four, and a hypothesis about reusing the cluster for the APUs and for the semicustom division to reduce design costs with a modular approach. My prediction was that Zen would come in groups of CCXs, with SMT enabled or disabled. For instance, 4c/4t for the lowest Ryzen CPU, then 4c/8t and 8c/8t for the intermediate models, and 8c/16t for the flagship AM4 socket model. Servers would come in combinations of 8c/16t and 16c/32t for the dual-die socket (SP4), and 16c/32t and 32c/64t for the quad-die socket (SP3).
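The predicted lineup follows from simple arithmetic on whole CCXs, which can be sketched as below. The function `config` is a hypothetical helper, and the idea that the smaller server parts use one enabled CCX per die is an assumption of this sketch; the text above only lists the resulting core and thread counts.

```python
CORES_PER_CCX = 4  # the four-core cluster that AMD names the CCX

def config(dies, ccx_per_die, smt):
    """Return a 'Nc/Mt' label for a package built only from whole CCXs."""
    cores = dies * ccx_per_die * CORES_PER_CCX
    threads = cores * (2 if smt else 1)
    return f"{cores}c/{threads}t"

# AM4: one die, one or two CCXs enabled, SMT disabled or enabled.
am4 = [config(1, ccx, smt) for ccx in (1, 2) for smt in (False, True)]
# Servers: SMT enabled, one or two CCXs enabled per die (assumption).
sp4 = [config(2, ccx, True) for ccx in (1, 2)]  # dual-die socket
sp3 = [config(4, ccx, True) for ccx in (1, 2)]  # quad-die socket

print("AM4:", am4)
print("SP4:", sp4)
print("SP3:", sp3)
```

Note that no combination of whole CCXs gives a six-core part; that would require disabling individual cores inside a CCX.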

With this hypothesis in mind I predicted that a six-core would not exist. The R5 model would be 8c/8t, similar to how Intel i5 chips are 4c/4t on the desktop. Some news sites even reported rumors that six-core Ryzen would not exist, but this turned out to be an error. Six-core Ryzen chips exist. This is weird. Why is Ryzen designed around four-core clusters if cores are going to be disabled individually? Apart from the last-level cache being partitioned with different access latencies, we know that performance varies depending on which cores are disabled and which are not. For instance, not all four-core CPUs are the same: a 4+0 chip (one CCX disabled) is not the same as a 3+1 chip or a 2+2 chip (two cores disabled per CCX). There is even a rumor that 3+1 is not going to reach retail, because of the imbalance created by having three cores on one side and one core on the other.
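To see how many distinct layouts a given core count allows, here is a small sketch that enumerates the ways of spreading the active cores over the two CCXs of the 8-core die. The helper `ccx_splits` is hypothetical, and the sketch treats mirror images such as 3+1 and 1+3 as the same layout, which is an assumption.

```python
def ccx_splits(active_cores, cores_per_ccx=4):
    """All ways to spread `active_cores` over two CCXs, written as (a, b)
    with a >= b, so mirror images such as 3+1 and 1+3 are counted once."""
    splits = []
    for a in range(cores_per_ccx, -1, -1):
        b = active_cores - a
        if 0 <= b <= a:
            splits.append((a, b))
    return splits

for n in (4, 6, 8):
    layouts = ["+".join(map(str, s)) for s in ccx_splits(n)]
    print(f"{n} cores: {layouts}")
```

For four cores this gives the 4+0, 3+1 and 2+2 layouts discussed above; for six cores it gives 4+2 and 3+3.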

My prediction of IPC was "~50% IPC over Piledriver on scalar code. ~80% IPC on SIMD code". This is easy to understand. I predicted 2x128 FMA units for Ryzen. Piledriver has 2x128 FMA units per module, which amounts to 128 bits per core. Thus Piledriver is an 8 FLOP/core design whereas Zen is 16 FLOP/core. This is the maximum throughput. Average performance is lower because not all the resources are duplicated. For instance, Haswell has twice the maximum throughput of Sandy/Ivy Bridge, but real performance is about 70% or 80% better on code that can use the new 256-bit SIMD units on Haswell. Recall that floating-point code tends to stress the memory subsystem much more, to the point that many supercomputers reach high peak figures on HPC applications but sustain much lower performance. This is why I predicted that the floating-point IPC of Zen would be about 80% better despite having twice the FP peak. As for integer, Piledriver has (2 ALU + 2 AGU) per core. My hypothesis was that Ryzen would be (3 ALU + 3 AGU), which amounts to 50% more integer resources. This is the basis for my "~50% IPC" claim where, evidently, I assumed that the rest of the resources (front-end, caches, ...) would scale up accordingly to feed the extra ALUs and AGUs.
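The peak figures above can be reproduced with a short calculation. This sketch assumes single-precision (32-bit) lanes, counts one fused multiply-add as two floating-point operations, and reads the FLOP figures as per cycle; the names `peak_flop_per_core` and `int_gain` are introduced only for illustration.

```python
FLOP_PER_FMA = 2  # one fused multiply-add counts as two floating-point operations
LANE_BITS = 32    # assumption: single-precision lanes

def peak_flop_per_core(simd_bits_per_core):
    """Peak FLOP per cycle per core for FMA hardware of the given total width."""
    return simd_bits_per_core // LANE_BITS * FLOP_PER_FMA

# Piledriver: 2x128-bit FMA per module, shared by the two cores of the module.
piledriver = peak_flop_per_core(2 * 128 // 2)
# Zen: 2x128-bit FMA per core.
zen = peak_flop_per_core(2 * 128)

# Integer side: ALUs + AGUs per core, Piledriver versus the predicted Zen.
piledriver_int = 2 + 2
predicted_zen_int = 3 + 3
int_gain = predicted_zen_int / piledriver_int - 1

print(f"Piledriver: {piledriver} FLOP/core, Zen: {zen} FLOP/core")
print(f"Predicted integer resources: +{int_gain:.0%}")
```

The factor of two in peak FLOP is an upper bound, which is why the SIMD prediction was set lower, at about 80%.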

We know that Ryzen is (4 ALU + 2 AGU). This means that peak integer performance is better, because Ryzen can execute up to four integer operations per cycle instead of three; on the other hand, Ryzen has one AGU fewer than expected, which means it cannot feed the ALUs as well as I expected. Interestingly, I even offered a tentative explanation of why Ryzen has only two AGUs per core: I said then that it could be related to some problem with the cache subsystem. Adding a third AGU only consumes power and die area if the cache cannot sustain loads and stores for all the AGUs. It is also interesting that there are persistent rumors about Ryzen having a problem with the caches or with the memory controller. Whatever the reason for the unusual 4+2 design, the point is that the performance is similar to what I expected. Indeed, AMD has pretty much confirmed my IPC predictions, with official claims that Zen is 52% faster than Piledriver on SPECint and 76% faster than Piledriver on CB15, both clock-for-clock.
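For readers who want to put a number on the agreement, the sketch below compares the predicted gains with AMD's claims. Pairing the scalar prediction with SPECint and the SIMD prediction with CB15, and measuring the error on the performance ratio (1 + gain), are assumptions of this sketch rather than a method stated above.

```python
predictions = {
    # workload: (predicted gain over Piledriver, gain claimed by AMD)
    "scalar (SPECint)": (0.50, 0.52),
    "SIMD (CB15)": (0.80, 0.76),
}

def relative_error(predicted_gain, claimed_gain):
    """Relative error of the predicted performance ratio (1 + gain)."""
    predicted = 1 + predicted_gain
    claimed = 1 + claimed_gain
    return abs(predicted - claimed) / claimed

errors = {name: relative_error(p, c) for name, (p, c) in predictions.items()}
for name, err in errors.items():
    print(f"{name}: {err:.1%}")
```

With this definition both errors come out at roughly 1% to 2%, well within the single-digit range.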

Finally, let us check clocks. My final prediction was 3.0GHz base and 3.5GHz turbo for a 95W 8c Ryzen CPU, and 3.4GHz base and 3.9GHz turbo for a 65W 4c CPU. Anyone can check that I was assuming a quadratic dependence on frequency. Let me remark that as early as 19/08/2014 I was mentioning "3.0--3.5GHz" for Zen.

If the leaks are correct, the prediction for the 4c chips was accurate. The prediction for the 8c chips looks wrong. However, it is still too early to compute the error, because there are persistent rumors that the 95W rating is a marketing label and that the real TDP is much higher. There are also persistent rumors about AMD cherry-picking silicon for the top 8c models. We have to wait.