### Basic remarks on parallel computing

Parallel computing is a type of computation in which two or more computations are executed at the same time. The concept is well established in computer science and engineering, but there are many misunderstandings among the general public in forums, blogs, social media, and the comments sections of news sites.

I very often see people who believe that multithreading is the only type of parallel computing, and who are completely unaware that a twelve-core CPU can be faster than a thirty-two-core CPU if the twelve-core CPU has four times better SIMD hardware. I also very often see people who believe that everything can be made parallel; this belief is accompanied by unfair accusations that game developers are "lazy", as the reason why games do not scale to thirty-two cores today. A similar accusation is made against CPU engineers, with people thinking that x86 CPUs have increased IPC by only about 5% during the last generations because engineers "rested on their laurels". Such accusations are followed by the unfounded belief that game developers and CPU engineers will cease to be lazy and that soon we will see games that require thirty-two cores and new CPUs with 30% higher IPC.

Most of those myths and misunderstandings are caused by a lack of familiarity with the hardware and software aspects of computation, but sometimes companies play a key role in misleading the general public. For instance, Nvidia likes to share graphs showing how GPU performance grows faster than CPU performance, and some people have taken this to claim that "soon GPUs will replace CPUs".

Parallel computing is a broad and complex topic. Each one of the topics discussed here could fill an entire book, so I will give only a very basic introduction. I hope this article provides the background needed to fight the myths that spread in forums, blogs, social media, and the comments sections of news sites.

#### There are several different forms of parallel computing: bit-level, instruction-level, data-level, and task-level

I briefly comment on each of these forms of parallelism.

Bit-level

A form of parallel computing based on increasing the processor word size. Increasing the word size reduces the number of instructions the processor must execute to perform an operation on variables whose sizes are greater than the length of the word. For example, a 32-bit processor can add two 32-bit integers with a single instruction, whereas a 16-bit processor requires two instructions to complete the same operation: it must first add the 16 lower-order bits of each integer, then add the 16 higher-order bits. Each 32-bit integer

10110101 10110101 10110101 10110101


is split as

10110101 10110101
10110101 10110101


The advantage of bit-level parallelism is that it is independent of the application, because it happens at the processor level. The programmer writes the operation and the hardware executes it in a single step or in several steps, depending on the hardware capabilities.
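As an illustration, the two-step addition that a 16-bit processor must perform can be sketched in Python (the function name and masks are mine, chosen for the example):

```python
def add32_on_16bit_alu(x, y):
    """Add two 32-bit integers using only 16-bit additions,
    mimicking the two steps a 16-bit processor must take."""
    MASK16 = 0xFFFF
    # Step 1: add the 16 lower-order bits of each integer.
    low = (x & MASK16) + (y & MASK16)
    carry = low >> 16                     # carry out of the low half
    # Step 2: add the 16 higher-order bits, plus the carry.
    high = (x >> 16) + (y >> 16) + carry
    # Reassemble the 32-bit result from the two halves.
    return ((high & MASK16) << 16) | (low & MASK16)
```

A 32-bit processor performs the same addition in one instruction; the carry propagation between the two halves is what forces the narrower processor to serialize the work.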

Instruction-level

The ability of executing two or more instructions at the same time. Consider the arithmetic operations

a = a + 10
b = m + 3


Since the second operation does not depend on the result of the first, both operations can be executed in parallel

a = a + 10;            b = m + 3


halving the execution time.

Data-level

Data parallelism is the processing of multiple data items at the same time by applying the same operation to all of them. Data parallelism is implemented in SIMD (Single Instruction, Multiple Data) architectures.

Suppose we want to move a series of objects a fixed distance along the z axis; this is equivalent to adding the distance to the z coordinate of each object

z1 = z1 + 61
z2 = z2 + 61
z3 = z3 + 61
z4 = z4 + 61
z5 = z5 + 61
z6 = z6 + 61
z7 = z7 + 61
z8 = z8 + 61


In a 4-way SIMD architecture, the operation can be applied to four objects at once, reducing the number of operations by a factor of four. First the coordinates are grouped into 4-wide vectors, then vector additions are executed

(z1, z2, z3, z4) = (z1, z2, z3, z4) + (61,61,61,61)
(z5, z6, z7, z8) = (z5, z6, z7, z8) + (61,61,61,61)


The wider the SIMD architecture, the greater the reduction in execution time. An 8-way architecture could do all eight additions in a single step.
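A minimal sketch of the idea, simulating the vector width in plain Python (the function and its step counter are mine, for illustration; real SIMD hardware executes each group in a single instruction):

```python
def simd_add_scalar(values, scalar, width=4):
    """Simulate a width-way SIMD unit: each pass of the loop stands
    for one vector instruction adding `scalar` to `width` lanes."""
    result, vector_ops = [], 0
    for i in range(0, len(values), width):
        lane = values[i:i + width]                 # load a vector register
        result.extend(v + scalar for v in lane)    # one vector addition
        vector_ops += 1
    return result, vector_ops
```

With the eight z coordinates above, `width=4` takes two vector operations, while `width=8` takes only one, matching the text.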

Task-level

Task parallelism is the form of parallelism in which tasks are divided among the processors to be executed simultaneously. Thread-level parallelism is when an application runs multiple threads at once.

For ordinary parallelization, a programmer or the compiler analyzes the instructions in a serial stream, finds control and data dependences, partitions the original stream into almost independent substreams (tasks), and inserts the necessary synchronization among tasks.

#### There are limits to the amount of parallelism

Not everything can be parallelized. Consider the quadratic equation $$ax^2 + bx + c = 0$$ whose solutions are $$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a}, \qquad x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}$$ The elementary operations needed are

n1 = b * b
n2 = 4 * a * c
n3 = n1 - n2
n4 = SQRT(n3)
n5 = -b + n4
n6 = -b - n4
n7 = 2 * a
x1 = n5 / n7
x2 = n6 / n7


Some of those operations are independent, but others are not. For instance, we cannot perform the subtraction n3 without first knowing the values n1 and n2, and we cannot perform the divisions x1 and x2 without first knowing the value of the denominator n7. The maximum achievable parallelism is

n1 = b * b;            n2 = 4 * a * c;            n7 = 2 * a
n3 = n1 - n2
n4 = SQRT(n3)
n5 = -b + n4;          n6 = -b - n4
x1 = n5 / n7;          x2 = n6 / n7


It must be clear that the limited parallelism illustrated by this simple example is not a consequence of lazy programming. The problem cannot be parallelized further because of the data dependences among the operations.
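The five dependency levels above can be written directly as code; operations grouped on one line are mutually independent and could run in parallel (the helper name is mine):

```python
import math

def quadratic_roots(a, b, c):
    """Evaluate the quadratic formula grouped into the five dependency
    levels shown above; each level must wait for the previous one."""
    n1, n2, n7 = b * b, 4 * a * c, 2 * a   # level 1: three independent ops
    n3 = n1 - n2                           # level 2: needs n1 and n2
    n4 = math.sqrt(n3)                     # level 3: needs n3
    n5, n6 = -b + n4, -b - n4              # level 4: two independent ops
    return n5 / n7, n6 / n7                # level 5: two independent divisions
```

However wide the machine, this computation needs at least five sequential steps; extra execution units beyond three per level would sit idle.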

Even for problems that can be parallelized, programmers have to confront hardware and software limits. For instance, consider the sequence of instructions illustrated on the left-hand side of the next figure. The programmer has partitioned the code into an initialization subtask (1), a call to a procedure (2) that computes a function f depending on a variable Z, and a finalization subtask (3) that receives the value of the function. When executed sequentially, the whole task takes a certain time to finish.

The three subtasks cannot start at the same time, because the procedure depends on variables set by subtask (1), and the finalization subtask (3) depends on the value of f computed by (2). Some readers might say that since we know the value of the variable Z, we could simply hard-code that value at the beginning of the procedure and thus accelerate its execution. Well, yes, you can do that if the value of the variable is known at compile time; but if the value is only known at run time --e.g. from user input or from data transmitted over the wire-- then the variable Z has to be computed by the initialization subtask before its value is passed as a parameter to the procedure.

To parallelize the execution of the procedure, we have to insert extra code into subtask (1) in the parent thread. This fork code creates a new thread and passes the needed parameters to the procedure. Once the procedure has executed, extra code added to the new thread takes the result of the evaluation of the function f and joins with the thread that called the procedure, which continues with the execution of subtask (3).

The total execution time is now smaller than in the original sequential execution. Parallelization has sped up execution. However, note that there is now more code to execute, because we have added the fork and join sections. This is the overhead of parallelization: extra code that is not present in the original sequential algorithm but is needed to synchronize and communicate among the different threads of a parallel algorithm. In the above example I assumed that there is enough overlap between subtasks (1) and (2), that there is only one procedure, and that the fork and join overheads are small compared to the subtask computation. However, things can be different, and the cost of creating and managing a thread can be greater than the computation time of the subtask itself; this happens, for instance, when the subtask is too small to compensate for the overheads, in which case the parallel algorithm can actually be slower than the serial algorithm.
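A minimal fork/join sketch of the scheme described above, using Python threads; the function f and the constants are hypothetical placeholders I chose for the example:

```python
import threading

def f(z):
    # Hypothetical procedure (subtask 2): any function of Z would do.
    return z * z + 1

def run_task(z_input):
    result = {}
    # Subtask (1): initialization computes Z (trivially here)...
    z = z_input
    # ...and the fork code creates a thread, passing Z as a parameter.
    worker = threading.Thread(target=lambda: result.update(f_value=f(z)))
    worker.start()
    # The parent thread could overlap other initialization work here.
    # Join code: wait for the procedure before finalization can run.
    worker.join()
    # Subtask (3): finalization consumes the value of f.
    return result["f_value"] + 10
```

The `start`/`join` calls are exactly the fork and join overhead the text describes: for a procedure this small, creating the thread costs far more than computing f, so the sequential version would actually be faster.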

The conclusion is that even for tasks that can be parallelized, the programmer has to evaluate the pros and cons and implement the optimal degree of parallelization. More parallel does not always imply faster!

Game developers are bound by similar limits. A game is essentially a sequential algorithm in which the state of the game at any instant of gameplay evolves as a function of the user's actions. The main algorithm is

State_1 >>>> User_Action_1 >>>> State_2 >>>> User_Action_2 >>>> State_3 >>>> ···


This is what is known as the "game loop"; it has to be a sequential loop because the game cannot perform the necessary computations and update the state before taking user input. Some components of the computation can be split from the main algorithm and run on separate threads as subtasks; three examples are background music, physics effects, and artificial intelligence. This is what has been done in modern games to speed them up on multicore systems. However, those subtasks are not fully disjoint from the main algorithm, because they depend on the decisions taken by the player during gameplay. For instance, in a shooter game such as Doom, following a corridor and turning left at the end to enter a room could mean finding a dozen AI-controlled enemies, whereas turning right could mean finding an arsenal with powerful weapons and a pair of first-aid kits. The thread that runs the artificial intelligence subtask has to be synchronized at every instant with the main thread that runs the game loop, and this introduces a limit on the degree of parallelism that can be achieved.

Games use several slave threads running subtasks synchronized by the main thread that runs the game loop. Below I reproduce the core loads for the game Call of Duty: Modern Warfare 2 under the DX11 API

As you can see, the system is bottlenecked by two cores. One core is running the main game loop, whereas another is running the main render thread. Adding more cores will not increase performance, because most of the cores are already idle. What modern APIs such as Vulkan allow is to break the rendering thread into multiple threads, eliminating one bottleneck on the CPU side. Next is a core load measured for the game Doom under Vulkan. This is a modern game running on a modern API, and an example of the current state of the art in parallelization for games.

Since all the subtasks have to be synchronized by the main thread, the main thread will continue to be the bottleneck of the system. A CPU with six or eight strong cores will continue to be better at gaming than a CPU with twelve or sixteen weak cores.

Another limit to parallelism is introduced by some ISAs. The x86 ISA is a serial ISA. This means that instructions are scheduled in linear order when the compiler transforms our program into x86 instructions. Consider the earlier example again

a = a + 10
b = m + 3


The compiler would generate code such as

mov eax, [a]
add eax, 10
mov [a], eax
mov ecx, [m]
add ecx, 3
mov [b], ecx


This is a sequence of x86 instructions. Modern x86 cores such as Zen or Skylake are superscalar out-of-order microarchitectures. Superscalar means the core can execute more than one instruction per cycle. Out-of-order means the core can execute instructions in an order different from that defined by the compiler. At run time, those cores load the above sequence of instructions from memory or cache, decode them, and analyze them to find dependences, generating a parallel schedule that reduces the time needed to execute the instructions. And here lies the problem. The hardware structures needed to transform a sequence of x86 instructions into an optimized parallel sequence are very complex and power hungry. In a superscalar core the IPC is given by $$IPC = \alpha W^\beta$$ where $$\alpha$$ and $$\beta$$ are parameters that depend on both the hardware and the code being executed, and $$W$$ is the length of the sequence of x86 instructions that has to be analyzed to find parallelism. Since $$\beta < 1$$, the relationship between performance and the length of the sequence is nonlinear. In general, we can take the approximation $$\beta = 1/2$$ and we recover a square-root law $$IPC = \alpha \sqrt{W}$$ Therefore, if we want to double the IPC, we need to improve the superscalar hardware to analyze four times more instructions! Some hardware structures of the core, such as fetch or decode, only have to be scaled by a factor of four, but other structures need much more aggressive scaling.
For example, to analyze the interdependences between two instructions we need only one comparison, because there is only one possible pair. Increase the window to four instructions and we need six comparisons: the first instruction against the second, the third, and the fourth; the second against the third and the fourth; and the third against the fourth. For eight instructions the number of comparisons needed is twenty-eight; in general, $$n$$ instructions require $$n(n-1)/2$$ pairwise comparisons. Detecting dependences among two thousand instructions requires almost two million comparisons! This cost obviously limits the number of instructions that can be considered for issue at once. Current cores have on the order of thousands of comparators.
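The pair-counting argument can be checked with a one-line helper (the name is mine):

```python
def dependence_comparisons(n):
    """Pairwise dependence checks among n instructions: n*(n-1)/2,
    the number of unordered pairs of instructions in the window."""
    return n * (n - 1) // 2
```

It reproduces the figures in the text: 1 comparison for a window of two instructions, 6 for four, 28 for eight, and 1,999,000 for two thousand.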

This is not the only scaling problem. Code has branches; non-scientific code often has a branch every eight or ten instructions. So in order to fetch a sequence of hundreds of instructions, the core has to know, in advance, which path will be taken at each bifurcation. This is the task of branch predictors. Imagine a predictor with an average accuracy of 90%; this means the prediction fails in only one branch out of every ten. It may seem that such high accuracy solves the problem of branching in code, but the accuracy degrades with each consecutive branch, because the probability of a failed prediction accumulates with each new bifurcation. Consider a simple code example with only binary branches

Whereas the probability that the core is on the correct path after the first bifurcation point is 90%, the probability that the core is on the correct path after the second branch is lower; the probability that the hardware correctly predicts the second branch is still 90%, but it is now conditioned on the core already being on the correct path after the first bifurcation. The probability that the core is on the correct path after the second bifurcation point is $$\frac{90}{100} \cdot \frac{90}{100} = \frac{81}{100}$$ The probability has dropped to 81%. After ten consecutive binary branches the probability drops to about 35%; this means the core is analyzing an incorrect sequence of instructions on two out of three occasions! Current state-of-the-art predictors are very complex and include correlating tables: those tables keep a record of the paths taken by the core before, and the predictor uses them to predict a new branch in the context of the branches leading up to it, instead of predicting each branch in isolation. Current state-of-the-art predictors are power hungry and take up valuable space on the core, but they can predict branches with an accuracy of 95% or even higher.
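The compounding of prediction accuracy over a run of branches can be checked with a tiny helper (the name is mine):

```python
def on_correct_path(accuracy, branches):
    """Probability the core is still on the correct path after a run of
    consecutive branches, each predicted with the same accuracy."""
    return accuracy ** branches
```

With 90% accuracy, two branches leave an 81% chance of being on the correct path, and ten branches leave only about 35%, as stated above.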

One could think that being wrong in 5% of cases has a negligible impact on performance, but that is not correct. Things are not linear. When the core detects that it has been working on the incorrect path, it has to cancel all the speculative work it has done in advance, flush the pipeline entirely, and start again at the point in the sequence of instructions where the prediction failed. This hurts performance; it is called the branch misprediction penalty. Even with predictors accurate to 95%, the misprediction penalty can reduce the performance of a high-performance core by about one third. In other words, the core is not doing useful work one third of the time.

There are more scaling problems, and they are the reason why engineers have hit an IPC wall. Indeed, if we plot the IPC per year for x86 processors, we find an image like the next one

The superscalar out-of-order microarchitectures behind the x86 ISA have hit a performance wall. No engineering team can break this wall; at best, engineers can spend years optimizing current microarchitectures to get a 2% IPC gain here and a 7% gain there. The only possibility of getting a quantum jump in IPC over the current designs is to replace a serial ISA such as x86 with a new ISA that scales up.

The existence of an IPC wall is not new. Research made decades ago on the limits of instruction-level parallelism in code identified a soft wall at about 10-wide cores. This wall was the reason why Intel engineers, in collaboration with Hewlett-Packard engineers, developed a new ISA that was meant to be scalable. The new ISA was dubbed EPIC, which stands for Explicitly Parallel Instruction Computing. Intel wanted to use the migration from 32 bits to 64 bits to abandon x86 and replace it with EPIC. The plan failed badly because the promises of the new ISA could not be fulfilled. The reason? The ISA was developed around the concept of a smart compiler, which no one could build. As a consequence, EPIC-based hardware was penalized by executing non-optimal binaries.

#### GPUs are easier to scale up for throughput, but GPUs cannot replace CPUs

Nvidia likes to share slides such as the next one

The evolution of GPUs is impressive, but the figure is measuring throughput. A GPU is a TCU (Throughput Compute Unit), whereas a CPU is an LCU (Latency Compute Unit); this is the terminology used by AMD in its HSA specification to classify heterogeneous compute units. GPUs are designed to crunch lots of numbers in a massively parallel way, when branches and the response time (latency) to changes in the code and/or data are not relevant; otherwise a CPU is needed. This is the reason why GPUs are not used to execute the operating system, for instance.

Why are GPUs easier to scale up for throughput? Consider a process node shrink that provides four times higher density, for instance a 14nm --> 7nm shrink. We could fit four times more transistors in the same space. The key here is how transistors are used in GPUs and CPUs. The relationship between IPC and number of transistors (the complexity of a microarchitecture) is not linear, but approximately given by $$IPC = a \sqrt{A}$$ where $$a$$ is a parameter and $$A$$ is the area used by the transistors that execute a single thread. This is the so-called Pollack's rule, which was derived empirically. For a theoretical derivation of the rule, check this work by me.

Quadrupling the number of transistors of a core will only produce a 100% increase in its IPC (assuming an ideal situation where there is no other bottleneck in the system and there is enough instruction-level parallelism in the code). Alternatively, we could just build three more cores identical to the original on the die

In the first case, the higher density provided by the new node has been used to increase the performance of each core by 100%; the total throughput has doubled as well. In the second case, the higher density has been used to quadruple the total throughput. The same node shrink can give us two different performance increases; however, keep in mind that the extra throughput achieved by quadrupling the number of cores (twice that of the single-bigger-core option) does not come for free, because per-thread latency stays the same, whereas in the first case latency has also been halved.
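The trade-off can be sketched with a small helper comparing the two ways of spending the transistor budget under Pollack's rule (the names are mine, and ideal scaling is assumed):

```python
import math

def spend_density(transistor_factor):
    """Two ways to spend an N-fold transistor budget, per Pollack's
    rule IPC ~ sqrt(area): one bigger core vs. N original cores."""
    big_core = {
        "per_thread_speedup": math.sqrt(transistor_factor),
        "throughput_gain": math.sqrt(transistor_factor),
    }
    many_cores = {
        "per_thread_speedup": 1.0,                  # latency unchanged
        "throughput_gain": float(transistor_factor) # ideal scaling
    }
    return big_core, many_cores
```

For a 4x budget, the bigger core doubles both per-thread performance and throughput, while four small cores quadruple throughput but leave per-thread latency untouched; this is why throughput-oriented GPUs scale so well with density.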

#### Speculative parallelization

As we saw above, for ordinary parallelization, a programmer or the compiler analyzes the instructions in a serial stream, finds control and data dependences, partitions the original stream into almost independent substreams (tasks), and inserts the necessary synchronization among tasks.

This partitioning into tasks requires that all the relevant dependences in the original stream of instructions be known at compile time; however, some dependences can be dynamic and only known at run time, in which case part of the parallelism existing in the stream of instructions cannot be exploited by ordinary parallelization.
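A minimal sketch of such a dynamic dependence (the names are mine): whether the iterations of this loop are independent cannot be decided at compile time, because it depends on the runtime contents of idx.

```python
def scale_entries(a, idx, factor):
    """If idx holds all-distinct indices, the iterations touch disjoint
    elements and could safely run in parallel; if idx repeats an index,
    the iterations conflict and must run in order. The compiler cannot
    know which case holds, so it must assume the worst."""
    for i in idx:
        a[i] *= factor
    return a
```

Speculative parallelization attacks exactly this situation: run the iterations in parallel assuming independence, detect conflicts at run time, and re-execute the offending iterations when the assumption fails.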