Basic remarks on parallel computing, by Juan Ramón González Álvarez (2018-10-05)<br><br>Parallel computing is a type of computation in which two or more computations are executed at the same time. The concept is well established in computer science and engineering, but there are lots of misunderstandings among the general public in forums, blogs, social media, and the comments sections of news sites. <br><br>I very often see people who believe that multithreading is the only type of parallel computing, and who are completely unaware that a twelve-core CPU can be faster than a thirty-two-core CPU if the twelve-core CPU has four times better SIMD hardware. I also very often see people who believe that everything can be made parallel; this belief is accompanied by unfair accusations that game developers are "<i>lazy</i>", as the reason why games do not scale to thirty-two cores today. A similar accusation is made against CPU engineers, with people thinking that x86 CPUs have increased IPC by only about 5% during recent generations because engineers "<i>rested on their laurels</i>". Such accusations are followed by the unfounded belief that game developers and CPU engineers will cease to be lazy, and that soon we will see games that require thirty-two cores and new CPUs with 30% higher IPC. <br><br>Most of those myths and misunderstandings are caused by a lack of familiarity with the hardware and software aspects of computation, but sometimes companies play a key role in misleading the general public. For instance, Nvidia likes to share graphs showing how GPU performance grows faster than CPU performance, and some people have taken this to claim that "<i>soon GPUs will replace CPUs</i>". <br><br>Parallel computing is a broad and complex topic. 
Each of the topics discussed here could fill an entire book, so I will write only a very basic introduction. I hope this article gives the background needed to fight the myths that spread in general forums, blogs, social media, and comments sections of news sites. <br><br><br><h4>There are several different forms of parallel computing: bit-level, instruction-level, data-level, and task-level</h4><br>I briefly comment on each of those forms of parallelism. <br><br><b><u>Bit-level</u></b><br><br>A form of parallel computing based on increasing the processor word size. Increasing the word size reduces the number of instructions the processor must execute in order to perform an operation on variables whose sizes are greater than the length of the word. For example, a 32-bit processor can add two 32-bit integers with a single instruction, whereas a 16-bit processor requires two instructions to complete the same operation. The 16-bit processor must first add the 16 lower-order bits of each integer, then add the 16 higher-order bits. Each 32-bit integer <pre><br /><code><br />10110101 10110101 10110101 10110101<br /></code><br /></pre>is split as <pre><br /><code><br />10110101 10110101<br />10110101 10110101<br /></code><br /></pre>The advantage of bit-level parallelism is that it is independent of the application, because it operates at the processor level. The programmer writes the operation and the hardware executes it in a single step or in several steps, depending on the hardware's capabilities. <br><br><b><u>Instruction-level</u></b><br><br>The ability to execute two or more instructions at the same time. 
Consider the arithmetic operations <pre><br /><code><br />a = a + 10<br />b = m + 3<br /></code><br /></pre>Since the second operation does not depend on the result of the first operation, both operations can be executed in parallel <pre><br /><code><br />a = a + 10; b = m + 3<br /></code><br /></pre>reducing the execution time by half. <br><br><b><u>Data-level</u></b><br><br>Data parallelism is the processing of multiple data elements at the same time by applying the same operation to all of them. Data parallelism is implemented in SIMD architectures (Single Instruction Multiple Data). <br><br>Suppose we want to move a series of objects a fixed distance along the z axis; this is equivalent to adding the distance to the z coordinate of each object <pre><br /><code><br />z1 = z1 + 61<br />z2 = z2 + 61<br />z3 = z3 + 61<br />z4 = z4 + 61<br />z5 = z5 + 61<br />z6 = z6 + 61<br />z7 = z7 + 61<br />z8 = z8 + 61<br /></code><br /></pre>In a 4-way SIMD architecture, the operation can be applied to four objects at once, reducing the number of execution steps by a factor of four. First the coordinates are grouped into 4-wide vectors, then vector addition is executed <pre><br /><code><br />(z1, z2, z3, z4) = (z1, z2, z3, z4) + (61,61,61,61)<br />(z5, z6, z7, z8) = (z5, z6, z7, z8) + (61,61,61,61)<br /></code><br /></pre>The wider the SIMD architecture, the fewer steps are needed. An 8-way architecture could do all the operations in a single step. <br><br><b><u>Task-level</u></b><br><br>Task parallelism is the mode of parallelism where tasks are divided among the processors to be executed simultaneously. Thread-level parallelism is when an application runs multiple threads at once. <br><br>For ordinary parallelization, a programmer or the compiler analyzes the instructions in a serial stream, finds control and data dependences, partitions the original stream into almost independent substreams (tasks), and inserts the necessary synchronization among the tasks. 
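As a minimal sketch of such a partitioning (the subtasks and names here are hypothetical, not from the text), two independent subtasks are submitted to separate threads, and a final dependent step synchronizes with both before combining their results:

```python
from concurrent.futures import ThreadPoolExecutor

def subtask_a():
    # Independent of subtask_b: no shared data, no ordering constraint.
    return sum(range(1000))

def subtask_b():
    # Independent of subtask_a.
    return max(range(1000))

# The two subtasks have no control or data dependences between them,
# so they may execute simultaneously on different threads.
with ThreadPoolExecutor(max_workers=2) as pool:
    future_a = pool.submit(subtask_a)
    future_b = pool.submit(subtask_b)
    # Synchronization point: this step depends on both results.
    total = future_a.result() + future_b.result()

print(total)  # 499500 + 999 = 500499
```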
<br><br><br><h4>There are limits to the amount of parallelism</h4><br>Not everything can be parallelized. Consider the equation \( ax^2 + bx + c = 0 \); the solutions are \[ x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \] \[ x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \] The elementary operations needed are <pre><br /><code><br />n1 = b * b<br />n2 = 4 * a * c<br />n3 = n1 - n2<br />n4 = SQRT(n3)<br />n5 = -b + n4<br />n6 = -b - n4<br />n7 = 2 * a<br />x1 = n5 / n7<br />x2 = n6 / n7<br /></code><br /></pre>Some of those operations are independent, but others are not. For instance, we cannot do the subtraction <code>n3</code> without first knowing the values <code>n1</code> and <code>n2</code>, and we cannot do the divisions <code>x1</code> and <code>x2</code> without first knowing the value of the denominator <code>n7</code>. The maximum achievable parallelism is <pre><br /><code><br />n1 = b * b; n2 = 4 * a * c; n7 = 2 * a<br />n3 = n1 - n2<br />n4 = SQRT(n3)<br />n5 = -b + n4; n6 = -b - n4<br />x1 = n5 / n7; x2 = n6 / n7<br /></code><br /></pre>It must be clear that the limitation on the level of parallelism illustrated by this simple example is <i>not a consequence of lazy programming</i>. The problem cannot be parallelized further due to <i>data dependences</i> among the different operations. <br><br>Even for problems that can be parallelized, programmers have to confront hardware and software limits. For instance, consider the sequence of instructions illustrated on the left-hand side of the next figure. The programmer has partitioned the code into an initialization subtask (1), a call to a procedure (2) that computes a function f that depends on a variable Z, and a finalization subtask (3) that receives the value of the function. When executed linearly, the whole task takes a certain time to finish. 
<br><br><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-Elh-ol8fsJI/W7ulcm6FuiI/AAAAAAAABHU/xR1i6C_wTVYtVmRVd3sYMD7jl8rrgFBQgCLcBGAs/s1600/Normal%2Bthreading.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-Elh-ol8fsJI/W7ulcm6FuiI/AAAAAAAABHU/xR1i6C_wTVYtVmRVd3sYMD7jl8rrgFBQgCLcBGAs/s400/Normal%2Bthreading.png" width="400" height="268" data-original-width="563" data-original-height="377" /></a></div><br><br>The three subtasks cannot start at the same time because the procedure depends on variables set by subtask (1), and the finalization subtask (3) depends on the value of f computed by (2). Some of you might tell me that since we know the value of the variable Z, we could simply add that value to the beginning of the procedure and thus accelerate its execution. Well, yes, you can do that if the value of the variable is known at compile time, but if the value is only known at run time --e.g. from user input or from data transmitted through the wire-- then the variable Z needs to be computed by the initialization subtask before its value is passed as a parameter to the procedure. <br><br>To parallelize the execution of the procedure, we have to insert extra code in subtask (1) in the parent thread. This <i>fork</i> code creates a new thread and passes the needed parameters to the procedure. Once the procedure is executed, extra code added to the new thread takes the result of the evaluation of the function f and <i>joins</i> with the thread that called the procedure, to continue with the execution of subtask (3). <br><br>The total time of execution is now smaller than in the original sequential execution. Parallelization has sped up execution. However, note that the code to execute is now larger because we have the fork and join sections. 
This is the overhead of the parallelization; it is extra code that is not present in the original sequential algorithm, but that is needed to synchronize and communicate among the different threads of a parallel algorithm. In the above example I assumed there is enough overlap between subtasks (1) and (2), that there is only one procedure, and that the fork and join overheads are small compared to the subtask computations. However, things can be different, and the cost of constructing and managing a thread can be greater than the computation time of the subtask itself; this happens, for instance, if the subtask is too small to compensate for the overheads, in which case the parallel algorithm can actually be slower than the serial algorithm. <br><br>The conclusion is that even for tasks that can be parallelized, the programmer has to evaluate the pros and cons and implement the optimal degree of parallelization. More parallel does not always imply faster! <br><br>Game developers are bound by similar limits. A game is essentially a sequential algorithm where the state of the game at any instant of the gameplay evolves as a function of the user response. The main algorithm is <pre><br /><code><br />State_1 >>>> User_Action_1 >>>> State_2 >>>> User_Action_2 >>>> State_3 >>>> ···<br /></code><br /></pre>This is what is known as the "<i>game loop</i>"; it has to be a sequential loop because the necessary computations cannot be performed and the state cannot be updated before taking the user input. Some components of the computations can be split from the main algorithm and run on separate threads as subtasks; three examples are background music, physics effects, and artificial intelligence. This is what has been done in modern games to speed them up on multicore systems. However, those subtasks are not fully disjoint from the main algorithm, because they depend on the decisions taken by the player during the gameplay. 
For instance, following a corridor and, at the end, turning left and entering a room in a shooter game such as Doom could mean finding a dozen artificial-intelligence enemies, whereas turning right could mean finding an arsenal with powerful weapons and a pair of first-aid kits. The thread that runs the artificial intelligence subtask has to be synchronized at every instant with the main thread that runs the game loop, so this introduces a limit on the degree of parallelism that can be achieved. <br><br>Another limitation to parallelism comes from draw calls that access the same memory location and must therefore execute sequentially. New APIs such as Vulkan have eliminated programming limitations of former APIs. A set of benefits comes from simplifying the APIs, eliminating unneeded intermediate layers that abstracted the underlying hardware, and allowing programmers to access the hardware in a more direct (and faster) way, reducing overhead and latency. Another eliminated limitation concerns explicit multithreading. Former APIs such as OpenGL loaded draw calls on a single thread context, which was then executed by the CPU, generating a bottleneck that forced the GPU to wait for the CPU to execute all the calls. New APIs such as Vulkan offer a way to distribute the rendering workload across multiple cores. Vulkan can exploit good old <i>task parallelism</i>, whereas OpenGL could not; this is the reason why modern CPUs can do many more draw calls under Vulkan than under OpenGL, but you cannot eliminate sequential limitations such as the one mentioned above about calls accessing the same memory location. Vulkan is multithreaded, but it does not just magically scale to the available cores; Vulkan simply allows a multithreading model and provides the needed tools and mechanisms. It is the programmer who has to manage and synchronize the threads, as in classic multithreaded CPU programming. 
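To illustrate that classic fork/join cost with a toy sketch (not a rendering workload; the task and its size are arbitrary assumptions): when each subtask is tiny, the fork and join overhead can make the threaded version slower than the serial one.

```python
import threading
import time

def tiny_task(out, i):
    out[i] = i * i  # far too little work to amortize a thread

N = 500
serial = [0] * N
t0 = time.perf_counter()
for i in range(N):
    tiny_task(serial, i)
serial_time = time.perf_counter() - t0

parallel = [0] * N
t0 = time.perf_counter()
# Fork: one thread per tiny subtask (deliberately naive).
threads = [threading.Thread(target=tiny_task, args=(parallel, i))
           for i in range(N)]
for t in threads:
    t.start()
# Join: synchronize with every worker before continuing.
for t in threads:
    t.join()
parallel_time = time.perf_counter() - t0

print(serial == parallel)  # True: both versions compute the same result
# On most systems parallel_time > serial_time here: overhead dominates.
print(serial_time, parallel_time)
```

The point is not the exact timings, which vary per machine, but that thread creation and synchronization are pure overhead relative to the sequential version.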
Again, there are limits to the degree of parallelism, as we saw before. <br><br>Games use several slave threads that run subtasks synchronized by the main thread that runs the game loop. Below I reproduce core loads for the game Call Of Duty: Modern Warfare 2 under the DX11 API <br><br><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-uI02Z1wCmN4/W8Bk9jlq9mI/AAAAAAAABH0/50U1tShqrww7RKrSftLQEClISVCeBNWKgCLcBGAs/s1600/DX11%2BCoD%2BMW2%2BMP%2Bcore%2Bload.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/-uI02Z1wCmN4/W8Bk9jlq9mI/AAAAAAAABH0/50U1tShqrww7RKrSftLQEClISVCeBNWKgCLcBGAs/s400/DX11%2BCoD%2BMW2%2BMP%2Bcore%2Bload.png" width="400" height="102" data-original-width="613" data-original-height="157" /></a></div><br><br>As you can see, the system is bottlenecked by two cores. One core is running the main game loop, whereas another is running the main render thread. Adding more cores will not increase performance, because most of the cores are already idle. What modern APIs such as Vulkan allow is to break the rendering thread into multiple threads, eliminating one bottleneck from the CPU. Next is a core load measured for the game Doom under Vulkan. This is a modern game running on a modern API. This is an example of the current state of the art in parallelization for games. 
<br><br><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-20hyLqZQR-o/W7umIE6CBZI/AAAAAAAABHc/3Fc6bWtN2_8-LK47omXIcZeN0aNrE4aqQCLcBGAs/s1600/Vulkan%2BDoom%2Bcore%2Bload.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-20hyLqZQR-o/W7umIE6CBZI/AAAAAAAABHc/3Fc6bWtN2_8-LK47omXIcZeN0aNrE4aqQCLcBGAs/s400/Vulkan%2BDoom%2Bcore%2Bload.png" width="400" height="35" data-original-width="1600" data-original-height="138" /></a></div><br><br>Now core utilization has improved a lot --click on the image for a zoom--, but most cores are still loaded under 50%. Only core number six is loaded at 98.7% of its capacity. This core is running the main thread, and it is the bottleneck of the system. You can add more cores to this system and the game Doom will not run faster, because core number six is already working nearly at full capacity. The rest of the cores are at 44.2% load on <i>average</i>. This means you could reduce the number of cores by half and the game would run the same. Reducing the number of cores doesn't mean that each remaining core would be loaded at about 88.4%. Recall what we said about overheads! Fewer threads mean fewer synchronizations and communications among the different threads, which implies the average load per core would be smaller, opening the door to further reducing the number of cores without affecting the playability of the game. An eight-core CPU would be enough to run the game at the same framerates, especially when we consider that an eight-core CPU usually runs at higher clocks than a twenty-four-core CPU. <br><br>Since all the subtasks have to be synchronized by the main thread, the main thread will continue to be the bottleneck of the system. A CPU with six or eight strong cores will continue to be better at gaming than a CPU with twelve or sixteen weak cores. <br><br>Another limit to parallelism is introduced by some ISAs. 
The x86 ISA is a serial ISA. This means that instructions are scheduled in linear order when the compiler transforms our program into x86 instructions. Consider the earlier example again <pre><br /><code><br />a = a + 10<br />b = m + 3<br /></code><br /></pre>The compiler would generate code such as <pre><br /><code><br />mov ecx, 10<br />mov edx, 3<br />add eax, ecx<br />add ebx, edx<br /></code><br /></pre>This is a sequence of x86 instructions. Modern x86 cores such as Zen or Skylake are superscalar out-of-order microarchitectures. <i>Superscalar</i> means the core has the ability to execute more than one instruction per cycle. <i>Out-of-order</i> means that it is capable of executing instructions in an order different from that defined by the compiler. At run time, those cores load the above sequence of instructions from memory or cache, then decode them and analyze them to find dependences, generating a parallel schedule to reduce the time needed to execute the instructions. And herein lies the problem. The hardware structures needed to transform a sequence of x86 instructions into an optimized parallel sequence are very complex and power hungry. In a superscalar core the IPC is given by \[ IPC = \alpha W^\beta \] where \(\alpha\) and \(\beta\) are parameters that depend on both the hardware and the code being executed, and \(W\) is the length of the sequence of x86 instructions that has to be analyzed to find parallelism. \(\beta <1\), which implies a nonlinear relationship between performance and the length of the sequence. In general, we can make the approximation \(\beta=1/2\) and recover a square-root law \[ IPC = \alpha \sqrt{W} \] Therefore, if we want to double the IPC, we need to improve the superscalar hardware to analyze four times more instructions! Some hardware structures of the core, such as fetching or decoding, will have to be scaled by a factor of four, but other structures need much more aggressive scaling. 
For example, to analyze the interdependences between two instructions we need only one comparison, because there is only one possible pairing of any two instructions; but increase the number of instructions to analyze to four and we need six comparisons --we have to compare the first instruction with the second instruction, the first with the third, the first with the fourth, the second with the third, the second with the fourth, and finally the third instruction with the fourth instruction--. For eight instructions the number of comparisons needed is twenty-eight. Detecting dependences among two thousand instructions requires almost two million comparisons! This cost obviously limits the number of instructions that can be considered for issue at once. Current cores have on the order of thousands of comparators. <br><br>This is not the only scaling problem. Code has branches; non-scientific code often has a branch every eight or ten instructions. So in order to fetch a sequence of hundreds of instructions, the core has to know, <i>in advance</i>, which path will be taken at each bifurcation. This is the task of branch predictors. Imagine a predictor with an average accuracy of 90%; this means that the prediction fails in only one branch out of every ten. It may seem that this very high accuracy solves the problem of branching in code, but the effective accuracy drops with each consecutive branch, because the probability of a failed prediction accumulates with each new bifurcation. 
Consider a simple code example with only binary branches <br><br><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-GSceR_Q3VNQ/W55N-9EztmI/AAAAAAAABDA/A2y-cLncubwvULDKFfHjYam0kaL2iTtJACLcBGAs/s1600/Branches.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-GSceR_Q3VNQ/W55N-9EztmI/AAAAAAAABDA/A2y-cLncubwvULDKFfHjYam0kaL2iTtJACLcBGAs/s400/Branches.png" width="366" height="400" data-original-width="500" data-original-height="547" /></a></div><br><br>Whereas the probability that the core is on the correct path after the first bifurcation point is 90%, the probability that the core is on the correct path after the second branch is lower; the probability that the hardware correctly predicts the second branch is still 90%, but this probability is now conditioned on the core already being on the correct path after the first bifurcation. The probability that the core is on the correct path after the second bifurcation point is now \[ \frac{90}{100} \frac{90}{100} = \frac{81}{100} \] The probability has been reduced to 81%. After ten consecutive binary branches the probability drops to about 35%; this means the core is analyzing an incorrect sequence of instructions on about two out of three occasions! Current state-of-the-art predictors are very complex and include correlating tables for predictions; those tables keep a record of the paths taken by the core before, and the predictor uses them to improve the prediction by predicting a new branch in the context of the former branches leading up to it, instead of predicting each branch in isolation. 
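The compounding of prediction accuracy described above is easy to verify numerically; this sketch assumes a flat 90% per-branch accuracy:

```python
accuracy = 0.90
p_on_correct_path = 1.0
for branch in range(1, 11):
    # Each new bifurcation multiplies in another 90% success probability.
    p_on_correct_path *= accuracy
    print(branch, round(p_on_correct_path, 3))
# After 2 branches: 0.81; after 10 branches: ~0.349, i.e. about 35%.
```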
Current state-of-the-art predictors are power-hungry and take up valuable space on the core, but they can predict branches with an accuracy of 95% or even higher <br><br><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-KxTnjlKuIgI/W55OqAt9JnI/AAAAAAAABDI/WflVNREAbMYjbESd9UIf-sDAlmIRPotlwCLcBGAs/s1600/Zen_core_layout.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/-KxTnjlKuIgI/W55OqAt9JnI/AAAAAAAABDI/WflVNREAbMYjbESd9UIf-sDAlmIRPotlwCLcBGAs/s400/Zen_core_layout.jpg" width="400" height="268" data-original-width="625" data-original-height="418" /></a></div><br><br>One could think that being wrong in 5% of cases has a negligible impact on performance, but that is not correct. Things are not linear. When the core detects that it has been working on the incorrect path, it has to cancel all the speculative work it has been doing in advance, flush the pipeline entirely, and start again at the point in the sequence of instructions where the prediction failed. This affects performance; it is called the branch mispredict penalty. Even with predictors accurate to 95%, the mispredict penalty can reduce the performance of a high-performance core by about one fourth. In other words, the core is not doing useful work one fourth of the time. <br><br>There are more scaling problems, and together they are the reason why engineers have hit an IPC wall. 
Indeed, if we plot the IPC per year for x86 processors, we find an image like the next one <br><br><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-Hbt2DV8UIvc/W55PZyuRGmI/AAAAAAAABDU/OL4ie74agHUKsRVpm4MskwYBO6DY2fl9wCLcBGAs/s1600/IPC_wall.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-Hbt2DV8UIvc/W55PZyuRGmI/AAAAAAAABDU/OL4ie74agHUKsRVpm4MskwYBO6DY2fl9wCLcBGAs/s400/IPC_wall.png" width="400" height="163" data-original-width="467" data-original-height="190" /></a></div><br><br>The superscalar out-of-order microarchitectures behind the x86 ISA have hit a performance wall. No engineering team can break this wall; at best, engineers can spend years working on optimizing current microarchitectures to get a 2% IPC gain here and a 7% gain there. The only possibility of getting a quantum jump in IPC over the current designs is to replace a serial ISA such as x86 with a new ISA that scales up. <br><br>The existence of an IPC wall is not new. Research done decades ago on the limits of instruction-level parallelism in code identified a soft wall at about 10-wide cores. This wall was the reason why Intel engineers, in collaboration with Hewlett Packard engineers, developed a new ISA that would be scalable. The new ISA was dubbed EPIC, which stands for Explicitly Parallel Instruction Computing. Intel wanted to use the migration from 32 bits to 64 bits to abandon x86 and replace it with EPIC. The plan failed badly because the promises of the new ISA could not be fulfilled. The reason? The ISA was developed around the concept of a smart compiler, which no one could build. As a consequence, EPIC-based hardware was penalized by executing non-optimal binaries. 
<br><br><br><h4>GPUs are easier to scale up for throughput, but GPUs cannot replace CPUs</h4><br>Nvidia likes to share slides such as the next one <br><br><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-5TcYs-XfmHE/W55QDBl7AFI/AAAAAAAABDc/YQtr40z9voMHKiENKj8AMWAVMTR573hUwCLcBGAs/s1600/CPU_GPU_Nvidia.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-5TcYs-XfmHE/W55QDBl7AFI/AAAAAAAABDc/YQtr40z9voMHKiENKj8AMWAVMTR573hUwCLcBGAs/s400/CPU_GPU_Nvidia.png" width="400" height="226" data-original-width="530" data-original-height="299" /></a></div><br><br>The evolution of GPUs is impressive, but the figure is measuring <i>throughput</i>. A GPU is a TCU (Throughput Compute Unit), whereas a CPU is an LCU (Latency Compute Unit); this is the terminology used by AMD in its HSA specification to classify heterogeneous compute units. GPUs are designed to crunch lots of numbers in a massively parallel way, when branches and the response time (latency) to changes in the code and/or data are not relevant; otherwise a CPU is needed. This is the reason why GPUs are not used to execute the operating system, for instance. <br><br>Why are GPUs easier to scale up for throughput? Consider a process node shrink that provides four times higher density, for instance a 14nm --> 7nm shrink. We could fit four times more transistors in the same space. The key here is how transistors are used in GPUs and CPUs. The relationship between IPC and the number of transistors (the complexity of a microarchitecture) is not linear, but is approximately given by \[ IPC = a \sqrt{A} \] where \(a\) is a parameter and \(A\) is the area used by the transistors that execute a single thread. This is the so-called Pollack's rule, which Pollack derived empirically. For a theoretical derivation of the rule, check <a href="http://vixra.org/abs/1808.0037">this work</a> by me. 
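Under this square-root rule, the trade-off between spending a density gain on one bigger core or on more identical cores can be sketched numerically (the constant \(a\) and the baseline area are arbitrary placeholders):

```python
import math

def pollack_ipc(a, area):
    # Pollack's rule: single-thread IPC grows as the square root of core area.
    return a * math.sqrt(area)

a, base_area = 1.0, 1.0
one_big_core = pollack_ipc(a, 4 * base_area)      # spend the 4x density on one core
four_small_cores = 4 * pollack_ipc(a, base_area)  # or on four copies of the core

print(one_big_core)      # 2.0: per-thread IPC doubles
print(four_small_cores)  # 4.0: total throughput quadruples, per-thread IPC unchanged
```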
<br><br>Quadrupling the number of transistors of a core will only produce a 100% increase in its IPC (we are assuming an ideal situation where there is no other bottleneck in the system and there is enough instruction-level parallelism in the code). Alternatively, we could just build three more cores identical to the original on the die <br><br><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-N0HsQPyD6a0/W55RI7EwGoI/AAAAAAAABDo/LqqXngPpZV0y-optqBvfA2yK8um4jnhIQCLcBGAs/s1600/Pollack%2BRule.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-N0HsQPyD6a0/W55RI7EwGoI/AAAAAAAABDo/LqqXngPpZV0y-optqBvfA2yK8um4jnhIQCLcBGAs/s400/Pollack%2BRule.png" width="400" height="351" data-original-width="816" data-original-height="717" /></a></div><br><br>In the first case, the higher density provided by the new node has been used to increase the performance of each core by 100%; the total throughput has doubled as well. In the second case, the higher density has been used to quadruple the total throughput. The same node shrink can give us two different performance increases; however, keep in mind that the twofold throughput advantage achieved by quadrupling the number of cores does not come for free, because the latency is the same as before, whereas in the first case the latency has also been halved. <br><br><br><h4>Speculative parallelization</h4><br>As we saw above, for ordinary parallelization, a programmer or the compiler analyzes the instructions in a serial stream, finds control and data dependences, partitions the original stream into almost independent substreams (tasks), and inserts the necessary synchronization among the tasks. 
<br><br>This partitioning into tasks requires all the relevant dependences in the original stream of instructions to be known at compile time; however, some dependences can be dynamic and only known at run time, in which case part of the parallelism existing in the stream of instructions cannot be exploited by ordinary parallelization. <br><br>Speculative parallelization is a dynamic technique that speculates on information that is not available at compile time and parallelizes the stream of instructions in the presence of ambiguous dependences. Since the speculation can turn out to be incorrect, speculative parallelization introduces further mechanisms for the detection of and recovery from actions that violate the ordering dictated by a sequential execution. An example is given below. A sequence of instructions is composed of two tasks, such that the first task produces a variable Z, which is consumed by the second task. Without speculation the second task has to wait for the completion of the first task before taking the value of Z. Adding speculation to the second thread, we predict a value for Z before the first task completes, so that the second task can start execution early, in parallel with the first task. After the execution of the second task completes, we need to verify that the speculated value was compatible with the real value produced by the first task; for instance, if the speculation assumed Z=5 and the first task produced Z=5, then all is correct; however, if the speculation assumed Z=3, the correct value 5 has to be fed to the second task and the task re-executed. 
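A minimal sketch of this speculate, verify, and recover scheme (the tasks, values, and predicted Z are all hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def first_task():
    # Produces Z; in a real program this value is only known at run time.
    return 5

def second_task(z):
    # Consumes Z.
    return z * 10 + 1

PREDICTED_Z = 5  # the speculated value; it may turn out to be wrong

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(first_task)
    # Speculate: start the second task early, in parallel with the first,
    # using the predicted value of Z.
    f2 = pool.submit(second_task, PREDICTED_Z)
    real_z = f1.result()
    result = f2.result()
    if real_z != PREDICTED_Z:
        # Misspeculation detected: discard the speculative result and
        # re-execute the second task with the real value.
        result = second_task(real_z)

print(result)  # 51: the same answer sequential execution would produce
```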
<br><br><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-oARUKy4MGy0/W7umsK1P3BI/AAAAAAAABHo/LqvF-7d4oJk6rs4lYMFFy7mjM7R7qaXSwCLcBGAs/s1600/Speculative%2Bthreading.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-oARUKy4MGy0/W7umsK1P3BI/AAAAAAAABHo/LqvF-7d4oJk6rs4lYMFFy7mjM7R7qaXSwCLcBGAs/s400/Speculative%2Bthreading.png" width="400" height="254" data-original-width="762" data-original-height="483" /></a></div><br><br>As you can see, speculative parallelization speeds up the execution of the program when the speculation is correct; otherwise the program runs slower than the original sequential version. In the general case, if you add too much speculation or if the speculation is not good enough, the result is usually worse than sequential execution; not only can the program run slower, but you are also using resources (e.g. extra cores) that could be used to run other tasks. Because of all those difficulties, speculative parallelization has not achieved mainstream adoption. <br><br><br><h4>Acknowledgements</h4><br>I thank Matt for the Doom core load and COD MW2 images, and for his constructive criticism on a former version of this article.<br><br><hr><br><br>Extreme scale general-purpose processor (2018-09-11)<br><br>Research in progress. 
<br><br>Target: <pre><br />5nm<br />>60 TFLOPS<br />Direct graphics rendering (no API)<br /></pre><br>Core pipeline: <pre><br /> ________<br /> | |<br /> | V<br />[IF] [BT] [BO] [ID] [EX] [MEM] [WB]<br />|__| |__________________|<br />RISC VLIW<br /><br />BT = Binary Translator<br />BO = Binary Optimizer<br /></pre><br>Die configuration: <br><pre><br />|CCCC||MMMM||CCCC|<br />|CCCC||MMMM||CCCC|<br />|CCCC||MMMM||CCCC|<br />|CCCC||MMMM||CCCC|<br /><br />M = memory<br />C = compute<br /></pre><br>Memory: <br><pre><br />NVRAM···CHIP···NVRAM</pre>Software pre-scheduling (optional) <br><br>In the original approach the BT+BO stages extract ILP from an ordinary RISC-like sequence of instructions. Additionally, we consider moving part of the hardware work to the compiler. In this alternative design the compiler extracts ILP and produces abstract pseudo-VLIW bundles of arbitrary instructions, with bundle size explicitly marked by the compiler with NOPs, like in the next example <pre>add $r13 = $r3, $r0<br />sub $r16 = $r6, 3<br />;;<br />shl $r13 = $r13, 3<br />shr $r15 = $r15, 9<br />ld.w $r14 = 0 [$r4]<br />;;<br /><br /><br />;; = NOP<br /></pre>Then a simplified core detects the NOPs and builds the VLIW bundle corresponding to the machine execution model. This simplified core basically needs only a BT stage and moves most of the BO stage to the compiler, simplifying the hardware even more. <br><br>Latency: <br><br>The BO stage provides tolerance to instruction execution latencies. Runahead mode is used to tolerate cache miss latencies. Runahead with reuse is being evaluated. The goal is to provide OoO performance with nearly in-order (IO) hardware complexity. During runahead mode the BO stage is shut down and the pipeline works with the bypass [BT] --> [ID] <br><br>Branches: <br><br>Multipath execution for hard-to-predict branches. 
<br><br>Execution engine: <br><br>Two VLIW configurations are being evaluated: 4-wide and 8-wide <pre><br />[branch] [int] [mem] [float]<br /><br />[branch] [int] [int] [int] [mem] [mem] [float] [float]<br /></pre><hr><br><br>Comparison with other approaches: <br><pre><br /> RISC --> VLIW CISC --> VLIW VLIW --> VLIW<br />-----------------------------------------------------------------------------<br />Transmeta Static (software) Dynamic (software)<br />Denver Dynamic (hardware) Dynamic (software)<br />This Dynamic (hardware) Dynamic (hardware)<br /></pre><br><br>Comparison with superscalar: <br><br>This approach has two advantages over superscalar: efficiency and modularity. <br><br>The VLIW part of the pipeline is much simpler than a superscalar pipeline of the same width. We are talking about one half or one third of the complexity of the superscalar approach; even the decode stage on a VLIW is simpler than the decode stage on a superscalar. Fetch stages are similar, and the binary translation stage in the new design is rather simple; all the complexity is hidden in the binary optimizer stage, but the new design allows a modular approach. <br><br>The ILP and OoO logic on a superscalar core work on uops, whereas the binary optimizer in the new design works on the target ISA instructions. This means that the optimizer has a synergy with the compiler; it is possible to move optimizations from the compiler to the core and back, finding the optimal hardware/software design, unlike in a superscalar approach, where the compiler and the superscalar logic are decoupled. 
<pre><br />Configuration A: Base code --> Optimization 1 --> Optimization 2 --> Optimization 3 --> Executing code<br /> |____________| |_________________________________________________|<br /> Compiler Hardware<br /><br />Configuration B: Base code --> Optimization 1 --> Optimization 2 --> Optimization 3 --> Executing code<br /> |_______________________________| |________________________________|<br /> Compiler Hardware<br /><br />Configuration C: Base code --> Optimization 1 --> Optimization 2 --> Optimization 3 --> Executing code<br /> |__________________________________________________| |____________|<br /> Compiler Hardware<br /></pre><br>It is also possible to segment the hardware optimizations and apply them in a modular way depending on different factors such as latency limits, power consumption, complexity of the code, and so on. This can be understood as a hardware version of the -On flags of a compiler. A basic modularization is shown in the pipeline, where the whole BT stage is bypassed after a cache miss, but more complex bypasses can be envisioned <pre><br /> _____________________<br /> | ________________|<br /> | | __________|<br /> | | | ____|<br /> | | | | |<br /> | | | | V<br />[IF] [BT] [BO1] [BO2] [BO3] [ID] [EX] [MEM] [WB]<br /></pre><br><br>This core can be thought of as a hybrid between superscalar and VLIW. Modifying the design point or the runtime parameters, the core can perform more like a superscalar or more like a VLIW. E.g. if we remove most of the BO stage and execute the above alternative code "Software pre-scheduling (optional)", then the core would just work as a compressed VLIW. My goal is to place the design point closer to VLIW than to superscalar. 
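The modular, bypassable optimizer stages described above can be modeled in software as a budgeted pipeline of passes, a hardware analogue of a compiler's optimization levels. Stage names, costs, and the latency budget below are made-up illustrative values, not a real design:

```python
# Sketch of modular, bypassable optimizer stages: passes are applied in order
# until a latency budget is spent, and the remaining passes are bypassed, as in
# the pipeline diagram above. Stage names, costs and budgets are assumptions.

STAGES = [                  # (name, cost in cycles) -- assumed values
    ('BO1_schedule', 2),
    ('BO2_rename', 3),
    ('BO3_fuse', 4),
]

def select_stages(latency_budget):
    """Return the optimizer stages that fit within the latency budget."""
    applied, spent = [], 0
    for name, cost in STAGES:
        if spent + cost > latency_budget:
            break                      # bypass this stage and the rest
        applied.append(name)
        spent += cost
    return applied

print(select_stages(0))   # cache-miss/runahead case: full bypass
print(select_stages(5))   # partial optimization: BO1 + BO2
print(select_stages(9))   # enough budget: all three stages
```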
<pre>superscalar <··················[·]········> VLIW</pre>Juan Ramón González Álvarezhttps://plus.google.com/109139504091371132555noreply@blogger.comtag:blogger.com,1999:blog-3564535652306217326.post-5449726350992628982018-09-08T11:04:00.000-07:002018-10-05T17:49:44.659-07:00Next article draft* It seems you use the compiler to extract ILP beyond basic blocks (trace level? hyperblocks?) and generate variable-length bundles. I assume you use stop bits to mark the end of each VLIW bundle. What is the difference from a compressed VLIW? <br><br>* What approach is used for solving the problem of hard-to-predict branches? <br><br>* Do bundles correspond to predefined VLIW slots for a fixed machine, i.e. [Branch] [Compare] [MADD] [MADD] [LOAD] ···, or are they abstract bundles of independent microops: [uop1] [uop2] [uop3] [uop4] [uop5] ···? <br><br>* "Sustained up to 8 microops" seems to be the max ILP. What is the sustained ILP for general code such as the SPEC suite: ILP ~ 1.72? <br><br>* The core seems to have a non-stalling in-order pipeline. The mention of poison bits seems to indicate some kind of runahead mode. Can you confirm you use runahead during cache stalls? Are all instructions executed under runahead mode reexecuted during normal mode, or only the poisoned instructions? <br><br>* One slide claims a 32MB combined L2+L3 size, but another slide claims 256KB+512KB per core. So I think 32MB is only the L3 size. Are you using exclusive cache policies? <br><br>* What does "very short wires mitigating the slow wires problem" mean? Reducing long wires? <br><br>* You claim that Itanium was stalled 50% of the time and Prodigy achieves less than a 20% stall. However, one slide shows a 69% unstalled percentage for Prodigy. You claim a normal OoO core stalls 15% of the time. I have difficulty accepting this value. Modern state-of-the-art OoO cores are stalled most of the time. 
<br><br>* Do cores have boost policies or run at a fixed 4GHz?Juan Ramón González Álvarezhttps://plus.google.com/109139504091371132555noreply@blogger.comtag:blogger.com,1999:blog-3564535652306217326.post-59401967809964074212018-08-13T10:42:00.000-07:002018-08-13T11:46:01.808-07:00What I would like for IceLake and Zen2... but is not happeningAMD and Intel have their own plans for their forthcoming products, and we have to accept their plans and purchase future hardware choosing among what both companies will offer us in the next years. My vision is different, as I wrote in a pair of recent tweets. This is what I would like for Icelake. <br><br><H4>Intel</H4><br>AVX512 hardware occupies too much space: register files, caches, datapaths, and big execution units. AVX512 is important in HPC and servers, but it is less important for gamers and mainstream customers. Even in HPC/servers, I don't think that wide SIMD is the optimal design point because of the divergence problem. <br><br>Personally, I would eliminate both 512bit and 256bit hardware and make a narrower core. The core I propose would measure about 4mm<sup>2</sup> on Intel 10nm. This is about <i>one half</i> the size of a Skylake client core. This thinning of the core has two advantages. First, the ability to reach higher clocks, because the max frequency achievable by a core is a function of its size; do not expect miracles, but a few hundred extra MHz are welcome. Second, the possibility of adding more cores in the same die size; that is, within the same cost and power budgets. Who would not prefer more cores for general code rather than AVX512 hardware working only in half a dozen customer workloads and zero games? <br><br>I would also like to see the iGPU become optional. Currently the mainstream client platform from Intel includes integrated graphics, whereas the enthusiast platform does not include one. Give people more choice! Let graphics be optional on both platforms. 
And to reduce costs, use a multidie approach around EMIB. Below I add an illustration of the concept. This is a dual-die configuration. The SoC on top is made of two CPU dies for a total of eight CPU cores. The SoC at the bottom has half the CPU cores replaced by three GPU cores. <br><br><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-aLasBY5Ys08/W3G3eLLkI6I/AAAAAAAABAU/BcJuENStdC4J347yofv_0GIklDfp03eMACLcBGAs/s1600/CPU-CPU_GPU-CPU_concept.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-aLasBY5Ys08/W3G3eLLkI6I/AAAAAAAABAU/BcJuENStdC4J347yofv_0GIklDfp03eMACLcBGAs/s1600/CPU-CPU_GPU-CPU_concept.gif" style="background:#ebebeb" width="400"/></a></div><br><br><H4>AMD</H4><br>The AMD Zen core already lacks 256bit and 512bit SIMD units, but it comes in a weird 4-core CCX configuration with disjoint L3 caches. I would like both CCXs in the Zeppelin die to be combined into a single module with eight cores sharing a unified L3 cache. 
<br><br><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-h4KbDwe-Zjs/W3G7bDI4iSI/AAAAAAAABAg/YwmwNvEwtVEN82LS46eJWCfWPN3hH35LwCLcBGAs/s1600/8-core%2BCCX%2BRyzen.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-h4KbDwe-Zjs/W3G7bDI4iSI/AAAAAAAABAg/YwmwNvEwtVEN82LS46eJWCfWPN3hH35LwCLcBGAs/s1600/8-core%2BCCX%2BRyzen.gif" width="400"/></a></div>Juan Ramón González Álvarezhttps://plus.google.com/109139504091371132555noreply@blogger.comtag:blogger.com,1999:blog-3564535652306217326.post-35945261854456505632017-07-04T04:30:00.001-07:002017-07-04T04:51:41.828-07:00Renormalized field theory from particle theoryContrary to the conventional wisdom that pretends that particle theory has been disproved and replaced by field theory, we demonstrate that Hamiltonian particle theory gives an improved version of the usual field-theoretic Hamiltonian, without the difficulties associated with divergences. Moreover, particle theory is formulated in terms of the physical charges and masses of particles, whereas field theory can only use unphysical bare quantities. <br><br>Newton introduced a model of instantaneous direct interactions among massive particles. This model was later replicated by Coulomb for charged particles and became known as action-at-a-distance; an unfortunate name that has generated unending polemics among physicists and philosophers. A better name is <b>direct-particle-interaction</b>. Physicists, dissatisfied with the action-at-a-distance model of interactions, introduced a contact-action model based on fields. The mystery of how one particle interacts with another particle without a mediator, \[ \mathrm{particle} \Longleftrightarrow \mathrm{particle} , \] was replaced by a new pair of mysteries: how do particles and fields interact without a mediator? 
\[ \mathrm{particle} \Longleftrightarrow \mathrm{field} \Longleftrightarrow \mathrm{particle} \] But doubling the number of mysteries was not the only problem. Field theory introduced a broad collection of difficulties; divergences and violations of causality and conservation laws are mentioned often in the literature, but we also have the introduction in the formalism of unobservable systems with an infinite number of degrees of freedom, the introduction of the unphysical bare particles and virtual particles, or the inherent time-asymmetry. General relativity only made the situation worse; general relativity is not an ordinary field theory, because the role of the gravitational field is replaced by a curved spacetime, which introduces further difficulties atop ordinary field theories. Of course, the same physicists who claim that the old action-at-a-distance model of Newtonian gravity was mysterious don't even bother to ask how a massive body curves spacetime or how another body detects the curvature of the spacetime and reacts to it by moving in a different way. <br><br>The conventional wisdom is that action-at-a-distance was disproved by experiment in the 19th century, but when one checks the sources for those bold claims one finds that their authors are ignoring the physical and mathematical differences between field theories and general relativity on one side and the theories of Coulomb and Newton on the other. Consider a standard, highly regarded textbook <b>[1]</b>; you can see therein that Steven Weinberg claims in section 8.3 that the field-theoretic quantity \[ V_\mathrm{field} = \frac{1}{2} \int \mathrm{d}^3 \boldsymbol{x} \int \mathrm{d}^3 \boldsymbol{y} \frac{\rho(\boldsymbol{x})\rho(\boldsymbol{y})}{4\pi\epsilon_0 |\boldsymbol{x}-\boldsymbol{y}|} \] is "<i>the familiar Coulomb energy</i>". His claim is <b>not</b> correct. 
First, the above expression is fully static (there is no time dependence), whereas the true Coulomb energy depends on time <i>implicitly</i> via the positions of the particles as \( V_\mathrm{Coulomb}(\{\boldsymbol{r}_i(t)\}) \). Second, the expression given by Weinberg is infinite, whereas the true Coulomb energy is finite. There are other differences between (3) and the true Coulomb energy \( V_\mathrm{Coulomb} \), but they are more subtle and beyond the scope of this article. <br><br>In this article, we will demonstrate how particle theory yields an improved version of field theory. We will restrict the discussion to electromagnetism. We will start with the following energy for an isolated system of \(N\) charges moving at low velocities \[ E = \sum_i \frac{\boldsymbol{p}_i^2}{2m_i} + \frac{1}{2} \sum_i \sum_{j\neq i} \frac{e_ie_j}{4\pi\epsilon_0 | \boldsymbol{r}_i - \boldsymbol{r}_j |} \Big( 1 - \frac{\boldsymbol{p}_i\boldsymbol{p}_j}{m_im_jc^2} \Big) .\] Next we introduce the particle potentials \[ \phi_i \equiv \sum_{j\neq i} \frac{e_j}{4\pi\epsilon_0 | \boldsymbol{r}_i - \boldsymbol{r}_j |} \] \[ \boldsymbol{A}_i \equiv \sum_{j\neq i} \frac{e_j}{4\pi\epsilon_0 | \boldsymbol{r}_i - \boldsymbol{r}_j |} \frac{\boldsymbol{p}_j}{m_jc} ,\] and write a more concise expression for the energy \[ E = \sum_i \Big( \frac{\boldsymbol{p}_i^2}{2m_i} + \frac{1}{2} e_i \phi_i - \frac{1}{2} \frac{e_i\boldsymbol{p}_i}{m_ic} \boldsymbol{A}_i \Big) . \] Field theory is formulated over a \(4D\) spacetime background instead of over a \(6N\) phase space; as a consequence, velocities acquire a more relevant status than momenta. As a first step to derive field theory from particle theory, we need to replace momenta with velocities. 
Using Hamilton's equations we can obtain the velocities \[ \boldsymbol{v}_i = \frac{\boldsymbol{p}_i}{m_i} - \sum_{j\neq i} \frac{e_ie_j}{4\pi\epsilon_0 | \boldsymbol{r}_i - \boldsymbol{r}_j |} \frac{\boldsymbol{p}_j}{m_im_jc^2} = \frac{\boldsymbol{p}_i}{m_i} - \frac{e_i}{m_ic} \boldsymbol{A}_i \] and substitute them to obtain \[ E = \sum_i \Big( \frac{m_i\boldsymbol{v}_i^2}{2} + \frac{1}{2} e_i \phi_i + \frac{1}{2} \frac{e_i\boldsymbol{v}_i}{c} \boldsymbol{A}_i \Big) .\] This expression depends on \(2N\) potentials, whereas classical electrodynamics only deals with a pair of potentials. The reduction is achieved by eliminating the constraint \(j\neq i\). For instance, for the scalar potential, \[ \sum_{j\neq i} \frac{e_j}{4\pi\epsilon_0 | \boldsymbol{r}_i - \boldsymbol{r}_j |} = \sum_j \frac{e_j}{4\pi\epsilon_0 | \boldsymbol{r}_i - \boldsymbol{r}_j |} - \frac{e_i}{4\pi\epsilon_0 | \boldsymbol{r}_i - \boldsymbol{r}_i |} ,\] which produces \[ \phi_i = \phi - \phi_i^\mathrm{self} ,\] with a similar expression for the vector potential. The energy is now \[ E = \sum_i \frac{m_i\boldsymbol{v}_i^2}{2} + \frac{1}{2} \sum_i \Big[ e_i \Big( \phi - \phi_i^\mathrm{self} \Big) + \frac{e_i\boldsymbol{v}_i}{c} \Big( \boldsymbol{A} - \boldsymbol{A}_i^\mathrm{self} \Big) \Big] .\] The new potentials \(\phi\) and \(\boldsymbol{A}\) diverge, but \(\phi_i^\mathrm{self}\) and \(\boldsymbol{A}_i^\mathrm{self}\) eliminate this divergence and the total energy \(E\) remains a finite quantity. We will see below that the lack of those \(N\) correction terms in field theory is the reason why field theory predicts nonsensical infinite energies. 
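The velocity relation above can be checked numerically: Hamilton's equation \(\boldsymbol{v}_i = \partial E / \partial \boldsymbol{p}_i\), applied to the low-velocity energy, must reproduce the closed-form expression \(\boldsymbol{v}_i = \boldsymbol{p}_i/m_i - (e_i/m_ic)\boldsymbol{A}_i\). A minimal sanity check, with illustrative particle data and units chosen so that \(1/4\pi\epsilon_0 = c = 1\) (both choices are mine, not from the derivation):

```python
# Check that Hamilton's equation v_i = dE/dp_i reproduces the closed-form
# velocities v_i = p_i/m_i - (e_i/m_i) A_i derived above.
# Units 1/(4*pi*eps0) = c = 1; masses, charges and positions are illustrative.
import math

m = [1.0, 2.0, 1.5]
e = [1.0, -1.0, 0.5]
r = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
p = [[0.1, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, -0.1]]
N = 3

def dist(i, j):
    return math.dist(r[i], r[j])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def energy(p):
    """Low-velocity energy: kinetic term plus the pairwise interaction."""
    E = sum(dot(p[i], p[i]) / (2 * m[i]) for i in range(N))
    for i in range(N):
        for j in range(N):
            if j != i:
                E += 0.5 * e[i] * e[j] / dist(i, j) * (1 - dot(p[i], p[j]) / (m[i] * m[j]))
    return E

def velocity(i):
    """Closed-form v_i = p_i/m_i - (e_i/m_i) A_i, with A_i the particle potential."""
    A = [sum(e[j] / dist(i, j) * p[j][k] / m[j] for j in range(N) if j != i)
         for k in range(3)]
    return [p[i][k] / m[i] - e[i] / m[i] * A[k] for k in range(3)]

# Hamilton's equation v_i = dE/dp_i, by central finite differences
h = 1e-6
for i in range(N):
    for k in range(3):
        p[i][k] += h; Ep = energy(p)
        p[i][k] -= 2 * h; Em = energy(p)
        p[i][k] += h
        assert abs((Ep - Em) / (2 * h) - velocity(i)[k]) < 1e-6
print("OK: Hamilton's equations reproduce the closed-form velocities")
```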
Multiplying both sides of (8) by \(m_i\), replacing \(\boldsymbol{A}_i\) by \((\boldsymbol{A} - \boldsymbol{A}_i^\mathrm{self})\) and reorganizing the result gives \[ \Big( m_i \boldsymbol{v}_i - \frac{e_i}{c} \boldsymbol{A}_i^\mathrm{self} \Big) = \boldsymbol{p}_i - \frac{e_i}{c} \boldsymbol{A} .\] Multiplying again both sides by \( \boldsymbol{v}_i\) and iterating finally produces the identity \[ \Big( m_i \boldsymbol{v}_i^2 - \frac{e_i\boldsymbol{v}_i}{c} \boldsymbol{A}_i^\mathrm{self} \Big) = \Big( \boldsymbol{p}_i - \frac{e_i}{c} \boldsymbol{A} \Big) \frac{1}{m_i - m_i^\mathrm{self}} \Big( \boldsymbol{p}_i - \frac{e_i}{c} \boldsymbol{A} \Big) ,\] where \(m_i^\mathrm{self} \equiv (e_i / \boldsymbol{v}_i c) \boldsymbol{A}_i^\mathrm{self} \), which is evidently a divergent quantity, sometimes named the "<i>self-mass</i>" or the "<i>electromagnetic mass</i>" in the field-theoretic literature. Replacing \( \boldsymbol{A}_i^\mathrm{self}\) by its value, expanding it in a power series in \((v/c)\) and retaining only the first term in the series, we obtain \[ m_i^\mathrm{self} = \frac{e_i^2}{4\pi\epsilon_0 |\boldsymbol{r}_i -\boldsymbol{r}_i| c^2} ,\] a result first obtained by Dirac. 
The above identity (15) can be used in (13) to give \[ E = \sum_i \frac{1}{2} \frac{1}{m_i - m_i^\mathrm{self}} \Big( \boldsymbol{p}_i - \frac{e_i}{c} \boldsymbol{A} \Big)^2 + \frac{1}{2} \sum_i \Big( e_i \phi + \frac{e_i\boldsymbol{v}_i}{c} \boldsymbol{A} \Big) - E^\mathrm{self} \] Finally, using densities \(\rho(\boldsymbol{z},t) \equiv \sum_i e_i \delta(\boldsymbol{z} - \boldsymbol{r}_i(t))\) and currents \(\boldsymbol{j}(\boldsymbol{z},t) \equiv \sum_i e_i \boldsymbol{v}_i\delta(\boldsymbol{z} - \boldsymbol{r}_i(t))\) and combining both masses into a new concept of mass as \(m_i^\mathrm{bare} \equiv m_i - m_i^\mathrm{self}\) yields \[ E = \sum_i \frac{1}{2m_i^\mathrm{bare}} \Big( \boldsymbol{p}_i - \frac{e_i}{c} \boldsymbol{A} \Big)^2 + \frac{1}{2} \int \mathrm{d}V \Big( \rho \phi + \boldsymbol{j} \boldsymbol{A} \Big) - E^\mathrm{self} .\] The integral can be written in an alternative form, obtaining the final result as a function of the electric field \(\boldsymbol{E}\) and the magnetic field \(\boldsymbol{B}\) \[ E = \sum_i \frac{1}{2m_i^\mathrm{bare}} \Big( \boldsymbol{p}_i - \frac{e_i}{c} \boldsymbol{A} \Big)^2 + \frac{1}{8\pi} \int \mathrm{d}V \Big( \boldsymbol{E}^2 + \boldsymbol{B}^2 \Big) - E^\mathrm{self} .\] Except for the correction term \(E^\mathrm{self}\), this expression is fully analogous to the one proposed by field theory. The difference lies in the physical interpretation: the integral does not represent a separate system, the field, with its own degrees of freedom; it simply represents a component of the full interaction among the particles of the system. 
<br><br>We can also split the electric field into longitudinal and transversal components \( \boldsymbol{E}^2 = \boldsymbol{E}_\mathrm{L}^2 + \boldsymbol{E}_\mathrm{T}^2\) and then use the relation \[ \frac{1}{2} \int \boldsymbol{E}_\mathrm{L}^2 \mathrm{d}V = \frac{1}{2} \sum_i\sum_{j\neq i} \frac{e_i e_j}{4\pi\epsilon_0 | \boldsymbol{r}_i - \boldsymbol{r}_j |} + E^\mathrm{self} \] to write the energy in the alternative form \[ E = \sum_i \frac{1}{2m_i^\mathrm{bare}} \Big( \boldsymbol{p}_i - \frac{e_i}{c} \boldsymbol{A} \Big)^2 + \frac{1}{2} \sum_i\sum_{j\neq i} \frac{e_i e_j}{4\pi\epsilon_0 | \boldsymbol{r}_i - \boldsymbol{r}_j |} + \frac{1}{8\pi} \int \mathrm{d}V \Big( \boldsymbol{E}_\mathrm{T}^2 + \boldsymbol{B}^2 \Big) .\] Note that field theory lacks the \(E^\mathrm{self}\) term in (19); therefore physicists can only obtain (21) by using the following <b>nonsensical</b> expression \[ \frac{1}{2} \int \boldsymbol{E}_\mathrm{L}^2 \mathrm{d}V = \frac{1}{2} \sum_i\sum_j \frac{e_i e_j}{4\pi\epsilon_0 | \boldsymbol{r}_i - \boldsymbol{r}_j |} = \frac{1}{2} \sum_i\sum_{j\neq i} \frac{e_i e_j}{4\pi\epsilon_0 | \boldsymbol{r}_i - \boldsymbol{r}_j |} \] This has the same validity as writing \((\infty = 6)\). Despite being nonsensical, the expression (22) is found in the mainstream literature; it is equation 1.57 in textbook <b>[2]</b>. <br><br>We have demonstrated that particle theory is superior to field theory, and that field-theoretic expressions such as (3) cannot even reproduce the age-old Coulomb energy, at least not without using certain amounts of mathematical funambulism, such as making \((\infty-\infty)\) equal to the value required by experiment or pretending that \(\Phi(R(t)) = \Phi(\boldsymbol{r},t)\). <br><br><h4>References and Notes</h4><br><b>[1]</b> The Quantum Theory of Fields, Volume I <b>1996:</b> <i>Cambridge University Press; Cambridge.</i> Weinberg, Steven. 
<br><br><b>[2]</b> Quantum Field Theory, Revised Edition <b>1999:</b> <i>John Wiley & Sons Ltd.; Chichester.</i> Mandl, F.; Shaw, G. <br><br>Juan Ramón González Álvarezhttps://plus.google.com/109139504091371132555noreply@blogger.comtag:blogger.com,1999:blog-3564535652306217326.post-42742308927041860272017-06-02T06:10:00.002-07:002017-06-03T04:56:48.591-07:00Renormalization as a direct-particle-action correction to field theoryOne of the notorious difficulties with field theory is its prediction of nonsensical infinite values for physical properties such as energy. Renormalization is a procedure by which the divergent parts of a calculation, those leading to the nonsensical infinite results, are absorbed by redefinition into a few measurable quantities, thus yielding finite answers. This work shows that the renormalization counterterms added to field-theoretic Hamiltonians and Lagrangians are a consequence of direct-particle-action corrections to field theory. Some widespread misunderstandings are also corrected. <br><br>Most physicists ignore the physical and mathematical differences between direct-particle-actions in the Coulomb and Newton theories and contact-actions in field theories and General Relativity. If you check standard textbooks, such as the one by Steven Weinberg <b>[1]</b>, you can see that he claims in section 8.3 that the field-theoretic quantity \[ V_\mathrm{field} = \frac{1}{2} \int \mathrm{d}^3 \boldsymbol{x} \int \mathrm{d}^3 \boldsymbol{y} \frac{\rho(\boldsymbol{x})\rho(\boldsymbol{y})}{4\pi\epsilon_0 |\boldsymbol{x}-\boldsymbol{y}|} \] is "<i>the familiar Coulomb energy</i>". His claim is <b>not</b> correct. First, the above expression is static, whereas the true Coulomb energy depends on time <b>implicitly</b> via the particle positions as \( V_\mathrm{Coulomb}(\boldsymbol{r}_1(t),\boldsymbol{r}_2(t)) \). Second, the expression given by Weinberg is infinite, whereas the true Coulomb energy is finite. 
There are other differences, but they are more subtle and not of interest for the goals of this article. <br><br>If you check the revision of classical electrodynamics given in section 1.4.1 of the textbook by Mandl and Shaw <b>[2]</b>, you can see that, in contrast with Weinberg, the authors allow the integral to carry an <b>explicit</b> time dependence \[ \frac{1}{2} \int \boldsymbol{E}_\mathrm{L}^2 \mathrm{d}^3 \boldsymbol{x} = \frac{1}{2} \int \mathrm{d}^3 \boldsymbol{x} \int \mathrm{d}^3 \boldsymbol{y} \frac{\rho(\boldsymbol{x},t)\rho(\boldsymbol{y},t)}{4\pi\epsilon_0 |\boldsymbol{x}-\boldsymbol{y}|} . \] Introducing the usual charge density \( \rho(\boldsymbol{z},t) = \sum_i e_i \delta(\boldsymbol{z} - \boldsymbol{r}_i(t)) \), Mandl and Shaw write \[ \frac{1}{2} \int \boldsymbol{E}_\mathrm{L}^2 \mathrm{d}^3 \boldsymbol{x} = \frac{1}{2} \sum_i\sum_{j} \frac{e_i e_j}{4\pi\epsilon_0 |\boldsymbol{r}_i -\boldsymbol{r}_j|} = \frac{1}{2} \sum_i\sum_{j\neq i} \frac{e_i e_j}{4\pi\epsilon_0 |\boldsymbol{r}_i -\boldsymbol{r}_j|} ,\] stating that the last expression, the one where they have "<i>dropped the infinite self-energy which occurs for point charges</i>", is "<i>the Coulomb interaction</i>". Ignoring subtle issues arising from the difference between time-implicit and time-explicit dependences <b>[3]</b>, their equation is as meaningless as writing \( (\infty = 7) \). <br><br>For simplicity we will work in the classical and nonrelativistic regime. The energy with <a href="http://www.juanrga.com/2016/07/instantaneous-electromagnetic.html">direct-particle interactions</a> is given in this limit by \[ E = \sum_i \frac{\boldsymbol{p}_i^2}{2 m_i} + \frac{1}{2} \sum_i\sum_{j\neq i} \frac{e_i e_j}{4\pi\epsilon_0 |\boldsymbol{r}_i -\boldsymbol{r}_j|} \left( 1 - \frac{\boldsymbol{p}_i \boldsymbol{p}_j}{m_i m_j c^2} \right) ,\] which is a finite quantity, because self-interactions are dropped by \( j\neq i \). 
With the help of a Kronecker delta we can rewrite the summation in a more symmetrical way \[ E = \sum_i \frac{\boldsymbol{p}_i^2}{2 m_i} + \frac{1}{2} \sum_i\sum_j \frac{e_i e_j}{4\pi\epsilon_0 |\boldsymbol{r}_i -\boldsymbol{r}_j|} \left( 1 - \frac{\boldsymbol{p}_i \boldsymbol{p}_j}{m_i m_j c^2} \right) \left( 1 - \delta_{ij} \right) .\] Of course, this remains a finite quantity. Partially reorganizing the expression yields \[ E = \frac{1}{2} \sum_i \left( \frac{\boldsymbol{p}_i^2}{m_i} + \frac{e_i^2}{4\pi\epsilon_0 |\boldsymbol{r}_i -\boldsymbol{r}_i|} \frac{\boldsymbol{p}_i^2}{m_i^2 c^2} \right) + \frac{1}{2} \sum_i\sum_j \frac{e_i e_j}{4\pi\epsilon_0 |\boldsymbol{r}_i -\boldsymbol{r}_j|} \left( 1 - \frac{\boldsymbol{p}_i \boldsymbol{p}_j}{m_i m_j c^2} \right) + V_\mathrm{counterterm} ,\] with the divergent counterterm defined by \[ V_\mathrm{counterterm} = - \frac{1}{2} \sum_i \frac{e_i^2}{4\pi\epsilon_0 |\boldsymbol{r}_i -\boldsymbol{r}_i|} .\] Finally, using densities and currents we obtain \[ E = \frac{1}{2} \sum_i \left( \frac{\boldsymbol{p}_i^2}{m_i} + \frac{e_i^2}{4\pi\epsilon_0 |\boldsymbol{r}_i -\boldsymbol{r}_i|} \frac{\boldsymbol{p}_i^2}{m_i^2 c^2} \right) + \frac{1}{2} \int \mathrm{d}^3 \boldsymbol{x} \int \mathrm{d}^3 \boldsymbol{y} \frac{\rho(\boldsymbol{x},t)\rho(\boldsymbol{y},t)}{4\pi\epsilon_0 |\boldsymbol{x}-\boldsymbol{y}|} - \int \mathrm{d}^3 \boldsymbol{x} \; \boldsymbol{j}(\boldsymbol{x},t)\boldsymbol{A}(\boldsymbol{x},t) + V_\mathrm{counterterm} .\] Although some individual terms are divergent, this overall expression for the energy remains finite and equivalent to the starting expression (4). 
Field theory ignores the counterterm, introduces a divergent self-mass concept first obtained by Dirac \[ m_i^\mathrm{self} = \frac{e_i^2}{4\pi\epsilon_0 |\boldsymbol{r}_i -\boldsymbol{r}_i| c^2} \] and reabsorbs the difference between the real mass \( m_i \) and the self-mass by introducing yet another concept of divergent mass, the bare mass \( m_i^\mathrm{bare} \), to get the field-theoretic expression \[ E_\mathrm{field} = \frac{1}{2} \sum_i \frac{\boldsymbol{p}_i^\mathrm{bare}\boldsymbol{p}_i^\mathrm{bare}}{m_i^\mathrm{bare}} + \frac{1}{2} \int \mathrm{d}^3 \boldsymbol{x} \int \mathrm{d}^3 \boldsymbol{y} \frac{\rho(\boldsymbol{x},t)\rho(\boldsymbol{y},t)}{4\pi\epsilon_0 |\boldsymbol{x}-\boldsymbol{y}|} - \int \mathrm{d}^3 \boldsymbol{x} \; \boldsymbol{j}(\boldsymbol{x},t)\boldsymbol{A}(\boldsymbol{x},t) ,\] which is not only infinite, but uses the unphysical concept of bare mass. Comparing the expressions (4) and (10), we can easily obtain the fundamental relation between the true Coulomb energy and the field-theoretic expression —for many applications \( \rho(\boldsymbol{z};t) \) <b>[3]</b> can be safely replaced by \( \rho(\boldsymbol{z},t) \) because the functional difference is harmless— \[ \frac{1}{2} \sum_i\sum_{j\neq i} \frac{e_i e_j}{4\pi\epsilon_0 |\boldsymbol{r}_i -\boldsymbol{r}_j|} = \frac{1}{2} \int \mathrm{d}^3 \boldsymbol{x} \int \mathrm{d}^3 \boldsymbol{y} \frac{\rho(\boldsymbol{x},t)\rho(\boldsymbol{y},t)}{4\pi\epsilon_0 |\boldsymbol{x}-\boldsymbol{y}|} - \frac{1}{2} \sum_i \frac{e_i^2}{4\pi\epsilon_0 |\boldsymbol{r}_i -\boldsymbol{r}_i|} \] or, using a more concise notation, \[ V_\mathrm{Coulomb} = V_\mathrm{field} + V_\mathrm{counterterm} .\] Of course the infinite counterterm exactly cancels the divergence within the field-theoretic expression, giving a finite Coulomb energy in agreement with experiments. <b>We have derived from the direct-particle interaction the field-theoretic interaction plus a renormalization counterterm</b>. 
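The cancellation expressed by the relation above can be illustrated numerically by regulating the divergent self-distance \(|\boldsymbol{r}_i - \boldsymbol{r}_i|\) with a small cutoff \(a\): the regulated field energy plus the regulated counterterm reproduce the finite Coulomb sum for every value of \(a\), so both pieces can diverge separately as \(a \to 0\) while their sum stays fixed. The units (\(1/4\pi\epsilon_0 = 1\)) and the charge data are my illustrative choices:

```python
# Regulate the divergent self-distance |r_i - r_i| with a cutoff a and check
# V_field(a) + V_counterterm(a) == V_Coulomb for every a, even though each
# piece diverges separately as a -> 0.
# Units 1/(4*pi*eps0) = 1; the charges and positions are illustrative.
import math

e = [1.0, -2.0, 0.5]
r = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
N = len(e)

def d(i, j):
    return math.dist(r[i], r[j])

def V_coulomb():
    """Finite direct-particle Coulomb sum (j != i)."""
    return 0.5 * sum(e[i] * e[j] / d(i, j)
                     for i in range(N) for j in range(N) if j != i)

def V_field(a):
    """Field-theoretic double sum with the self-distance regulated to a."""
    return 0.5 * sum(e[i] * e[j] / (d(i, j) if i != j else a)
                     for i in range(N) for j in range(N))

def V_counterterm(a):
    """Regulated counterterm: minus the (would-be divergent) self-energies."""
    return -0.5 * sum(e[i] ** 2 / a for i in range(N))

for a in (1.0, 1e-3, 1e-9):  # both pieces blow up as a -> 0, the sum does not
    assert abs(V_field(a) + V_counterterm(a) - V_coulomb()) < 1e-6
print("V_field + V_counterterm reproduces V_Coulomb for every cutoff")
```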
The existence of this counterterm is a physical consequence of the fact that the interaction between charged particles cannot be fully described in terms of interacting fields \( \rho(\boldsymbol{x},t)\phi(\boldsymbol{x},t) \). The origin of this impossibility can be traced back to the Coulomb interaction describing the physics of the correlation between particles, and this correlation is beyond a product of spacetime densities \( \rho(\boldsymbol{x},t)\rho(\boldsymbol{y},t) \). Precisely, the absence of correlations in the field-theoretic approach is the ultimate reason why a physical field is required to carry out the interaction between particles, whereas no field is needed in direct-particle-action theories. <br><br>The same line of reasoning can be applied in a quantum context, except that then one also obtains a renormalization of the electron charge and a final set of quantum electrodynamic expressions in terms of bare masses and bare charges, both unphysical and different from the real masses and charges. Using direct-particle gravitational interactions we can obtain the general-relativity expressions plus renormalization corrections. <br><br><br><h4>References and Notes</h4><br><b>[1]</b> The Quantum Theory of Fields, Volume I <b>1996:</b> <i>Cambridge University Press; Cambridge.</i> Weinberg, Steven. <br><br><b>[2]</b> Quantum Field Theory, Revised Edition <b>1999:</b> <i>John Wiley & Sons Ltd.; Chichester.</i> Mandl, F.; Shaw, G. <br><br><b>[3]</b> Sometimes the density is better written as \( \rho(\boldsymbol{z};t) \), with the semicolon indicating a time-implicit dependence through the particle positions \( \rho(\boldsymbol{z};t) = \rho(\boldsymbol{z},\{\boldsymbol{r}_i(t)\}) \). 
Juan Ramón González Álvarezhttps://plus.google.com/109139504091371132555noreply@blogger.comtag:blogger.com,1999:blog-3564535652306217326.post-37606612966250889802017-03-04T02:12:00.003-08:002017-03-13T03:18:54.107-07:00SymplecticsThe neologism <i>symplectics</i>, deriving from <i>symplektikos</i>, the Ancient Greek <i style="font-size:95%;" lang="grc">συμπλεκτικός</i> —a combination of <i style="font-size:95%;" lang="grc">συμ</i> «together» and <i style="font-size:95%;" lang="grc">πλεκτικός</i> «interlacing»— and literally meaning «braided together» or «intertwined», is the new term I have coined to designate a unified and self-consistent approach to understanding the Natural World. Symplectics is based on logic and measurements and can cover both microscopic and macroscopic scales, living and non-living matter, simple and complex phenomena, scientific and philosophical knowledge.Juan Ramón González Álvarezhttps://plus.google.com/109139504091371132555noreply@blogger.comtag:blogger.com,1999:blog-3564535652306217326.post-43231082077612153582017-02-28T04:07:00.001-08:002018-04-22T05:30:19.704-07:00AMD RyZen predictionsIt is that time of the year when I compare my predictions with reality. It worked well in the past, with typical errors in the single-digit percent range. Let us check if the trend continues for the predictions I made for RyZen, even though this is a brand-new microarchitecture instead of an evolution of an existing one. Note that most of my predictions are from the years 2014 and 2015. <br><br>You can find lots of false claims about my predictions on the Internet; you can even find a funny guy who pretends that I said that Zen could not pass 2GHz. I will make some remarks about those claims below; meanwhile, you can find a summary of my predictions in this <strike>post</strike> (the old forum is gone). A 2016 update on my prediction of the 8-core die size can be found in this other <strike>post</strike> (the old forum is gone). 
<br><br>I predicted that Ryzen --then known as Zen-- would be made on the 14LPP process at GlobalFoundries. I predicted Zen was going to be a small core with a size of "~5mm²" (without L2) on the 14LPP process. This was confirmed by AMD at the ISSCC talk this month. Zen measures 5.5mm². It is worth mentioning that the symbol "~" means "around" or "approximately equal to". Indeed, Zen is a small core because, as mentioned by Fottemberg in this <strike>post</strike> (the old forum is gone): "<i>Usually, a Small Core is a core about 5-10 mm2 or smaller, and a Big Core is a core about 20 mm2 o bigger</i>". The prediction of ~205mm² for the die size was also rather accurate, within 4% error, because the die measures 212.97mm² according to <a href="http://pc.watch.impress.co.jp/img/pcw/docs/1047/507/13.png">Hiroshige Goto</a>. <br><br>My SMT2 prediction was also confirmed. Zen has 2-way SMT for multithreading. I also predicted that the SMT gains on Zen would be larger than for Intel because the Zen microarchitecture is more distributed: separate integer and FP clusters, more execution pipes, and non-unified scheduling. This is not about having a better SMT implementation than Intel, but about Intel's microarchitecture extracting more ILP from a single thread and, thus, leaving fewer empty execution slots in the core ready for a second thread. I even offered the following example; it is oversimplified but enough to get my point. Imagine two cores A and B; both are 6-wide and both have the same SMT implementation; however, core A has deeper SS/OoO logic and can sustain an ILP of 4, whereas core B can only sustain an ILP of 3. <br><br>When executing a single thread, core A has more throughput --four vs three for core B-- because it can execute more instructions per cycle. But when executing a second thread, both cores have the same throughput of six instructions per cycle, because a second thread can fill the unused execution resources on both cores. 
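The oversimplified two-core example can be put in numbers. The width and sustained-ILP figures below are the ones from the example (6-wide cores, ILP of 4 vs 3); only the tiny model itself is mine:

```python
# The oversimplified SMT-vs-ILP example above in numbers: per-cycle throughput
# is the sum of the threads' sustained ILPs, capped by the core width.

WIDTH = 6  # both cores are 6-wide

def throughput(ilps):
    """Instructions per cycle for the given per-thread sustained ILPs."""
    return min(WIDTH, sum(ilps))

core_a_1t = throughput([4])      # deeper SS/OoO logic: ILP of 4
core_b_1t = throughput([3])      # shallower logic: ILP of 3
core_a_2t = throughput([4, 4])   # a second thread fills the empty slots
core_b_2t = throughput([3, 3])

print(core_a_1t, core_b_1t)  # 4 3: core A wins single-threaded
print(core_a_2t, core_b_2t)  # 6 6: both saturate, so SMT gains more on core B
```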
<br><br><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-O4PyWMXMztY/WLcU5vf3dfI/AAAAAAAAAec/DISwHGECzAodRR8eKSnH_0jRFUi8bUS-ACLcB/s1600/SMT%2Bvs%2BILP.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-O4PyWMXMztY/WLcU5vf3dfI/AAAAAAAAAec/DISwHGECzAodRR8eKSnH_0jRFUi8bUS-ACLcB/s600/SMT%2Bvs%2BILP.png" width="600" height="594" /></a></div><br><br>Other predictions for the core, like being 6-wide and having 2x128 FMAC units, were confirmed. However, I predicted (3 ALU + 3 AGU) for the integer part and Zen is (4 ALU + 2 AGU). As mentioned once by David Kanter in <a href="http://www.realworldtech.com/forum/?threadid=154302&curpostid=154302">RWT</a>, my choice was better: <blockquote>3 AGU + 3 ALU is a much better mix. Remember that x86 is load+op, so generally you want to sustain nearly a 1:1 ratio of memory to ALU operations.</blockquote> The prediction of 8-core dies for the CPUs of the AM4 socket, with higher core counts in servers obtained with combinations of the base die, was also confirmed. As far as I know, I am the only one who predicted a four-die configuration --8 core x 4 = 32 core-- for the top Naples chip. 
As you can check in the above links, I even offered the following illustration <br><br><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-ggnNDD6__KM/WLcVJACQgRI/AAAAAAAAAeg/d970Fp3eA5kdL12EpnwayNs9O1heNHQQwCLcB/s1600/MCM4_SoC.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-ggnNDD6__KM/WLcVJACQgRI/AAAAAAAAAeg/d970Fp3eA5kdL12EpnwayNs9O1heNHQQwCLcB/s600/MCM4_SoC.png" width="600" height="275" /></a></div><br><br>And this is a recent shot of the Naples CPU, now renamed EPYC, <br><br><div class="separator" style="clear: both; text-align: center;"><a href="https://pics.computerbase.de/7/8/7/9/3/article-630x354.c1be78f7.jpg" imageanchor="1" ><img border="0" src="https://pics.computerbase.de/7/8/7/9/3/article-630x354.c1be78f7.jpg" width="400" height="282" /></a></div><br><br>I also predicted a separate die for the AM4 APUs with four cores and an iGPU, and mentioned that AMD could be offering two different lines of four-core CPUs for this socket: one line from the 8-core die with half the cores disabled and another line from the APUs with the iGPU disabled; something similar to what AMD does now with the FX-4000 series and the Athlon series. <br><br>I also predicted Zen would come in four-core clusters. This is what AMD names the CCX. My proposal was based on a hypothesis about AMD wanting to increase the minimal core count to four, plus a hypothesis about AMD reusing the cluster for the APUs and for the semicustom division to reduce design costs with a modular approach. My prediction was that Zen would come in groups of CCXs with SMT either enabled or disabled. For instance, 4c/4t for the lowest Ryzen CPU and then 4c/8t and 8c/8t for the intermediate models, and 8c/16t for the flagship AM4 socket model. With servers coming in combinations of 8c/16t and 16c/32t for the dual-die socket (SP4), and 16c/32t and 32c/64t for the quad-die socket (SP3). 
<br><br>With this hypothesis in mind, I predicted that a six-core model would not exist. The R5 model would be 8c/8t, similar to how Intel i5s are 4c/4t in the desktop. Even some news sites reported early rumors about six-core Ryzen not existing, but this turned out to be an error. Six-core Ryzen exists! This is a bit weird. Why is Ryzen designed around four-core clusters if you are going to disable cores within each CCX individually? We know now that RyZen performance varies depending on which cores are disabled and which cores are active, because the weird design introduces a last-level cache partitioned into blocks with different access latencies. For instance, a 4+0 chip --that is, one CCX disabled in the die-- is not the same as a 3+1 chip or a 2+2 chip --two cores disabled per CCX--. Finally, AMD decided to sell all the CPUs with symmetric configurations: 3+3 for the six-core chips and 2+2 for the quad-core chips. This adds even more weirdness to the CCX design choice because, independently of which cores are damaged in the die, AMD always has to disable cores in pairs across both CCXs in the die. <br><br>My prediction of IPC was "~50% IPC over Piledriver on scalar code. ~80% IPC on SIMD code". This is easy to understand. I predicted 2x128 FMA units for Ryzen; Piledriver has a 2x128 FMA unit per module, which amounts to 128 bits per core. Thus Piledriver is an 8FLOP/core design, whereas Zen is 16FLOP/core. This is the maximum throughput. Average performance is less than the maximum, because not all the resources are duplicated. Recall that floating point codes tend to stress the memory subsystem much more, to the point that many supercomputers are able to hit high peaks on HPC applications while their sustained performance is much lower. This is why I predicted the floating point IPC in Zen would be about 80% better despite it having twice the FP peak performance. Regarding integer, Piledriver has (2 ALU + 2 AGU) per core. 
Recall that my hypothesis was that Ryzen would be (3 ALU + 3 AGU), which accounts for 50% more integer+memory resources than Piledriver. This was the basis for my "~50% IPC" claim where, evidently, I assumed that the rest of the resources (front-end, caches,...) would scale up accordingly to feed the extra ALUs and AGUs. <br><br>We know now that Ryzen is (4 ALU + 2 AGU); this means that the peak integer performance is better because Ryzen can execute up to four integer operations per cycle instead of the three operations I had envisioned; but, on the opposite side, Ryzen has one AGU less than I expected, which means it cannot feed the ALUs as well. The higher peak performance is compensated by the lower sustained performance, and the net result is that the performance of Ryzen is very close to what I expected: 50% and 80%. Indeed, AMD has pretty much confirmed my IPC predictions, with official claims that Zen is 52% faster than Piledriver on SPECint and 76% faster than Piledriver on Cinebench, both clock-for-clock. <br><br><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-9PHSRMsLkws/WLcXE4snmdI/AAAAAAAAAeo/k_H5LVlW4hwOc8PF0O_a4mn8zWyyB-ZCwCLcB/s1600/AMD%2BRyzen%2BTech%2BDay%2B-%2BLisa%2BSu%2BKeynote-32%2B%2528IPC%2BCinebench%2529.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-9PHSRMsLkws/WLcXE4snmdI/AAAAAAAAAeo/k_H5LVlW4hwOc8PF0O_a4mn8zWyyB-ZCwCLcB/s600/AMD%2BRyzen%2BTech%2BDay%2B-%2BLisa%2BSu%2BKeynote-32%2B%2528IPC%2BCinebench%2529.jpg" width="600" height="200" /></a></div><br><br>My estimations strongly suggested, as I put it dozens of times in many forums, that <blockquote>Zen IPC ~ Sandy Bridge <br>Zen IPC+SMT ~ Haswell</blockquote> Reviews have found that on average Ryzen IPC = 1.05 Sandy Bridge and IPC+SMT = 0.90 Haswell. 
Effectively, clock for clock, Broadwell is about 10% ahead of Ryzen on applications <div class="separator" style="clear: both; text-align: center;"><a href="http://www.hardware.fr/getgraphimg.php?id=438&n=1" imageanchor="1" ><img border="0" src="http://www.hardware.fr/getgraphimg.php?id=438&n=1" width="500" height="800" /></a></div><br><br>and about 20% ahead on games <br><br><div class="separator" style="clear: both; text-align: center;"><a href="http://www.hardware.fr/getgraphimg.php?id=448&n=1" imageanchor="1" ><img border="0" src="http://www.hardware.fr/getgraphimg.php?id=448&n=1" width="500" height="670" /></a></div><br><br>Let us now check clocks. I finally predicted 3.0GHz base and 3.5GHz turbo for a 95W 8-core Ryzen CPU, and 3.4GHz base and 3.9GHz turbo for a 65W 4-core CPU. Of course, the predictions are for the fastest models that fit within the thermal envelope. More astute readers can check that I was assuming a quadratic dependence of power on frequency in my computations. Let me remark that as early as 19/08/2014 I was mentioning "3.0--3.5GHz" for Zen, which makes even more ridiculous some claims that one can find on the Internet. The truth is that the top 65W 4-core model is the R5 1500X, which has 3.5GHz and 3.9GHz frequencies. The prediction was very accurate, with errors of 3% and 0% respectively. <br><br>Things are completely different for the 95W 8-core chips. The top model R7-1800X has 3.6GHz and 4.1GHz frequencies, which is a noticeable departure from the expected values. How can the same prediction be accurate for the 4-core chips but fail for the 8-core chips, when both use the same die and the same process node? We finally have an answer. If we take the R5-1500X as baseline, we can obtain the following approximate estimation of frequencies for an 8-core 95W chip <br><br>sqrt[ 95W / (2 x 65W) ] * 3.5GHz = 2.99GHz <br><br>sqrt[ 95W / (2 x 65W) ] * 3.9GHz = 3.33GHz <br><br>The only possibility to get higher clocks is if we increase the TDP beyond the marketing value of 95W. 
Indeed, if we invert the above formulas and use as input the frequencies of the R7-1800X, we can estimate the real TDP of this 8-core chip. The result is 141W as an average of the base and turbo estimations. Effectively, Ryzen reviews have demonstrated that the 95W is a marketing label and that the 1800X can dissipate even more power than the 125W-rated FX-8350 Piledriver chip; for instance, HFR measured 125W for the FX-8350 and 128.9W for the R7-1800X, which can be rounded to 129W. The 12W discrepancy between the measured 129W and the 141W estimated above can be easily explained by the fact that the number of active transistors in the R5-1500X is not strictly one-half of those in the R7-1800X, by binning (the R5 models use defective dies), and by small departures from a strict quadratic dependence between frequency and power. <br><br>Reviews have also found that the 65W rating for the R7 1700 is another marketing label. As CanardPC <a href="https://mobile.twitter.com/CPCHardware/status/843109717610287105/actions">noticed</a>: "<i>The 1700 actually draws 90W. AMD is bullshitting its TDP.</i>" Luckily for us, AMD finally reported the real TDPs for the 8-core chips: <blockquote>What are the TDPs, within the meaning of the consumption limit and therefore the maximum number of watts to be dissipated, of the Ryzen? AMD also communicates this value, but less markedly: 128 watts for the 1800X / 1700X, and 90 watts for the 1700. These are the values that are most comparable with the TDP communicated by Intel.</blockquote><br><br>This explains very well the discrepancies in the predictions for the 8-core chips. Effectively, the prediction of 3.0GHz base and 3.5GHz turbo for a 95W 8-core Ryzen CPU fits nicely with the clocks of the CPU with a real TDP of 90W: the R7-1700 model. Adjusting stock frequencies for 95W, we obtain a prediction error of 6%, which is excellent. <br><br>Finally, let us return to the fact that some people are reporting false claims about my predictions on the Internet. 
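For readers who want to reproduce the arithmetic, the quadratic power-frequency model used throughout these estimations fits in a few lines (a sketch; the R5-1500X baseline and the P ∝ n·f² assumption are the same ones used above):

```python
from math import sqrt

# Baseline: R5-1500X, 4 cores, 65W TDP, 3.5GHz base / 3.9GHz turbo.
CORES, TDP, BASE, TURBO = 4, 65.0, 3.5, 3.9

# Forward: frequencies of an 8-core chip inside a 95W envelope,
# assuming power scales as P ~ n_cores * f^2.
factor = sqrt(95.0 / (2 * TDP))
print(round(factor * BASE, 2), round(factor * TURBO, 2))  # 2.99 3.33

# Inverse: TDP implied by the R7-1800X clocks (3.6/4.1 GHz, 8 cores).
tdp_base = (3.6 / BASE) ** 2 * 2 * TDP
tdp_turbo = (4.1 / TURBO) ** 2 * 2 * TDP
print(round((tdp_base + tdp_turbo) / 2))                  # 141
```

The forward direction reproduces the 2.99GHz/3.33GHz estimation, and the inverse direction reproduces the ~141W real-TDP estimate for the R7-1800X.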
For instance, they pretend that I said that the IPC of Ryzen would be exactly Sandy Bridge, ignoring that I always used the symbol "~". Contrary to what you would think at first, their pretension does not follow from raw ignorance about the meaning of the symbol, because I explained to them what the symbol means. The real reason is that those pretended 'experts' use <i>ad hominem</i> tactics to try to hide how wrong their own predictions were. Indeed, if you research a bit you will find that those pretended 'experts' predicted CMT, a big core (~10mm²), quad-channel for AM4, 2x256 bit or 3x256 bit FMA units, a range of 4.0--4.5GHz base clocks for the 8-core 95W models and a superenthusiast 5GHz 8-core stock model, a TSMC 16nm process, optional DDR3, 12 cores or more for AM4, a 2016 launch, a separate 16-core die for servers, 8-core 3.7GHz/4.1GHz 65W models, that Zen was not a SoC, IPC well above BDW and on par with Kabylake, optional L3, less than 150mm² size for the 8-core die, and several dozen more failed predictions. <br><br>Their failed predictions are not limited to Zen. Those are the same 'experts' that predicted 3.2GHz or higher base clocks for the new Sony console on 14nm, HBM on the Carrizo APU, 16 CUs on the Bristol Ridge APU, that Nintendo Switch sales would be a flop, that no one would use the ARM ISA for servers/HPC, and so on and so on. I am still waiting for them to admit a single error; instead, they continue attacking people who got it right. 
It is also worth mentioning that these same people have a long record of being permanently banned from lots of tech sites.Juan Ramón González Álvarezhttps://plus.google.com/109139504091371132555noreply@blogger.comtag:blogger.com,1999:blog-3564535652306217326.post-50038548635675957012017-02-04T11:43:00.000-08:002017-02-22T02:08:02.735-08:00There are no negative heat capacities<link rel="icon" href="http://www.juanrga.com/favicon.ico?v=2" /> The existence of exotic systems with anomalous negative heat capacities has been claimed in the recent literature. Those negative heat capacities would be responsible for all kinds of peculiar thermodynamic behaviors, such as the temperature of the anomalous system being reduced when the system is heated. Close inspection shows that negative heat capacities have neither been measured nor do they exist. <br><br>Classic thermodynamic stability theory states that heat capacities have to be positive [1,2]. For a system in thermal contact with a larger system, the second-order virtual variation in entropy around an equilibrium state is \[ \delta^2 S = - \frac{C_V (\delta T)^2}{T^2} \lt 0 , \] with heat capacity at constant volume \[ C_V = \left( \frac{\partial E}{\partial T} \right)_V , \] and the negative sign required by stability implying that \( C_V \gt 0 \). Recall that the quantities used in thermodynamics are averages over microscopic realizations. For instance, the thermodynamic internal energy is given by \( E = \langle E^\mathrm{micr} \rangle \). <br><br>Statistical mechanics introduces a different proof of the positivity of heat capacities [3]. 
It starts from the expression for the energy average over a canonical ensemble \[ E = \langle E^\mathrm{micr} \rangle = \frac{\sum_i E_i^\mathrm{micr} \exp(-E_i^\mathrm{micr}/k_\mathrm{B}T)}{\sum_i \exp(-E_i^\mathrm{micr}/k_\mathrm{B}T)} , \] and obtains \[ \frac{\diff E}{\diff T} = \frac{1}{k_\mathrm{B}T^2} \bigg\langle \big(E_i^\mathrm{micr} - \langle E^\mathrm{micr} \rangle \big)^2 \bigg\rangle , \] which is clearly a positive quantity. Notice that the canonical ensemble does not include the volume as a variable, and the above total derivative is equivalent to the partial derivative that defines \( C_V \). <br><br>Part of the nanocluster community claims that small systems are not bound by the above macroscopic proofs, and that negative heat capacities exist at small scales. Michaelian and Santamaría-Holek show that these incorrect results (both experimental and theoretical) derive from two basic problems: (<b>i</b>) the system is non-ergodic or (<b>ii</b>) the model used to represent the system does not obey quantum laws [2]. <br><br>On the other side of the size spectrum, very large systems, we find astronomers claiming that the above macroscopic proofs are wrong and that heat capacity can be negative in gravitational systems [3]. Their starting point is the virial theorem, which relates kinetic energy \( K = \langle K^\mathrm{micr} \rangle \) and potential energy \( \Phi = \langle \Phi^\mathrm{micr} \rangle \) for an isolated gravitational system with energy \( E = K + \Phi \) \[ 2 K + \Phi = 0 . \] Using \( K = \frac{3}{2} N k_\mathrm{B} T \) yields \[ \frac{\diff E}{\diff T} = - \frac{\diff K}{\diff T} = - \frac{3}{2} N k_\mathrm{B} . \] Their 'proof' concludes by associating \( C_V \) with this negative quantity. 
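The canonical-ensemble identity above is easy to verify numerically with a toy discrete spectrum (illustrative levels and temperature, with \( k_\mathrm{B} = 1 \)):

```python
from math import exp

def canonical(levels, T):
    """Mean energy and energy variance in a canonical ensemble (k_B = 1)."""
    w = [exp(-E / T) for E in levels]
    Z = sum(w)
    mean = sum(E * p for E, p in zip(levels, w)) / Z
    var = sum((E - mean) ** 2 * p for E, p in zip(levels, w)) / Z
    return mean, var

levels = [0.0, 1.0, 2.5]   # arbitrary toy spectrum
T, h = 0.7, 1e-5
dEdT = (canonical(levels, T + h)[0] - canonical(levels, T - h)[0]) / (2 * h)
_, var = canonical(levels, T)

# dE/dT equals the energy variance over T^2 and is strictly positive.
print(abs(dEdT - var / T**2) < 1e-6)   # True
print(dEdT > 0)                        # True
```

The check confirms that the total derivative equals the (manifestly non-negative) energy fluctuation term; the astronomers' negative \( \diff E/\diff T \) only arises because there the volume, hidden in the potential energy, is allowed to vary with \( T \).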
Astronomers and nuclear physicists not only pretend that negative heat capacities are theoretically admissible, but they also claim those capacities are routinely measured in gravitational systems and nuclear clusters [3], by plotting \( E \) vs \( T \) and then finding regions where energy decreases when temperature increases; i.e., finding regions where \( (\diff E /\diff T) \lt 0 \). A more accurate description of the astronomers' claim is <blockquote>when heat is absorbed by a star, or star cluster, <b>it will expand</b> and cool down.</blockquote> I bolded the relevant part that disproves their claims. The problem is that the quantity \( (\diff E/\diff T) \) that they are measuring is not \( C_V \), because volume is not held constant during differentiation. If we write a detailed model of the energy, we find that the potential energy \( \Phi \) depends on volume through the interparticle distances \[ E(T,V) = K(T) + \Phi(V) = \frac{3}{2} N k_\mathrm{B} T + \Phi(V) . \] Now, using the definition of \( C_V \) given above, we find \[ C_V = \left( \frac{\partial E(T,V)}{\partial T} \right)_V = \frac{\diff K(T)}{\diff T} = \frac{3}{2} N k_\mathrm{B} \gt 0 . \] Therefore, what astronomers and others are really measuring is not the heat capacity \( C_V \) but the abstract quantity \[ \frac{\diff E}{\diff T} = C_V + \frac{\diff\Phi(V(T))}{\diff T} . \] The sign of this abstract quantity depends on the nature of the interactions. For systems satisfying the virial theorem this quantity is just <i>minus</i> the heat capacity, \( (\diff E/\diff T) = - C_V \). <br><br><br><h4>Acknowledgement</h4><br>I thank Prof. Karo Michaelian for further remarks. <br><br><br><h4>References</h4><br><b>[1]</b> Modern Thermodynamics <b>1998:</b> <i>John Wiley & Sons Ltd.; Chichester.</i> Kondepudi, D. K.; Prigogine, I. <br><br><b>[2]</b> Critical analysis of negative heat capacity in nanoclusters <b>2007:</b> <i>EPL, 79, 43001.</i> Michaelian K.; Santamaría-Holek I. 
<br><br><b>[3]</b> Negative Specific Heat in Astronomy, Physics and Chemistry <b>1998:</b> <i>arXiv:cond-mat/9812172 v1.</i> Lynden-Bell, D. Juan Ramón González Álvarezhttps://plus.google.com/109139504091371132555noreply@blogger.comtag:blogger.com,1999:blog-3564535652306217326.post-14393380137095867692017-01-03T09:50:00.001-08:002018-04-22T05:42:34.118-07:00State space evolution beyond mechanicsOur starting point will be the assumption that the state of our system (biological, physical, chemical, or otherwise) at a given time \( t \) is represented by a collection of \( D \) generic coordinates joined in a vector \( \mathbf{C}(t) = (C_1(t), C_2(t), \ldots, C_i(t), \ldots, C_D(t)) \). Note that this vector depends on time <i>implicitly</i>. <br><br>Next, we postulate the existence of a conserved property, named energy, as a function of the state variables \( E = E(\mathbf{C}(t)) \). Differentiating the energy yields \[ \frac{\diff E}{\diff t} = \sum_j \frac{\partial E}{\partial C_j} \frac{\diff C_j}{\diff t} , \] which provides an exact expression for the rate of change of the generic coordinate \( C_i \) \[ \frac{\diff C_i}{\diff t} = \left( \frac{\partial E}{\partial C_i} \right)^{-2} \frac{\diff E}{\diff t} \frac{\partial E}{\partial C_i} - \sum_{j\ne i} \frac{\diff C_j}{\diff t} \left( \frac{\partial E}{\partial C_i} \right)^{-1} \frac{\partial E}{\partial C_j} = \sum_j L_{ij} \frac{\partial E}{\partial C_j} . \] We can write the above equation in vector-matrix form \[ \frac{\diff \mathbf{C}}{\diff t} = \mathbf{L} \frac{\partial E}{\partial \mathbf{C}} = \mathbf{K} \mathbf{C} . \] This is a general equation for the deterministic evolution of any system whose state is given by a non-stochastic vector \( \mathbf{C} \). The scope of this equation of evolution is beyond mechanics because \( \mathbf{C} \) is not limited to the positions and velocities (or momenta) of particles. 
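As a minimal numerical sketch of the vector-matrix equation above (assuming a constant matrix \( \mathbf{K} \) and plain explicit Euler integration), a damped oscillator with \( \mathbf{C} = (q, p) \) illustrates that dissipative evolutions fit naturally in this framework:

```python
# Integrate dC/dt = K C with explicit Euler steps (illustrative sketch).
def evolve(K, C, dt=1e-3, steps=5000):
    n = len(C)
    for _ in range(steps):
        dC = [sum(K[i][j] * C[j] for j in range(n)) for i in range(n)]
        C = [C[i] + dt * dC[i] for i in range(n)]
    return C

# Damped harmonic oscillator: dq/dt = p, dp/dt = -q - 0.5 p.
K = [[0.0, 1.0],
     [-1.0, -0.5]]
q, p = evolve(K, [1.0, 0.0])
print(abs(q) < 0.5 and abs(p) < 0.5)   # True: the amplitude decays
```

The damping term drains energy, something a purely Hamiltonian evolution of \( (q, p) \) cannot describe.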
Note that even if we restrict the vector to \( \mathbf{C} = (\mathbf{p}, \mathbf{q}) \), the description is still more general than Hamiltonian mechanics because the equation above can deal with dissipative systems. <br><br><br><h4>Uncertainty and stability</h4><br>The above expressions are deterministic. To introduce fluctuations and uncertainty we seek a generalized equation \[ \frac{\diff \mathbf{C}}{\diff t} = \mathbf{K} \mathbf{C} + \mathbf{f} , \] with \( \mathbf{f} \) measuring the difference between the actual rate and the deterministic prediction <div class="separator" style="margin:2em 0;text-align:center;"><a href="https://4.bp.blogspot.com/-zkSDSl1FTc4/WHkqpYWbSVI/AAAAAAAAAaI/vU6lLEMWoicqc1MBZ0JSQHkhHWRmsjZmgCLcB/s1600/Fluctuations.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-zkSDSl1FTc4/WHkqpYWbSVI/AAAAAAAAAaI/vU6lLEMWoicqc1MBZ0JSQHkhHWRmsjZmgCLcB/s600/Fluctuations.png" /></a></div> <link rel="icon" href="http://www.juanrga.com/favicon.ico?v=2" />Juan Ramón González Álvarezhttps://plus.google.com/109139504091371132555noreply@blogger.comtag:blogger.com,1999:blog-3564535652306217326.post-65794854077457345952016-12-03T06:28:00.001-08:002017-06-02T04:09:49.542-07:00What is heat?<link rel="icon" href="http://www.juanrga.com/favicon.ico?v=2" /> Everyone has an intuitive conception of heat as something related to temperature, but a rigorous and broadly accepted scientific definition of heat is lacking despite several centuries of study. \( \newcommand{\dbar}{{{}^{-}\mkern-12.5mu \diff}} \) <br><br><br><h4>Energy transfer or state quantity?</h4><br>Callen defines heat as the variation in internal energy \( E \) that has not been caused by work \[ \dbar Q = \diff E - \dbar W . \] We find here the first anachronism. Heat is represented by an «<i>inexact differential</i>» (symbol \(\dbar \)) because heat is not a state function in the thermodynamic space. 
<br><br>Kondepudi & Prigogine suggest the alternative definition \[ \diff Q = \diff E - \diff W - \diff_\mathrm{matter} E . \] Not only is a new mechanism of interchange of energy introduced, associated to changes in the composition \( N \) produced by a mass flow with the surroundings, but exact differentials are used because the classical thermodynamic space has been extended with time as a variable. Their \( \diff Q \) has to be interpreted in the sense of \( \diff Q(t) \), albeit Kondepudi & Prigogine do not explain how the state space has to be extended. Do they mean \( (E,V,N,t) \) or \( (E(t),V(t),N(t)) \)? Something else? <br><br>Truesdell tries to abandon inexact differentials by just working with rates \[ \mathfrak{Q} = \dot{E} - \mathfrak{W} , \] here \( \mathfrak{Q} \) is what he calls «<i>heating</i>» and \( \mathfrak{W} \) the «<i>net working</i>»; the dot denotes a time derivative. But this does not solve anything, because the issue reappears when one wants to compute \( \diff E \) without being forced to use time as a variable. What is more, even using time, we would be carrying around expressions like \( \mathfrak{Q} \diff t\). <br><br>A similar inconsistency is found in the work of Müller & Weiss, when they write down the rate of change of energy of a body as the contribution of what they call «<i>heating</i>» \( \dot{Q} \) and «<i>working</i>» \( \dot{W} \). Again, this kind of notation is ambiguous and looks like the time derivatives of state quantities \( Q \) and \( W \) that do not really exist in their formalism. <br><br>On the opposite side we find Sohrab, who proposes to abandon inexact differentials by introducing a new concept of heat \( Q=TS \) as the product of temperature \( T \) and entropy; upon differentiation, \[ \diff Q = T \diff S + S \diff T . 
\] There are, however, issues with his approach because \( T \) and \( S \) cannot both be state variables at the same time, and the Gibbs & Duhem expression cannot be used here to get rid of the undesired differential, as in the traditional approach. The mixed quantity defined by Sohrab lives somewhere between the state spaces \( (S,V,N) \) and \( (T,V,N) \). <br><br>It is common to switch to a local formulation in terms of fluxes and densities in the irreversible formalisms. Callen <i>defines</i> a generic flow \( J_G \) through \( J_G = \diff G/\diff t \), with \( G \) being any extensive <i>state variable</i>. Callen then proposes a heat flux given by the internal energy flow \( J_E \) minus a chemical contribution weighted by the chemical potential \( \mu \) \[ J_Q = J_E - \mu J_N . \] This heat flux concept does not even match his generic definition of flow, because \( J_Q \) does not represent a flow of heat, but a <i>flow of energy</i>. Heat is not stored in system \( A \) before flowing to system \( B \) through a boundary; there is only energy flowing, and people calling part of that energy flow heat. In practice, authors act as if the terms heat and heat flux were interchangeable, which is as inconsistent as pretending that \( E \stackrel{wrong}{=} J_E \). If this was not enough, not everyone agrees with this definition; whereas Kondepudi & Prigogine add a molar entropy contribution \( s_m \) to the chemical potential \[ J_Q = J_E - (\mu + Ts_m) J_N , \] DeGroot & Mazur use a plain \[ J_Q = J_E . \] Not only do we find here three different definitions of the flux, but each one introduces fundamental changes to the concept of heat. 
For instance, in the formalism of DeGroot & Mazur, the rate of change of the heat per unit of volume \( q \) is exclusively due to flow through the boundaries \[ \frac{\diff q}{\diff t} = - \nabla \cdot J_Q , \] but the choice by Kondepudi & Prigogine forces us to modify this expression by adding a source term associated to the «<i>production of heat</i>» \[ \frac{\diff q}{\diff t} = - \nabla \cdot J_Q + \sigma^\mathrm{heat} . \] This source term contains different contributions, including what the authors call the «<i>heat of reaction</i>» generated by chemical reactions taking place inside the system. <br><br>The inconsistencies are obvious now. This confusion is amplified in the engineering literature, where the term «<i>heat transfer</i>» is used routinely. If heat is, as Callen emphasizes, «<i>only a form of energy transfer</i>» through the boundaries, then it makes no sense to talk about the production of heat inside a system, and the term heat transfer is a redundancy. If we consider that heat can be produced or absorbed inside the system, then heat cannot be exclusively identified with a mechanism of transfer of energy. Even if we consider heat only as a transfer of energy, and rework the existing thermodynamic formalisms to eliminate any heat source term from the equations, this does not completely eliminate the inconsistencies. This criticism also applies to myself, because I too contributed a definition of \( J_Q \) for open systems. I now retract that work. <br><br><br><h4>Relativistic heat</h4><br>We will now ignore the issues reported in the previous section. The question we want to address in this section is: what is the heat for a moving system if \( \dbar Q \) is the heat for a system at rest? 
<br><br>If you ask Planck, Einstein, von Laue, Pauli, or Tolman, the heat \( \dbar Q' \) for the moving system is given by \[ \dbar Q' = \frac{\dbar Q}{\gamma} , \] with \( \gamma \) the Lorentz factor, whereas Ott, Arzeliès, and Einstein (again) propose the alternative expression \[ \dbar Q' = \gamma \, \dbar Q . \] It is worth mentioning how Møller, in the first edition of his celebrated textbook on relativity, used the Planck expression, but replaced it with the Ott expression in later editions. More recently, Landsberg et al. introduced still another expression \[ \dbar Q' = \dbar Q . \] Thus, heat can decrease, increase, or be a Lorentz invariant depending on whom you ask. <br><br>Related to this, there are further discussions between authors who claim that relativistic heat is a scalar \( \dbar Q \) and those who claim that heat has to be generalized to a four-vector quantity \( \dbar Q^\mu \) for a proper relativistic treatment. <br><br>The conclusion for this section is the lack of consensus on what is the correct concept of heat to be used in a relativistic context, or on how this heat behaves under Lorentz transformations. <br><br><br><h4>Microscopic heat?</h4><br>Traditionally, heat has been relegated to the macroscopic classical domain; however, there has been increasing interest in recent decades in extending thermodynamic concepts to the mesoscopic and microscopic domains. We will ignore all the debate and issues reported in the former sections and will focus on answering what concept of heat at the microscale corresponds to the traditional expression \( \dbar Q \). <br><br>Most authors start from the statistical mechanics expression for the average internal energy of a system \[ \langle E \rangle = \mathrm{Tr} \{H\rho\} , \] with \( \mathrm{Tr} \) denoting a quantum trace or the classical phase space integration, \( H \) the Hamiltonian, and \( \rho \) the statistical operator or the classical phase space density representing mixed states. 
Differentiation of this expression gives \[ \diff \langle E \rangle = \mathrm{Tr} \{\rho \diff H\} + \mathrm{Tr} \{H \diff\rho\} , \] so macroscopic heat is identified with the second term \[ \dbar Q = \mathrm{Tr} \{H \diff\rho\} , \] which led some authors to take \( \{H \diff\rho\} \) as the «<i>microscopic definition</i>» of heat. This identification is open to debate. The first problem is that the definition is based on a density operator or phase space density that is associated to our ignorance about the microscopic state of the system. The standard literature claims that heat is related to changes in the probabilities of state occupation, but this claim is difficult to accept because it would suggest that heat varies with our level of knowledge about a system. Indeed, if we know the positions and velocities of the particles (e.g., in a computer simulation), then the phase space density is given by a product of Dirac delta functions \( \rho = \delta_D(\boldsymbol x- \boldsymbol x(t))\delta_D(\boldsymbol v- \boldsymbol v(t)) \) and it is easy to verify that \( H \diff\rho = 0 \) in this case. However, atoms do not care about our knowledge! <br><br>A second problem is that this «<i>microscopic definition</i>» is not microscopic at all, and would be better considered mesoscopic, because it combines microscopic elements, such as the Hamiltonian of a system of particles, with macroscopic elements, such as the parameters that define the Gibbsian ensembles; indeed, the thermodynamic temperature associated to the canonical ensemble is not a microscopic quantity. <br><br>Roldán, based on earlier work by Sekimoto, proposes an alternative expression for microscopic heat. 
He starts with Langevin dynamics \[ m \frac{\diff \boldsymbol v}{\diff t} = \boldsymbol F^\mathrm{syst} + \boldsymbol F^\mathrm{diss} + \boldsymbol F^\mathrm{rand} , \] then he associates heat with the dissipative and random components of the work \[ \dbar Q = ( \boldsymbol F^\mathrm{diss} + \boldsymbol F^\mathrm{rand} ) \diff \boldsymbol x , \] which after formal manipulations yields —typos and sign mistakes in his work are corrected here— \[ \dbar Q = \diff \left( \frac{1}{2} m \boldsymbol v^2 + \Phi^\mathrm{ext} \right) - \dbar W , \] with \( \Phi^\mathrm{ext} \) the external potential energy and his «<i>microscopic work</i>» being given by \[ \dbar W = \frac{\partial \Phi^\mathrm{ext}}{\partial\lambda} \diff \lambda . \] Roldán claims to «<i>recover the first law of thermodynamics in the microscopic scale</i>». This is not true. First, what he calls internal energy is not an internal energy, but the total energy of the system. Second, his definition of work is invalid. Work is not given by the variation of energy while the positions are held constant; it is impossible to do \( pV \) work on a system while keeping the positions of the particles intact, for instance. Finally, what he considers a microscopic approach is not microscopic at all, but mesoscopic; precisely, the dissipative and random forces in Langevin dynamics are obtained by averaging the microscopic forces over a heat bath distribution that describes the bath only in a macroscopic sense. <br><br><br><h4>Heat from first principles</h4><br>After this basic review of the difficulties and inconsistencies in the usual thermodynamic literature, our role will be to rigorously identify heat from a fundamental approach. We start with the mechanical expression for the internal energy \( E^\mathrm{micr} \) of a system and compute the infinitesimal variation \[ \diff E^\mathrm{micr} = \boldsymbol F^\mathrm{ext} \diff \boldsymbol x . 
\] This is a standard mechanical result, with \( \boldsymbol F^\mathrm{ext} \) the forces from the surroundings. Note that the macroscopic internal energy \( E \) used in thermodynamics corresponds to taking an average over the mechanical expression, \( E = \langle E^\mathrm{micr} \rangle \). <br><br>Now we will split the mechanical motions of the particles into two kinds of modes: a collective mode that produces changes associated with a parameter \( \lambda \) describing some property of the whole system, plus individual modes \( \boldsymbol s \) that describe changes in particle positions that are not measured by this parameter. We will take as parameter the volume \( V \) of the system; this choice is motivated by simplicity, and the generalization to other parameters is straightforward. The split is given by \[ \diff \boldsymbol x = \frac{\partial \boldsymbol x}{\partial V} \diff V + \frac{\partial \boldsymbol x}{\partial \boldsymbol s} \diff \boldsymbol s . \] Introducing this back into (20) yields \[ \diff E^\mathrm{micr} = -p^\mathrm{micr} \diff V + \boldsymbol F^\mathrm{ext} \frac{\partial \boldsymbol x}{\partial \boldsymbol s} \diff \boldsymbol s . \] This is still a purely mechanical expression. \( p^\mathrm{micr} = - \boldsymbol F^\mathrm{ext} {\partial \boldsymbol x}/{\partial V} \) is what some authors call the «<i>microscopic or instantaneous pressure</i>». The macroscopic pressure \( p \) used in thermodynamics is again given by an average, \( p = \langle p^\mathrm{micr} \rangle \). Since the first term in the above equation is a microscopic generalization of the \( pV \) work used in thermodynamics, we can associate the second term with a microscopic generalization of thermodynamic heat \[ \dbar Q^\mathrm{micr} = \boldsymbol F^\mathrm{ext} \frac{\partial \boldsymbol x}{\partial \boldsymbol s} \diff \boldsymbol s .
\] We recover here a concept of heat as changes in energy associated with modes of motion that do not produce changes in the mechanical parameters describing the system as a whole. Note that, contrary to conventional wisdom, heat here is not associated with ignorance; we can use a complete description of the atomic motion. We can obtain further expressions for the heat if we write explicit expressions for \( E^\mathrm{micr} \). The internal energy for a nonrelativistic system can be shown to be given by \[ E^\mathrm{micr} = C_V T^\mathrm{micr} + \Phi^\mathrm{micr} , \] with \( C_V \) being what thermodynamicists call the «<i>heat capacity</i>» at constant volume —an unfortunate name if one insists on considering heat only as a transfer of energy— and \( \Phi^\mathrm{micr} \) the interaction energy; this expression for the energy is exact, and \( T^\mathrm{micr} \), the instantaneous or microscopic temperature, should not be confused with the thermodynamic temperature, which is evidently given by \( T = \langle T^\mathrm{micr} \rangle \). <br><br>Differentiating the energy and using the split (21) we obtain for the <b>microscopic heat</b> \[ \dbar Q^\mathrm{micr} = C_V \diff T^\mathrm{micr} + \left[ \frac{\partial\Phi^\mathrm{micr}}{\partial V} + p^\mathrm{micr} \right] \diff V + \frac{\partial\Phi^\mathrm{micr}}{\partial \boldsymbol s} \diff \boldsymbol s .
\] We can now split each one of the microscopic quantities into an average term plus a deviation from the average; for instance, for the interaction energy \( \Phi^\mathrm{micr} = \langle \Phi^\mathrm{micr} \rangle + \delta \Phi^\mathrm{micr} = \Phi + \delta \Phi^\mathrm{micr} \); and use this to obtain an expression for the classic thermodynamic heat \( \, \dbar Q \) plus microscopic corrections \( \, \dbar (\delta Q^\mathrm{micr}) \) \[ \dbar Q = C_V \diff T + \left[ \frac{\diff\Phi}{\diff V} + p \right] \diff V , \] \[ \dbar (\delta Q^\mathrm{micr}) = C_V \diff (\delta T^\mathrm{micr}) + \left[ \frac{\partial(\delta\Phi^\mathrm{micr})}{\partial V} + \delta p^\mathrm{micr} \right] \diff V + \frac{\partial(\delta\Phi^\mathrm{micr})}{\partial \boldsymbol s} \diff \boldsymbol s . \] Note that the fact that the macroscopic average of the interaction energy does not depend on microscopic variables has been used to transform the partial derivative into a total derivative in (26). Taking the average of (24), we can write the macroscopic heat as \[ \dbar Q = C_V \diff T + \left[ \left( \frac{\partial E}{\partial V} \right)_T + p \right] \diff V , \] which is just the expression for the macroscopic heat found in the classical thermodynamic literature, the term within square brackets being what thermodynamicists call the «<i>latent heat</i>» —another unfortunate name— and denote by \( L_V \). Now let us compare our expression with those found in the literature. For instance, Lavenda gives in his study of the «<i>Microscopic Origins of the Carnot–Clapeyron Equation</i>» \[ L_V = \left( \frac{\partial\langle E^\mathrm{micr}\rangle_0}{\partial V} \right)_T - \left\langle \left( \frac{\partial E^\mathrm{micr}}{\partial V} \right)_T \right\rangle_0 . \] The subscript zero means that he uses «<i>unperturbed probabilities</i>» \( \pi_n^0 \), associated with a canonical distribution, to compute the averages.
Our expression \[ L_V = \left( \frac{\partial E}{\partial V} \right)_T + p = \left( \frac{\partial\langle E^\mathrm{micr}\rangle}{\partial V} \right)_T + p \] is not limited to ensemble averages and works beyond the scope of the canonical ensemble. Ignoring this detail, the real discrepancy lies in the second terms. (30) contains a general average of the microscopic pressure, \( p = \langle p^\mathrm{micr} \rangle \), whereas Lavenda uses the following unperturbed canonical average \[ \left\langle \left( \frac{\partial E^\mathrm{micr}}{\partial V} \right)_T \right\rangle_0 = \sum_n \pi_n^0 \left( \frac{\partial E_n^\mathrm{micr}}{\partial V} \right)_T .\] We find an inconsistency here, because the mechanical energy levels \( E_n^\mathrm{micr} \) do not depend functionally on the thermodynamic temperature —temperature is only a macroscopic parameter for the canonical ensemble—; this makes his partial derivative mathematically undefined and physically meaningless. <br><br><br><h4>Perspectives</h4><br>The concept of heat presented here has been derived from first principles. One assumption I have made is that the kinetic energy can be expressed as \( C_V T \); whereas this is exact in the non-relativistic domain, it remains to be evaluated whether this expression can be maintained in the relativistic regime —apart from residual \( mc^2 \) terms, of course—. I can guarantee something now, however: a four-component heat concept is unneeded. Thus, relativistic heat will be a scalar. <br><br>I have used inexact differential notation for the sake of familiarity with the standard thermodynamics literature. A way to avoid the term «<i>inexact differential</i>», together with the corresponding alternative notation, will be given in another part. <br><br><br><h4>Acknowledgement</h4><br>I thank Prof. Bernard H. Lavenda for interesting discussions.
<br><br><br><h4>References</h4><br>Thermodynamics and an Introduction to Thermostatistics; Second Edition <b>1985:</b> <i>John Wiley & Sons Inc.; New York.</i> Callen, Herbert B. <br><br>Modern Thermodynamics <b>1998:</b> <i>John Wiley & Sons Ltd.; Chichester.</i> Kondepudi, D. K.; Prigogine, I. <br><br>Rational Thermodynamics <b>1968:</b> <i>McGraw-Hill Book Company; New York.</i> Truesdell, C. <br><br>On a Scale-Invariant Model of Statistical Mechanics and the Laws of Thermodynamics <b>2016:</b> <i>ASME. J. Energy Resour. Technol. 138(3): 032002-032002-12.</i> Sohrab, S. H. <br><br>Non-equilibrium thermodynamics <b>1984:</b> <i>Courier Dover Publications, Inc.; New York.</i> DeGroot, Sybren Ruurds; Mazur, Peter. <br><br>Thermodynamics of irreversible processes – past and present <b>2012:</b> <i>Eur. Phys. J. H, 37, 139-236.</i> Müller, Ingo; Weiss, Wolf. <br><br>Irreversibility and dissipation in microscopic systems – Doctoral Thesis <b>2013:</b> <i>Universidad Complutense de Madrid, Facultad de Ciencias Físicas, Departamento de Física Atómica, Molecular y Nuclear.</i> Roldán, Édgar. <br><br>A New Perspective on Thermodynamics <b>2010:</b> <i>Springer; New York.</i> Lavenda, Bernard H. <br><br><br><h4>Instantaneous electromagnetic interactions</h4><br>Newton introduced a model of instantaneous direct interactions among massive particles. This model was later replicated by Coulomb for charged particles and became known as action-at-a-distance; an unfortunate name that has generated unending polemics among physicists and philosophers. A better name is direct-particle-interaction.
Maxwell electrodynamics and general relativity introduced an alternative model of contact-action, where particles do not interact directly but by means of signals traveling through a mediator. Since the maximum possible speed for any object is the speed of light, interactions are retarded in contact-action models. Although the Newton and Coulomb models are valid only for low velocities, it can be proven that retarded interactions are derivable as approximations to generalized instantaneous interactions that go beyond Newton and Coulomb. <br><br><br><h4>Simple derivation of Lienard & Wiechert potentials</h4><br>We will start with the following <i>instantaneous</i> scalar and vector potentials associated with a charge \( e \) placed at \( \boldsymbol{r} \) \[ \phi = \phi(\boldsymbol{x}, t) = \frac{1}{4\pi\epsilon_0} \frac{e}{|\boldsymbol{x} - \boldsymbol{r}(t)|} \] \[ \boldsymbol{A} = \boldsymbol{A}(\boldsymbol{x}, t) = \frac{1}{4\pi\epsilon_0c} \frac{e\boldsymbol{v}(t)}{|\boldsymbol{x} - \boldsymbol{r}(t)|} . \] Expanding both the position and the velocity of the charge around some earlier time \( t_0 \) \[ \boldsymbol{r}(t) = \boldsymbol{r}(t_0) + \boldsymbol{v}(t_0) (t-t_0) + \boldsymbol{a}(t_0) (t-t_0)^2 / 2 + \cdots \] \[ \boldsymbol{v}(t) = \boldsymbol{v}(t_0) + \boldsymbol{a}(t_0) (t-t_0) + \cdots \] and assuming that the charge is not accelerating at the initial time \( t_0 = (t - |\boldsymbol{x} - \boldsymbol{r}(t_0)|/c) \), we obtain the retarded potentials \[ \phi^\mathrm{ret} = \frac{1}{4\pi\epsilon_0} \frac{e}{|\boldsymbol{x} - \boldsymbol{r}(t_0)| - \boldsymbol{v}(t_0) \cdot [ \boldsymbol{x} - \boldsymbol{r}(t_0) ] / c} , \] \[ \boldsymbol{A}^\mathrm{ret} = \frac{1}{4\pi\epsilon_0c} \frac{e \boldsymbol{v}(t_0)}{|\boldsymbol{x} - \boldsymbol{r}(t_0)| - \boldsymbol{v}(t_0) \cdot [ \boldsymbol{x} - \boldsymbol{r}(t_0) ] / c} , \] which are evidently the Lienard & Wiechert potentials found in any textbook.
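The first-order character of this derivation can be checked numerically. The sketch below is my own check (units where \( c = 1 \) and \( e/4\pi\epsilon_0 = 1 \), a uniformly moving charge so that \( \boldsymbol{a}(t_0) = 0 \) holds exactly): it compares the instantaneous potential (1) with the Lienard & Wiechert form (5), and the two agree to first order in the velocity, the discrepancy scaling as \( v^2 \).

```python
import math

def norm(a):
    return math.sqrt(sum(ai * ai for ai in a))

def potentials(v, x=(1.0, 0.5, 0.0), t=0.3):
    """Instantaneous vs. Lienard & Wiechert potential for a charge in
    uniform motion r(t) = v*t (units: c = 1 and e/4*pi*eps_0 = 1)."""
    # Instantaneous potential (1): phi = 1/|x - r(t)|
    phi_inst = 1.0 / norm(tuple(xi - vi * t for xi, vi in zip(x, v)))

    # Retarded time t0 = t - |x - r(t0)|, by fixed-point iteration
    t0 = t - norm(x)
    for _ in range(200):
        t0 = t - norm(tuple(xi - vi * t0 for xi, vi in zip(x, v)))

    # Lienard & Wiechert (5): phi = 1/(R - v.R_vec), R_vec = x - r(t0)
    R_vec = tuple(xi - vi * t0 for xi, vi in zip(x, v))
    phi_lw = 1.0 / (norm(R_vec) - sum(vi * Ri for vi, Ri in zip(v, R_vec)))
    return phi_inst, phi_lw

# Doubling v roughly quadruples the discrepancy: first-order agreement
p1 = potentials((0.01, 0.0, 0.0))
p2 = potentials((0.02, 0.0, 0.0))
e1, e2 = abs(p1[0] - p1[1]), abs(p2[0] - p2[1])
print(e2 / e1)  # ~4, i.e. the difference scales as v^2
```

The fixed-point iteration converges because the map contracts with factor \( |v| < 1 \) (in units \( c = 1 \)).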
<br><br>Note that the Lienard & Wiechert potentials have been derived under the approximation \( \boldsymbol{a}(t_0) = 0 \). This means that the Lienard & Wiechert potentials are not complete and cannot describe the general motion of charged particles. This deficiency of the Lienard & Wiechert potentials explains why physicists have traditionally added <i>ad hoc</i> reaction-radiation force terms to the field-theoretic equations of motion, to cure issues such as the net loss of energy in systems of accelerating charged particles. Those <i>ad hoc</i> modifications of the equations follow from a correction to the Lienard & Wiechert potentials \[ \phi^\mathrm{ret} \rightarrow \phi^\mathrm{ret} + \phi^\mathrm{rr} , \] \[ \boldsymbol{A}^\mathrm{ret} \rightarrow \boldsymbol{A}^\mathrm{ret} + \boldsymbol{A}^\mathrm{rr} . \] Not only can those correction potentials \( \phi^\mathrm{rr}\) and \( \boldsymbol{A}^\mathrm{rr} \) not be derived from the field-theoretic electromagnetic Lagrangian or action —which gives only the Lienard & Wiechert potentials—, but they are postulated in a truly inconsistent approach: the advanced solutions to the wave equations are initially rejected —"<i>and this unphysical solution to the wave equation is known as the advanced solution</i>" <b>[1]</b>— only to be used later as part of the definition of the correction potentials \[ \phi^\mathrm{rr} \stackrel{\mathrm{def}}{=} \frac{\phi^\mathrm{ret} - \phi^\mathrm{adv}}{2} \] \[ \boldsymbol{A}^\mathrm{rr} \stackrel{\mathrm{def}}{=} \frac{\boldsymbol{A}^\mathrm{ret} - \boldsymbol{A}^\mathrm{adv}}{2} . \] If the retarded solutions are the "<i>only physically acceptable solution</i>" <b>[1]</b>, how is it possible that they have to be amended by the so-called unphysical solutions? The conventional wisdom does not make any sense.
Moreover, even if we ignore those inconsistencies, the resulting equations of motion with reaction-radiation corrections are still subject to criticism due to odd behaviors and properties. On the other hand, the potentials (1) and (2) conserve energy, maintain causality, and do not require <i>ad hoc</i> reaction-radiation corrections. <br><br>Note that the \( \boldsymbol{a}(t_0) = 0 \) approximation also explains why the Lienard & Wiechert potentials for a moving charge can be obtained from the potentials for a charge at rest, \( \boldsymbol{A}=0 \) and \( \phi \), by applying the Lorentz transformations between a frame \( S' \) where the particle is at rest and a frame \( S \) where the particle is moving with velocity \( \boldsymbol{v} \). The Lorentz transformations can be applied because both frames are inertial, that is, the particle is not accelerating. In fact, some derivations of the Lienard & Wiechert potentials explicitly admit that the charge is moving "<i>with uniform velocity \( \boldsymbol{v} \) through a frame \( S \)</i>". <br><br>This same approximation is the reason why quantum field theory admits as the only physically admissible states those of free particles, that is, particles that are not accelerating.
This is picturesquely described in Feynman diagrams <div class="separator" style="margin:2em 0;text-align:center;"><a href="https://1.bp.blogspot.com/-HSJrRELFH8A/V5JbX4_dQfI/AAAAAAAAAF0/42MqYiFQRigAluVe1_FTWsvEZg-Y4gI_ACLcB/s1600/feynman_diagram.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-HSJrRELFH8A/V5JbX4_dQfI/AAAAAAAAAF0/42MqYiFQRigAluVe1_FTWsvEZg-Y4gI_ACLcB/s400/feynman_diagram.jpg" width="400" height="310" /></a></div> The diagram considers electrons initially in free motion, states <b>1</b> and <b>2</b>, until a photon is emitted and absorbed, states <b>6</b> and <b>5</b>, and, as a consequence, both electrons change their state of motion to the new free states <b>3</b> and <b>4</b>. Quantum field theory can only rigorously describe the states <i>before</i> and <i>after</i> the interaction, <b>1</b>, <b>2</b>, <b>3</b>, and <b>4</b>, respectively; it cannot provide a description of what happens <i>during the interaction</i>, i.e. states <b>5</b> and <b>6</b>. <br><br><br><h4>Rigorous relation between instantaneous and retarded interactions</h4><br>The instantaneous potentials (1) and (2) can be written in integral form as \[ \phi = \frac{1}{4\pi\epsilon_0} \int \frac{\rho(\boldsymbol{y}, t)}{|\boldsymbol{x} - \boldsymbol{y}|} \mathrm{d}\boldsymbol{y} \] \[ \boldsymbol{A} = \frac{1}{4\pi\epsilon_0 c} \int \frac{\boldsymbol{j}(\boldsymbol{y}, t)}{|\boldsymbol{x} - \boldsymbol{y}|} \mathrm{d}\boldsymbol{y} , \] using the electron density \( \rho(\boldsymbol{y}, t) = e\delta(\boldsymbol{y} - \boldsymbol{r}(t)) \) and current \( \boldsymbol{j}(\boldsymbol{y},t) = \boldsymbol{v}(t) \rho(\boldsymbol{y},t) \).
Now we can use the equations of motion to relate densities and currents at times \( t \) and \( t_0 \) \[ \phi = \frac{1}{4\pi\epsilon_0} \int \exp \Big[ L(t-t_0) \Big] \frac{\rho(\boldsymbol{y}, t_0)}{|\boldsymbol{x} - \boldsymbol{y}|} \mathrm{d}\boldsymbol{y} \] \[ \boldsymbol{A} = \frac{1}{4\pi\epsilon_0 c} \int \exp \Big[ L(t-t_0) \Big] \frac{\boldsymbol{j}(\boldsymbol{y}, t_0)}{|\boldsymbol{x} - \boldsymbol{y}|} \mathrm{d}\boldsymbol{y} . \] \( L \) in the above expressions is the Liouvillian. We can now see clearly that a kernel evaluated at the present time \( t \) is identical to a modified kernel evaluated at the retarded time \( t_0 \) \[ \left\{ \frac{1}{|\boldsymbol{x} - \boldsymbol{y}|}\right\}_t = \left\{ \exp[L(t-t_0)] \frac{1}{|\boldsymbol{x} - \boldsymbol{y}|}\right\}_{t_0} .\] If we approximate the full Liouvillian by its free component \( L^\mathrm{free} = - \boldsymbol{v} \cdot \boldsymbol{\nabla_y} \) and integrate the expressions \[ \phi^\mathrm{ret} = \frac{1}{4\pi\epsilon_0} \int \exp \Big[ L^\mathrm{free}(t-t_0) \Big] \frac{\rho(\boldsymbol{y}, t_0)}{|\boldsymbol{x} - \boldsymbol{y}|} \mathrm{d}\boldsymbol{y} \] \[ \boldsymbol{A}^\mathrm{ret} = \frac{1}{4\pi\epsilon_0 c} \int \exp \Big[ L^\mathrm{free}(t-t_0) \Big] \frac{\boldsymbol{j}(\boldsymbol{y}, t_0)}{|\boldsymbol{x} - \boldsymbol{y}|} \mathrm{d}\boldsymbol{y} , \] we recover the Lienard & Wiechert potentials (5) and (6). <br><br>In the previous section, we derived the Lienard & Wiechert potentials by neglecting acceleration and higher-order terms in a series expansion of positions and velocities around an initial time. We can now confirm that the Lienard & Wiechert potentials are an exact consequence of the <i>free component of the generator of the motion</i> for charged particles; of course, this free component describes only inertial particles.
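The property used in this step —that \( \exp[L^\mathrm{free}\tau] \) with \( L^\mathrm{free} = - \boldsymbol{v} \boldsymbol{\nabla_y} \) translates its argument by \( -\boldsymbol{v}\tau \)— can be verified directly. A minimal one-dimensional sketch of mine, using a cubic test function so that the exponential series terminates exactly:

```python
import math

# exp(L_free * tau) f(y), with L_free = -v d/dy, is the Taylor shift
# sum_k (-v*tau)^k f^(k)(y) / k!  =  f(y - v*tau).
# For f(y) = y^3 the series terminates after four terms, so the check
# is exact up to floating-point round-off.
def shifted_by_series(y, v, tau):
    a = -v * tau
    derivs = [y**3, 3 * y**2, 6 * y, 6.0]   # f, f', f'', f'''
    return sum(a**k * d / math.factorial(k) for k, d in enumerate(derivs))

y, v, tau = 1.7, 0.4, 2.5
lhs = shifted_by_series(y, v, tau)  # exp(L_free tau) f evaluated at y
rhs = (y - v * tau) ** 3            # f at the shifted argument
print(abs(lhs - rhs))  # ~0
```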
The interaction Liouvillian \( L^\mathrm{inter} \) introduces acceleration and other higher-order corrections to the Lienard & Wiechert potentials \[ \phi = \phi^\mathrm{ret} + \mathcal{O}(L^\mathrm{inter}) \] \[ \boldsymbol{A} = \boldsymbol{A}^\mathrm{ret} + \mathcal{O}(L^\mathrm{inter}) . \] Those corrections are responsible for preserving causality and conserving energy in (13) and (14). <br><br><br><h4>A subtle mistake has remained unnoticed in the literature</h4><br>In this section we will deal only with the scalar potential, without any loss of generality, because the extension of the arguments and techniques to the vector potential is straightforward. The standard electromagnetic literature proposes the following integral expression as the retarded solution to the wave equation \[ \phi^\mathrm{ret} = \frac{1}{4\pi\epsilon_0} \int\int \rho(\boldsymbol{y}, t') \frac{\delta(t-t'-|\boldsymbol{x}-\boldsymbol{y}|/c)}{|\boldsymbol{x}-\boldsymbol{y}|} \mathrm{d}\boldsymbol{y} \, \mathrm{d}t' . \] If we integrate first over position and then over time, we obtain the Lienard & Wiechert potential (5). If instead we integrate first over time, we obtain (16) \[ \phi^\mathrm{ret} = \frac{1}{4\pi\epsilon_0} \int \exp \Big[ L^\mathrm{free}(t-t_0) \Big] \frac{\rho(\boldsymbol{y}, t_0)}{|\boldsymbol{x} - \boldsymbol{y}|} \mathrm{d}\boldsymbol{y} .\] However, the standard literature gives the wrong result \[ \phi^\mathrm{ret} \stackrel{\mathrm{wrong}}{=} \frac{1}{4\pi\epsilon_0} \int\frac{\rho(\boldsymbol{y}, t_0)}{|\boldsymbol{x} - \boldsymbol{y}|} \mathrm{d}\boldsymbol{y} . \] This discrepancy between (21) and (22) has its origin in subtle functional dependences in the time integration. The correct integration of (20) is as follows. First we let \( s \equiv (t' + |\boldsymbol{x}-\boldsymbol{y}|/c - t) \) be the new variable of integration.
We have \( \mathrm{d}s / \mathrm{d}t' = 1 + (\mathrm{d}|\boldsymbol{x}-\boldsymbol{y}|/c\,\mathrm{d}t') \) and \[ \phi^\mathrm{ret} = \frac{1}{4\pi\epsilon_0} \int\int \Big( \frac{\mathrm{d}s}{\mathrm{d}t'} \Big)^{-1} \rho(\boldsymbol{y}, t') \frac{\delta(s)}{|\boldsymbol{x}-\boldsymbol{y}|} \mathrm{d}\boldsymbol{y} \, \mathrm{d}s .\] Much care has to be taken when evaluating the term \( (\mathrm{d}s / \mathrm{d}t') \). The standard literature <i>assumes</i> that \( |\boldsymbol{x}-\boldsymbol{y}| \) does not depend on the time \(t'\), sets \( (\mathrm{d}s / \mathrm{d}t') = 1 \), and integrates over \(s\) to yield the incorrect expression (22). The subtle issue is that this assumption is valid only for points \( \boldsymbol{y} \) outside the path of the particle, \( \boldsymbol{y} \neq \boldsymbol{r}(t') \), but in this trivial case the retarded potential is identically zero, \(\phi^\mathrm{ret} = 0\), by virtue of the delta function in the density \( \rho(\boldsymbol{y},t') = e\delta(\boldsymbol{y} - \boldsymbol{r}(t')) \). Within the particle path, the term \( |\boldsymbol{x}-\boldsymbol{y}| \) depends <i>implicitly</i> on the time \(t'\) via the density, because \( \boldsymbol{y} = \boldsymbol{r}(t') \). When this implicit time dependence is considered in \( (\mathrm{d}s / \mathrm{d}t') \), we obtain the correct expression (21), in full agreement with the mechanical result.
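Behind the Jacobian factor is the general composition rule for the delta function, \( \int f(t')\, \delta(g(t'))\, \mathrm{d}t' = f(t^*)/|g'(t^*)| \) at the root \( g(t^*) = 0 \); taking \( \mathrm{d}s/\mathrm{d}t' = 1 \) amounts to silently dropping \( |g'(t^*)| \). A numerical sketch of mine, with the delta function approximated by a narrow Gaussian:

```python
import math

# Composition rule: int f(t) delta(g(t)) dt = f(t*) / |g'(t*)| at the
# root g(t*) = 0. Treating the Jacobian as 1 gives the wrong answer
# whenever g'(t*) != 1.
def delta(s, eps=1e-4):
    # narrow Gaussian surrogate for the Dirac delta
    return math.exp(-s * s / (2 * eps * eps)) / (eps * math.sqrt(2 * math.pi))

f = lambda t: t * t + 1.0
g = lambda t: 2.0 * t - 1.0          # root t* = 0.5, with g'(t*) = 2

# midpoint Riemann sum of f(t) * delta(g(t)) over a window around t*
n, lo, hi = 200_000, 0.0, 1.0
h = (hi - lo) / n
integral = h * sum(f(lo + (i + 0.5) * h) * delta(g(lo + (i + 0.5) * h))
                   for i in range(n))

with_jacobian = f(0.5) / abs(2.0)    # correct value: 0.625
without_jacobian = f(0.5)            # the "ds/dt' = 1" value: 1.25
print(integral)  # ~0.625: the Jacobian factor is mandatory
```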
The same arguments and methods can be used to demonstrate that the correct expression for the vector potential \(\boldsymbol{A}^\mathrm{ret}\) is \[ \boldsymbol{A}^\mathrm{ret} = \frac{1}{4\pi\epsilon_0c} \int\exp \Big[ L^\mathrm{free}(t-t_0) \Big] \frac{\boldsymbol{j}(\boldsymbol{y}, t_0)}{|\boldsymbol{x} - \boldsymbol{y}|} \mathrm{d}\boldsymbol{y} \] and not the incorrect expression \[ \boldsymbol{A}^\mathrm{ret} \stackrel{\mathrm{wrong}}{=} \frac{1}{4\pi\epsilon_0c} \int\frac{\boldsymbol{j}(\boldsymbol{y}, t_0)}{|\boldsymbol{x} - \boldsymbol{y}|} \mathrm{d}\boldsymbol{y} .\] <br><h4>Quantum electrodynamics potentials</h4><br>We take the expressions (16) and (17), which produce the Lienard & Wiechert potentials of classical electrodynamics, and replace the classical densities, currents, and Liouvillians by their quantum analogs. Using the quantum identity between the Liouvillian and the Hamiltonian, \( \exp [ L^\mathrm{free}\tau ] F = \exp [ iH^\mathrm{free}\tau/\hbar ] F \exp [ -iH^\mathrm{free}\tau/\hbar ] \) for arbitrary \(\tau\) and \(F\), we obtain the following quantum potentials \[ \phi^\mathrm{ret} = \frac{1}{4\pi\epsilon_0} \int \exp \Big[ iH^\mathrm{free}(t-t_0)/\hbar \Big] \frac{\rho(\boldsymbol{y}, t_0)}{|\boldsymbol{x} - \boldsymbol{y}|} \exp \Big[ -iH^\mathrm{free}(t-t_0)/\hbar \Big] \mathrm{d}\boldsymbol{y} \] \[ \boldsymbol{A}^\mathrm{ret} = \frac{1}{4\pi\epsilon_0 c} \int \exp \Big[ iH^\mathrm{free}(t-t_0)/\hbar \Big] \frac{\boldsymbol{j}(\boldsymbol{y}, t_0)}{|\boldsymbol{x} - \boldsymbol{y}|} \exp \Big[ -iH^\mathrm{free}(t-t_0)/\hbar \Big] \mathrm{d}\boldsymbol{y} . \] Quantum densities and currents are given by \( \rho = e \sum_i \sum_j c_i^{*} c_j u_i^\dagger u_j \) and \( \boldsymbol{j} = e \sum_i \sum_j c_i^{*} c_j u_i^\dagger c\boldsymbol{\alpha} u_j \), respectively. Here \( u_k = u_k(\boldsymbol{y}) \) are solutions to the time-independent free Dirac equation, \( H^\mathrm{free} u_k = E_k u_k \).
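The Liouvillian–Hamiltonian identity invoked in this step can be verified on finite matrices. A sketch of mine (ħ = 1, with a random 3×3 Hermitian matrix standing in for \( H^\mathrm{free} \)), comparing the conjugation \( \exp[iH\tau] F \exp[-iH\tau] \) with a direct first-order integration of the Heisenberg equation \( \mathrm{d}F/\mathrm{d}\tau = i[H, F] \), which is what \( \exp[L\tau] \) generates:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
H = (A + A.conj().T) / 2            # Hermitian stand-in for H_free (hbar = 1)
F = rng.standard_normal((3, 3)).astype(complex)  # arbitrary operator
tau = 0.7

# exp(+i H tau) via eigendecomposition of the Hermitian matrix H
w, V = np.linalg.eigh(H)
U = V @ np.diag(np.exp(1j * w * tau)) @ V.conj().T

# Left side of the identity: the conjugated operator
F_conj = U @ F @ U.conj().T

# Right side: exp(L tau) F from dF/dt = i [H, F], first-order Euler steps
n = 50_000
dt = tau / n
F_ode = F.copy()
for _ in range(n):
    F_ode = F_ode + 1j * (H @ F_ode - F_ode @ H) * dt

print(np.max(np.abs(F_conj - F_ode)))  # small: the two sides agree
```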
Replacing the time delay \( t - t_0 \) by its value \( |\boldsymbol{x} - \boldsymbol{y}|/c \) and expressing the energy differences as \( E_i - E_j = \hbar w_{ij} \), we obtain \[ \phi^\mathrm{ret} = \frac{1}{4\pi\epsilon_0} e \sum_i \sum_j c_i^{*} c_j \int u_i^\dagger \frac{1}{|\boldsymbol{x} - \boldsymbol{y}|} \exp \Big[ iw_{ij}|\boldsymbol{x} - \boldsymbol{y}|/ c \Big] u_j \mathrm{d}\boldsymbol{y} \] \[ \boldsymbol{A}^\mathrm{ret} = \frac{1}{4\pi\epsilon_0} e \sum_i \sum_j c_i^{*} c_j \int u_i^\dagger \frac{ \boldsymbol{\alpha} }{|\boldsymbol{x} - \boldsymbol{y}|} \exp \Big[ iw_{ij}|\boldsymbol{x} - \boldsymbol{y}|/ c \Big] u_j \mathrm{d}\boldsymbol{y} , \] which is just the quantum electrodynamics result, albeit those expressions are not usually found in the literature. To make contact with the usual expressions in the literature, we need to evaluate the interaction energy \( V_{ee} = \int ( \rho \phi^\mathrm{ret} - \boldsymbol{j} \boldsymbol{A}^\mathrm{ret} ) \, \mathrm{d}\boldsymbol{x} \), \[ V_{ee} = \frac{1}{4\pi\epsilon_0} e^2 \sum_i \sum_j \sum_k \sum_l c_i^{*} c_k^{*} c_j c_l \int\int u_i^\dagger u_k^\dagger \frac{ 1 - \boldsymbol{\alpha}\boldsymbol{\alpha} }{|\boldsymbol{x} - \boldsymbol{y}|} \exp \Big[ iw_{ij}|\boldsymbol{x} - \boldsymbol{y}|/ c \Big] u_j u_l \mathrm{d}\boldsymbol{x}\mathrm{d}\boldsymbol{y} . \] This is the standard result found in the quantum electrodynamics literature, with the associated one-photon operator in the covariant gauge \[ V^\mathrm{\gamma CG} = \frac{1}{4\pi\epsilon_0} e^2 \frac{ 1 - \boldsymbol{\alpha}\boldsymbol{\alpha} }{|\boldsymbol{x} - \boldsymbol{y}|} \exp \Big[ iw_{ij}|\boldsymbol{x} - \boldsymbol{y}|/ c \Big] . \] The more popular Coulomb & Breit operator used in molecular physics and quantum chemistry can be derived from the above operator by expanding the exponential and retaining terms up to second order in \( 1/c \).
After some simple but tedious operations we obtain \[ V^\mathrm{CB} = \frac{1}{4\pi\epsilon_0} e^2 \frac{ 1 }{|\boldsymbol{x} - \boldsymbol{y}|} - \frac{1}{4\pi\epsilon_0} e^2 \frac{ \boldsymbol{\alpha}\boldsymbol{\alpha} }{|\boldsymbol{x} - \boldsymbol{y}|} + \frac{1}{4\pi\epsilon_0} \frac{e^2}{2} \Big[ \frac{\boldsymbol{\alpha}\boldsymbol{\alpha} }{|\boldsymbol{x} - \boldsymbol{y}|} - \frac{(\boldsymbol{x} - \boldsymbol{y})\boldsymbol{\alpha}(\boldsymbol{x} - \boldsymbol{y})\boldsymbol{\alpha}}{|\boldsymbol{x} - \boldsymbol{y}|^3} \Big] , \] consisting of the sum of the Coulomb operator, the Gaunt operator, and the Breit retardation operator. <br><br><br><h4>Final remarks</h4><br>I have focused on retarded potentials, but it is possible to obtain the advanced potentials by integrating the equations of motion taking some future time \( t_F \) as the baseline, \( \rho(\boldsymbol{y},t) = \exp[L(t-t_F)] \rho(\boldsymbol{y},t_F) \); contrary to a common myth, there is no violation of causality, because the mechanical equations of motion are deterministic and can be integrated both backward and forward in time. I have also focused on electromagnetic interactions, but the same arguments can be applied to gravitation, resulting in the instantaneous potentials \[ h_{\mu\nu} = -\frac{16 \pi G}{c^4} \int \frac{\sigma_{\mu\nu}(\boldsymbol{y}, t)}{|\boldsymbol{x} - \boldsymbol{y}|} \mathrm{d}\boldsymbol{y} .\] <br><h4>References</h4><br><b>[1]</b> Electromagnetic Theory, Lecture Notes <b>2005 (Accessed May 2017):</b> Poisson, Eric.
<br><br><br><h4>Researchgate: Are you kidding?</h4><br>I do not have a Researchgate account, but I noticed that Researchgate has created a fake profile about me where they archive works of mine while misattributing one of them to nonexistent coworkers. My paper published in the <a href="http://dergipark.ulakbim.gov.tr/eoguijt//article/view/1034000436/0" target="_blank">International Journal of Thermodynamics</a> is misattributed to two nonexistent coworkers, <a href="https://www.researchgate.net/researcher/2053054767_Juan_Ramon" target="_blank">Juan Ramon</a> and <a href="https://www.researchgate.net/researcher/2053135147_Callen_Casas-Vazquez" target="_blank">Callen Casas-Vazquez</a>, when I am the only author. <br /><br />I tried to join in order to correct this blatant error, but since I lack an institutional email, my request was not processed automatically and instead went through a manual verification procedure. I provided links to my published works and links to the works already archived by Researchgate. Moreover, during the process, the software automatically found some other works of mine. <br /><br />Today I received a rejection letter: <br /><blockquote>Dear Juan Ramón González Álvarez, <br /><br />Thank you for your interest in ResearchGate. Unfortunately we were unable to approve your account request. </blockquote><br />I accepted the rejection, because it is their site and their policies, but I mentioned to them that it makes little sense to deny me an account while archiving my works on a fake profile with nonexistent co-workers. I requested Researchgate to delete my profile and the works from their archive. I just received the following funny reply: <br /><blockquote>Thanks for getting in touch.
When browsing ResearchGate you might come across a profile or publications in your name. This is most likely an author profile. <br /><br />Author profiles contain bibliographic data of published and publicly available information. They exist to make claiming and adding publications to your profile easier. <br /><br />If you don't have a ResearchGate profile yet, click on the "Are you González Álvarez?" button on the top right-hand side of the page to be guided through the sign-up process. Once you've created an account, you'll be able to manage and edit all of the publications on your profile. <br /><br />Kind regards, <br /><br />Ben <br />RG Community Support </blockquote><br />Therefore, Researchgate archives my works and automatically generates a profile about me without my permission, misattributes one of my works to nonexistent co-workers, rejects my request to join, does not solve the misattribution issue, and finally suggests that I join so I can edit the profile myself. <br /><br />Are those guys kidding, or is it just plain incompetence? <br /><br /><br /><h4>Update</h4><br />Finally Researchgate has deleted the fake accounts of the nonexistent coworkers <q>Juan Ramon</q> and <q>Callen Casas-Vazquez</q>, changed the profile about me to a new profile with my full name <a href="https://www.researchgate.net/researcher/2053069412_Juan_Ramon_Gonzalez_Alvarez" rel="nofollow" target="_blank">Juan Ramón González Álvarez</a>, cleaned it, and offered to activate an account for me: <br /><blockquote>Obviously you have proved your credentials as a researcher now, I would be happy to activate your account and assign these publications to your account, should you choose that option. <br/><br/>Kind regards, <br/><br/>Thomas <br/>RG Community Support </blockquote><br />I didn't reply... <br /><br /><br /><h4>Second Update</h4><br />Just out of curiosity, I checked the status of the fake profile that they still maintain about me. They have changed things again.
Apart from listing a set of incorrect disciplines with little to no relation to my work, they now attribute only a single publication to me, while my work about heat does not appear in my profile but appears standalone. <table width="100%"><tbody><tr><td><div class="separator" style="margin-top:2em;text-align:center;"><a href="https://1.bp.blogspot.com/-yjA6lpNaJEI/V4Uy9GaxOFI/AAAAAAAAAFU/3Sj8OdQGXzUDH0e1g_FKUvinFLlA6tPxQCLcB/s1600/Captura%2Bde%2Bpantalla%2Bde%2B2016-07-07%2B02%253A54%253A07.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-yjA6lpNaJEI/V4Uy9GaxOFI/AAAAAAAAAFU/3Sj8OdQGXzUDH0e1g_FKUvinFLlA6tPxQCLcB/s320/Captura%2Bde%2Bpantalla%2Bde%2B2016-07-07%2B02%253A54%253A07.png" width="420" height="250" /></a></div></td><td><div class="separator" style="margin-top:2em;text-align:center;"><a href="https://1.bp.blogspot.com/-nu59DQluQSE/V4Uy7RHFmzI/AAAAAAAAAFQ/WvODIKSrnIonIhE1ECEUm93uKjhjAnGiACLcB/s1600/Captura%2Bde%2Bpantalla%2Bde%2B2016-07-07%2B02%253A53%253A52.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-nu59DQluQSE/V4Uy7RHFmzI/AAAAAAAAAFQ/WvODIKSrnIonIhE1ECEUm93uKjhjAnGiACLcB/s320/Captura%2Bde%2Bpantalla%2Bde%2B2016-07-07%2B02%253A53%253A52.png" width="420" height="250" /></a></div></td></tr></tbody></table>