Currently there’s a trend of discussing the performance of the PS4 and Xbox One with much of the emphasis placed on the GPU and memory bandwidth of the systems rather than on CPU performance. This shift of focus to the GPU isn’t anything new, particularly to PC gamers – indeed, it’s been happening steadily for some time (a fact both AMD and Nvidia remind us of at every opportunity). GPU performance has risen tremendously in comparison to that of the CPU, with performance per watt scaling at a much better rate. Both Sony’s and Microsoft’s respective machines are powered by AMD’s Jaguar CPU cores, which barely break 100GFLOPS of computing power. You could be forgiven for initially laying the blame squarely at the feet of the consoles’ “low power” CPUs, but even the highest-performance PC CPUs generally only reach around the 200GFLOP range. The PS4’s GPU, meanwhile, puts out almost 2TFLOPS of computing performance – and that’s slow compared to a gaming PC, where we can already buy GPUs of 5TFLOPS or more, and with dual GPUs you’ll reach 8 to 10TFLOPS of computing performance (single precision, of course).
As one would expect, there’s no single reason for the performance gulf between the GPU and the CPU, but rather a combination of factors that has led us to the current state of affairs. The process size of CPUs has shrunk significantly – from 130nm down to 90nm, with current CPUs reaching 22nm – and while this process shrinking is slowing down, it doesn’t tell the whole story by itself.
The traditional metric by which CPU performance has been measured is clock speed – how fast we clock the transistors on the chip in question. If you look back at the clock frequencies both AMD and Intel were pumping out in the early 2000s and compare them to modern CPUs, little has changed. Most CPUs still sit around the mid-3GHz clock frequencies they had reached back in the Intel Pentium 4 era. It’s not because transistors have stopped becoming “smaller and faster” – it’s a slightly different issue: heat. Running millions (now billions) of transistors generates a lot of heat, something Intel learned all too well with the Pentium 4 (Netburst) architecture. Originally Intel planned to reach 10GHz(!) with Netburst, but reality struck and they were stuck at around 4GHz.
That’s not to say that clock speed was the only factor even back then – one example would be Intel’s Pentium 2 versus AMD’s old K6-2. The Intel CPU would typically ‘win’ in benchmarks and applications which demanded greater floating point performance, even when the two CPUs were evenly matched in pure clock speed and paired with equivalent RAM.
Modern CPUs haven’t drastically improved their performance on a clock-for-clock basis either. Intel’s Haswell architecture isn’t massively faster than Sandy Bridge, despite introducing AVX2 (Advanced Vector Extensions 2) over its Sandy cousin. Indeed, even the Intel Core i5-750 (released mid-2009) is only about 40 percent slower in IPC (Instructions Per Clock) than the Haswell Core i5-4670.
CPUs have instead focused on “going wider” and becoming more efficient to improve their performance, rather than simply cranking up the frequencies. Previously there was typically a single CPU core, and improvements in cache speed, extra instructions (such as SSE) and so on were combined with large increases in core frequency to improve the architecture. Now we’re forced to “go wider”, which means packing several cores (two, four, six and eight cores are common as of the time of writing) onto the silicon and keeping the clock frequencies more modest.
Writing code that runs on multiple CPU cores simultaneously can be more challenging, as each of the threads needs to be kept “in sync” with the others and, ideally, the workload evenly distributed across all available cores. Typically this wasn’t the case in early multi-core games (and even until fairly recently): major titles would have one CPU core loaded at near 100 percent while the remaining cores idled in the mid twenties or thirties. One of the challenges developers have had to overcome is therefore their approach to writing game / application code so that it’s written for multi-core machines, as sketched below.
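As an illustration only, here’s a minimal host-side C++ sketch of the kind of even work distribution described above (the update_entities workload is made up for the example): the item range is split into equal slices, one per core, and the main thread waits for every worker to finish before moving on.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical example: evenly split "count" items across "cores" CPU threads.
void update_entities(float* entities, int count, int cores) {
    std::vector<std::thread> workers;
    int slice = (count + cores - 1) / cores;            // items per core, rounded up
    for (int c = 0; c < cores; ++c) {
        int begin = c * slice;
        int end   = std::min(begin + slice, count);
        workers.emplace_back([=] {
            for (int i = begin; i < end; ++i)
                entities[i] += 1.0f;                     // placeholder per-item work
        });
    }
    for (auto& w : workers)
        w.join();                                        // wait until every thread is done
}
```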
CPUs are developed and optimized for sequential, serial processing, while GPUs are designed to process lots of data simultaneously, effectively juggling hundreds if not thousands of jobs all at once.
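To make the contrast concrete, here’s an illustrative sketch (not any particular engine’s code) of the same job written both ways: the CPU version walks the data one element at a time, while the CUDA kernel gives every element its own lightweight thread.

```cuda
// CPU version: one core walks the array sequentially, element after element.
void scale_cpu(float* data, int n, float factor) {
    for (int i = 0; i < n; ++i)
        data[i] *= factor;
}

// GPU version: each of the thousands of threads handles a single element,
// so the whole array is processed (effectively) at the same time.
__global__ void scale_gpu(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element index
    if (i < n)
        data[i] *= factor;
}
```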
Enter the GPU’s Parallel Computing
GPU-accelerated computing works by offloading certain compute-intensive tasks onto the GPU while the remainder of the game / application’s logic keeps running as usual on the CPU. You will typically get situations where, for instance, 5 percent of the code runs on the GPU while the rest runs merrily on the CPU.
CPUs contain a lot of “control hardware”, whose job is to manage the day-to-day operation of the chip and of the PC or device it’s running in. This allows the CPU to accept a much wider variety of commands (for example managing memory – we’ll get to that in a moment) but in turn eats up space on the chip. GPUs, meanwhile, get by with far simpler control hardware and can therefore dedicate that space to many more, simpler cores. Rather than, say, four large and complex processor cores, modern GPUs typically have over a thousand simplified cores (often known as shaders or ALUs).
One can think of the CPU as the “smart” component in the machine and the GPU as the faster, more powerful (and yet stupider) workhorse. The CPU can do a lot more tasks, is much better at decision making, and can control all the other hardware in the machine (including the allocation of resources, such as how much memory a program uses for, say, sound). But because it has been created to be so general purpose, such a jack of all trades, a sacrifice has to be made – and that sacrifice is made in silicon. The GPU doesn’t have to pay this price in control hardware, so it can instead spend the space on a large number of shader cores. Because GPUs are more latency tolerant, less die space is typically allocated to cache: a shared level 2 cache is often only one or one and a half MB, with another small amount of cache per SMX or Compute Unit (the clusters of shaders).
As hinted at by their lesser reliance on cache, GPUs optimize for raw throughput – in a typical scenario a GPU is focused on how many tasks it can get done in a certain amount of time, not on how long any specific task takes. CPUs are the reverse and optimize for latency – they’re not focused on getting tons of tasks done at once, but on reducing the time taken on a per-task basis.
So what we have now in the world of GPGPU (General Purpose computing on Graphics Processing Units) is a situation where the CPU issues commands for the GPU to process alongside its graphics work. The CPU handles much of the logic and control, telling the GPU how and when it should process each command. Typically the CPU moves data from its own memory over to the GPU, launches the kernel on the GPU, and finally copies the processed data from the GPU back to the CPU’s side.
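As a rough sketch of that copy / launch / copy-back pattern (using the CUDA runtime API purely as an example – the process kernel and its contents are made up), the host-side code looks something like this:

```cuda
#include <cuda_runtime.h>

// Made-up kernel: each thread does some trivial work on one element.
__global__ void process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

void run_on_gpu(float* host_data, int n) {
    float* dev_data = nullptr;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&dev_data, bytes);                                    // allocate GPU memory
    cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice);  // copy CPU -> GPU

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    process<<<blocks, threadsPerBlock>>>(dev_data, n);               // launch the kernel

    cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost);  // copy GPU -> CPU
    cudaFree(dev_data);
}
```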
In certain situations the data flow between the two is slightly different, such as with the PS4’s hUMA (heterogeneous Uniform Memory Access). Because the CPU and GPU share the same memory space, the memory copies aren’t required (except in certain situations, such as when the data is held in a cache), but other actions, such as the CPU kicking off the job on the GPU, are still present. For more info on hUMA click here.
The GPU is fantastic at launching lots of threads – even older, slower GPUs don’t break a sweat managing 1,000+ threads in hardware. Typically applications are written as if they were serial, aside from a small piece of code which tells the GPU how many threads to launch. You can say “run 1,000 threads” or “run 100K threads”, all depending upon the hardware and the speed you’re trying to achieve.
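In CUDA terms (again just as an illustration, reusing the scale_gpu kernel and dev_data buffer from the sketches above), that “small piece of code” is little more than the launch configuration – the kernel itself stays the same whether you ask for a thousand threads or a hundred thousand:

```cuda
int threadsPerBlock = 256;

// "Run 1,000 threads"
int smallJob = 1000;
scale_gpu<<<(smallJob + threadsPerBlock - 1) / threadsPerBlock, threadsPerBlock>>>(dev_data, smallJob, 0.5f);

// "Run 100K threads" -- only the numbers change, not the kernel
int bigJob = 100000;
scale_gpu<<<(bigJob + threadsPerBlock - 1) / threadsPerBlock, threadsPerBlock>>>(dev_data, bigJob, 0.5f);
```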
Both next-generation consoles use variants of AMD’s GCN (Graphics Core Next) GPU architecture. On this particular hardware, threads are created in groups (64 threads per group – known on AMD hardware as a wavefront) and launched on the shaders to compute. Sony’s PlayStation 4 (much like the Volcanic Islands GPUs) uses improved compute queues and engines, which are responsible for queuing and executing the commands. These commands can typically be scheduled to run simultaneously with graphics commands (a simple example would be dedicating, say, 52 shaders to compute and the remaining 1,100 to graphics), or told to be processed at different parts of the frame (before drawing) and more.
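A quick back-of-the-envelope sketch of how that grouping works out (the 100,000-thread workload here is an arbitrary example; 64 is the GCN wavefront size mentioned above):

```cuda
const int WAVEFRONT_SIZE = 64;                 // threads per wavefront on AMD GCN

int threads = 100000;                          // arbitrary example workload
int wavefronts = (threads + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
// 100,000 threads -> 1,563 wavefronts scheduled across the GPU's compute units
```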
Let’s construct a rather pointless program – but it’ll serve as a basic example. Suppose you want to do the following: A is one million, B is 2. You take A, divide it by B and copy the result into C. You then want to KEEP the value of C, divide it by B once again, keep that result, and carry on dividing until you’re left with the number closest to 100.
It would be the CPU’s job to allocate memory, copy the data over to the GPU’s memory (if applicable) and tell the GPU “hey, here’s this data and I need you to do the following with it”. The GPU then processes the data, the results are copied back to the CPU’s memory, and that’s that.
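Tying it together, here’s one possible (and deliberately silly, single-threaded) way to express the toy example above in CUDA. The kernel name and structure are purely illustrative, but the allocate / launch / copy-back flow is the one just described:

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Keep dividing "a" by "b" for as long as the result gets closer to 100,
// then write the final value out. A real compute job would of course
// launch thousands of threads rather than one.
__global__ void divide_towards_100(float a, float b, float* out) {
    float c = a;
    while (fabsf(c / b - 100.0f) < fabsf(c - 100.0f))
        c /= b;
    *out = c;
}

int main() {
    float* dev_result = nullptr;
    cudaMalloc(&dev_result, sizeof(float));                  // CPU allocates GPU memory

    // "Hey, here's this data and I need you to do the following with it"
    divide_towards_100<<<1, 1>>>(1000000.0f, 2.0f, dev_result);

    float result = 0.0f;
    cudaMemcpy(&result, dev_result, sizeof(float),
               cudaMemcpyDeviceToHost);                      // copy the result back to the CPU
    cudaFree(dev_result);

    printf("Closest to 100: %f\n", result);                  // ~122.07
    return 0;
}
```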
As you’d expect, there are some issues with this – even on systems where you’re dealing with hUMA – chiefly syncing data. Indeed, during our breakdown of Sucker Punch’s Infamous Second Son particle talk, the studio admitted that syncing the data was a real performance killer. Because of the overhead associated with allocating resources, syncing data (particularly where caches are concerned) and so on, developers must be careful when choosing which code is best run via GPU compute. They’ve also got to be aware (in the case of video games) of the negative effect it may have on frame rates, as you’re potentially leaving fewer resources for graphics processing. Carefully selecting which tasks are best suited to the parallel nature of the GPU, and ensuring they’re as optimized as possible, is therefore critical.