The PlayStation 4’s Job System is very similar in design to the PS3’s (which is why we’ve just taken a look at it). Each of the six AMD Jaguar cores available to the game is responsible for different tasks. Core Zero holds the “Main Game Loop”. It’s this thread’s job to assign workloads to the other five CPU threads and, essentially, to manage them. This is similar to the orchestrator thread from the Killzone Shadow Fall post-mortem we looked at earlier. Each of these cores is in many ways more powerful than the PPU of the PS3, primarily because of the way it handles branch prediction and out-of-order execution (we’ll get to that).
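To make the idea of a main loop farming work out to five workers a little more concrete, here is a minimal Python sketch. It is not Sony’s or any studio’s actual job system: the `animate` job and the `run_frame` helper are made-up names for illustration, and Python threads only show the shape of the scheduling, not the real parallel performance.

```python
from concurrent.futures import ThreadPoolExecutor

WORKER_COUNT = 5  # the five cores the main loop hands work to, per the text


def animate(entity_id):
    """Hypothetical stand-in job: pretend to update one entity."""
    return f"animated {entity_id}"


def run_frame(entity_ids, pool):
    """Main-loop side: submit one job per entity, then wait for every result."""
    futures = [pool.submit(animate, eid) for eid in entity_ids]
    return [future.result() for future in futures]


with ThreadPoolExecutor(max_workers=WORKER_COUNT) as pool:
    frame_results = run_frame(range(8), pool)

print(frame_results[0], "...", frame_results[-1])
```

The main thread never does the work itself here; it only hands jobs out and collects the results, which is the role described for Core Zero.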
‘Mastery of your hardware’, as Naughty Dog puts it, is the secret to great performance. Understanding memory caching, superscalar CPU architectures and, of course, branch prediction is critical. Naughty Dog praises the 80/20 rule, although some programmers refer to it as the 90/10 rule. It simply means that around 80 percent of a program’s running time is spent in only about 20 percent of the code you’ve written. For the best ‘bang for buck’ when optimizing, it’s therefore better to focus on the 20 percent of code that runs the majority of the time, rather than the remainder, which runs far less often.
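A quick way to see why the 80/20 rule matters is to put numbers on it. The sketch below uses Amdahl’s law (our choice of formula, not something Naughty Dog spelled out) and assumes a hypothetical 2x speedup applied either to the hot code or to the rest.

```python
def overall_speedup(time_fraction, local_speedup):
    """Amdahl's law: whole-program speedup when only part of it gets faster."""
    return 1.0 / ((1.0 - time_fraction) + time_fraction / local_speedup)


# Assumed scenario: the same 2x optimization applied to two different places.
hot = overall_speedup(0.80, 2.0)   # the code responsible for 80% of run time
cold = overall_speedup(0.20, 2.0)  # the code responsible for the other 20%

print(f"hot path: {hot:.2f}x, cold path: {cold:.2f}x")
# prints "hot path: 1.67x, cold path: 1.11x"
```

The same amount of optimization effort buys roughly a 67 percent gain in one place and about 11 percent in the other.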
In the example you’ll see above, you’ll note that the GPU is indeed included in the PS4’s Job System. The GPU is shown handling two kinds of work: Rendering, and GPGPU (General Purpose computing on the Graphics Processing Unit) jobs, which are split into wavefronts. Rendering is the easier of the two to explain: it’s simply the game’s graphics, so character models, lighting, textures and the polygons which make up the level are all handled by rendering. Wavefronts are a little more complex, however.
When the PS4’s hardware was created, it was designed with GPGPU in mind. In other words, its designers envisioned that game developers would want to run cloth simulation, physics, fluids and perhaps AI on the GPU. The GPU is a parallel processing monster, capable of having thousands of jobs scheduled at once. A Wavefront (AMD’s term; Nvidia calls its equivalent a Warp) is the most basic unit of scheduling on the GPU. Each Wavefront consists of 64 threads. Because of the GPU’s SIMD (Single Instruction, Multiple Data) roots, every thread in a Wavefront runs the same piece of code on a different piece of data, and the GPU schedules that code onto whichever SPs are available (we discussed these on the previous page). You can think of it as moving a large, heavy box: if one person can’t pick it up, other people give up a few moments of their time to help lift and carry it. Some tasks hold a higher priority than others, however, and that’s what the PS4’s ACEs (Asynchronous Compute Engines) handle. They balance the rendering and GPGPU work to ensure that the game’s frame rate doesn’t tank.
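Since work is always handed out in blocks of 64 threads, a job whose size isn’t a multiple of 64 leaves part of its last Wavefront doing nothing. Here’s a small sketch of that arithmetic; the 10,000-vertex cloth workload is a made-up figure for illustration.

```python
import math

WAVEFRONT_SIZE = 64  # threads per Wavefront, as stated above


def wavefronts_needed(work_items):
    """Number of Wavefronts required to cover every work item."""
    return math.ceil(work_items / WAVEFRONT_SIZE)


def idle_lanes(work_items):
    """Threads in the final Wavefront that have no work item to process."""
    return wavefronts_needed(work_items) * WAVEFRONT_SIZE - work_items


cloth_vertices = 10_000  # hypothetical workload size
print(wavefronts_needed(cloth_vertices), idle_lanes(cloth_vertices))
# prints "157 48"
```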
The GPU is faster at processing certain kinds of data and, despite the CPU being ‘smarter’ and better able to handle a wide variety of instructions, the GPU holds considerably more raw power. Of course, with gamers expecting 1080p visuals, much of that performance can be sucked up simply rendering the game, which is why the PS4’s Volcanic Islands GPU improvements are so vital.
The PS4 handles data similarly to the PS3, with a main thread running and then kicking work over to the other cores, but of course you’ve also got the PS4’s GPU thrown into the mix. This is precisely the reason the PlayStation 4 uses hUMA (heterogeneous Uniform Memory Access). hUMA means that the CPU and GPU talk to the same memory pool, with no copying of data backwards and forwards. There are three memory configurations we could consider. The first is the traditional split configuration, as on the PS3, which had 256MB for video and another 256MB for system. The second is similar to the X360: 512MB in total, which could be split up however developers liked. For example, they could have 200MB for video and the remainder for general use, or 400MB for video and the remainder for general use; it was up to the developer.
hUMA, the third option, is more advanced because it takes the 8GB (well, around five if we look at what’s available to games) and doesn’t give the CPU and GPU separate address spaces: both can access the same piece of memory. This means there’s no requirement to copy data from one location in memory to another, and performance improves accordingly. The PlayStation 4’s CPU / GPU bus structure is an extension of that, requiring three buses in total.
The PS4 has three buses. The first runs at the highest speed, 176GB/s, and is known as Garlic; it connects the GPU to the main system RAM. The second (Onion) connects the PS4’s memory to the CPU and runs at 20GB/s, and the third connects the CPU to the GPU. Chris Jenner, who worked on porting The Crew from the PC to the PS4, said this about the PS4’s buses in an interview:
“The first performance problem we had was not allocating memory correctly… So the Onion bus is very good for system stuff and can be accessed by the CPU. The Garlic is very good for rendering resources and can get a lot of data into the GPU,” Jenner reveals to Eurogamer.
“One issue we had was that we had some of our shaders allocated in Garlic but the constant writing code actually had to read something from the shaders to understand what it was meant to be writing – and because that was in Garlic memory, that was a very slow read because it’s not going through the CPU caches. That was one issue we had to sort out early on, making sure that everything is split into the correct memory regions otherwise that can really slow you down.”
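To get a rough feel for why it matters which bus a resource travels over, we can plug the two peak bandwidth figures quoted earlier into a back-of-envelope calculation. This is only a sketch: the 64MB resource size is a made-up example, real transfers never hit peak bandwidth, and the uncached read latency Jenner describes isn’t modelled at all.

```python
GARLIC_GB_PER_S = 176.0  # peak figure quoted in the text
ONION_GB_PER_S = 20.0    # peak figure quoted in the text


def transfer_ms(megabytes, gb_per_s):
    """Ideal time in milliseconds to move `megabytes`, taking 1 GB as 1000 MB."""
    seconds = megabytes / (gb_per_s * 1000.0)
    return seconds * 1000.0


resource_mb = 64.0  # hypothetical resource size
garlic_ms = transfer_ms(resource_mb, GARLIC_GB_PER_S)
onion_ms = transfer_ms(resource_mb, ONION_GB_PER_S)

print(f"Garlic: {garlic_ms:.2f} ms, Onion: {onion_ms:.2f} ms")
# prints "Garlic: 0.36 ms, Onion: 3.20 ms"
```

Against a 60fps frame budget of roughly 16.7 ms, the difference between the two is anything but academic.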
Memory Caching and PS4:
Game engines often need their most important code and data to be very small so that, if well optimized, it can fit inside the CPU’s cache. On the PlayStation 4 (and the AMD Jaguar) there are several places the CPU can fetch data from: the main system RAM, the Level 2 cache, the Level 1 Data or Level 1 Instruction cache and, finally, the CPU’s registers. Main system memory is the largest of these but also the slowest to pull information from, costing 220+ cycles. For clarification, a cycle here is a single tick of the CPU’s clock, and the figure is how many ticks the core has to wait for the data to arrive. Unlike clock speed, where higher numbers are good, more cycles means a longer wait before the code can continue, so you want as few cycles as possible, particularly for regularly used code.
Next up we have the two caches, Level 2 and Level 1. The Level 2 cache can be accessed in around 30 cycles, which is over 7 times faster than main memory. Clearly this calls for heavy optimization, considering only 2MB is available to each module of four cores. Then there are the Level 1 Data and Level 1 Instruction caches. These are smaller still, at a tiny 32KB each, but at only 3 cycles they’re over 70 times faster to access than the GDDR5 memory. This small amount of memory is left for the highest-performance pieces of code, and fitting the hot code and its data into it means heavily optimized performance. Finally we come to the Jaguar’s registers, which are ‘free’ in terms of access time but are tiny, able to hold only a handful of values at once.
As we’ve discussed before, the PS4’s CPU comprises eight cores, split into two modules of four. Each module has its own Level 2 cache, and for the sake of performance you don’t want a core in one module to access the other module’s cache. For example, let’s assume you want core three (which sits in the first module) to process a piece of data held in the cache of the second module. Doing so incurs a large performance penalty of around 190 cycles, which is over six times slower than accessing data from its own module’s shared Level 2 cache.
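Putting the cycle counts from this section together, we can estimate the average cost of a memory access for a given mix of cache hits. The sketch below is a deliberately simple model: the latencies are the figures quoted in the text, but the two hit-rate mixes are invented purely for illustration.

```python
# Latencies in cycles, taken from the figures quoted in the text.
L1_CYCLES = 3
L2_CYCLES = 30
REMOTE_L2_CYCLES = 190
RAM_CYCLES = 220


def average_cycles(l1_hit, l2_hit, remote_hit):
    """Expected cycles per access; whatever misses every cache goes to RAM."""
    ram = 1.0 - l1_hit - l2_hit - remote_hit
    return (l1_hit * L1_CYCLES + l2_hit * L2_CYCLES
            + remote_hit * REMOTE_L2_CYCLES + ram * RAM_CYCLES)


# Hypothetical mixes: the fraction of all accesses served by each level.
tuned = average_cycles(l1_hit=0.95, l2_hit=0.04, remote_hit=0.00)
careless = average_cycles(l1_hit=0.80, l2_hit=0.10, remote_hit=0.05)

print(f"tuned: {tuned:.2f} cycles, careless: {careless:.2f} cycles")
# prints "tuned: 6.25 cycles, careless: 25.90 cycles"
```

Even with most accesses still hitting Level 1, a small share of remote-module and main memory accesses makes the average access around four times more expensive in this made-up example.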
It’s important to remember that if you read a single byte of data, you’ll end up loading an entire 64-byte cache line into cache. While this might not sound large, each Level 1 cache is only 32KB (one for instructions and one for data), so it can fill up quickly: at 1,024 bytes to the KB, that’s room for just 512 lines. You’re therefore required to do some creative thinking. For example, you can use padding to make sure that data is placed correctly in memory, so that cores in different modules don’t end up sharing a cache line and triggering the cross-module L2 access described above.
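The cache line arithmetic and the effect of padding can both be checked in a few lines. This sketch only works with sizes and addresses rather than real memory, and the 24-byte per-core counter is a hypothetical structure chosen for the example.

```python
CACHE_LINE_BYTES = 64
L1_BYTES = 32 * 1024


def lines_in_cache(cache_bytes):
    """How many cache lines fit in a cache of the given size."""
    return cache_bytes // CACHE_LINE_BYTES


def padded_size(struct_bytes):
    """Round a structure size up to a whole number of cache lines."""
    lines = -(-struct_bytes // CACHE_LINE_BYTES)  # ceiling division
    return lines * CACHE_LINE_BYTES


def line_index(address):
    """Index of the cache line that contains a byte address."""
    return address // CACHE_LINE_BYTES


counter_bytes = 24  # hypothetical per-core counter structure
unpadded = [line_index(i * counter_bytes) for i in range(2)]
padded = [line_index(i * padded_size(counter_bytes)) for i in range(2)]

print(lines_in_cache(L1_BYTES), unpadded, padded)
# prints "512 [0, 0] [0, 1]"
```

Without padding, two cores’ counters land in the same line (line 0); with each counter padded out to 64 bytes, they land in separate lines and no longer fight over it.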
PS4’s Pipelined CPU
A pipelined CPU is a fairly simple concept to understand once you grasp the basics. It’s a set of processing stages connected in series, where the output of one is the input to the next. In the classic model, an instruction passes through Instruction Fetch -> Instruction Decode / Register Fetch -> Execution -> Memory Access -> Write-Back. This is fairly simple unless there’s a branch in the code, such as an if / else statement. In these cases the CPU has to predict which path will be taken before it knows the answer, and it can guess either rightly or wrongly. A wrong guess means the work already in the pipeline has to be thrown away.
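To see what a wrong guess costs, here is a simplified timing model of the five-stage pipeline just listed. It assumes one instruction enters the pipeline per cycle, which is simpler than the real Jaguar, and the 14-cycle flush penalty and the 50 mispredictions are assumed figures for illustration rather than measurements.

```python
STAGES = 5  # fetch, decode, execute, memory access, write-back


def pipeline_cycles(instructions, mispredictions=0, flush_penalty=0):
    """Cycles for an ideal pipeline plus a fixed cost per mispredicted branch."""
    if instructions == 0:
        return 0
    fill = STAGES - 1  # extra cycles before the first instruction completes
    return fill + instructions + mispredictions * flush_penalty


ideal = pipeline_cycles(1000)
with_misses = pipeline_cycles(1000, mispredictions=50, flush_penalty=14)

print(ideal, with_misses)
# prints "1004 1704"
```

In this made-up case, mispredicting just 5 percent of the instructions adds about 70 percent to the running time.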
The AMD Jaguar cores are Out-of-Order (OoO) execution CPUs with advanced branch prediction. All of this means that they’re much better at ‘guessing’ which bits of code are going to come up next, and if something unexpected needs to be processed, there’ll be less wait time. This in turn means less reliance on the compiler ordering the application’s instructions efficiently. Although in a perfect world the application would be compiled perfectly, AI and other code which can branch off into dozens of ifs and elses can be extremely hard to predict and to keep in the CPU’s cache or registers. More efficient out-of-order execution and branch prediction are therefore essential for the Jaguar.
On the PS3, branch prediction was pretty awful, but on the PS4 you don’t have to concern yourself with it nearly as much. This just helps to demonstrate that the PS4 is by far the easier machine to work with compared to the PS3. For more on branch prediction and how it works, check out our analysis of the PS3’s Cell processor.
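For a feel of how a predictor ‘guesses’, here is the classic textbook scheme, a 2-bit saturating counter, in Python. To be clear, this is not the Jaguar’s actual predictor, which is more sophisticated; it is a sketch that shows why regular branches are easy to predict and irregular ones are not.

```python
def predict_branches(outcomes):
    """Run a 2-bit saturating counter over a list of taken/not-taken outcomes.

    Counter values 0-1 predict "not taken", values 2-3 predict "taken".
    Returns the number of correct predictions.
    """
    counter = 2  # start in the "weakly taken" state
    correct = 0
    for taken in outcomes:
        if (counter >= 2) == taken:
            correct += 1
        # Move one step toward the actual outcome, clamped to the range 0..3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct


loop_branch = [True] * 9 + [False]  # a loop branch: taken 9 times, then exits
alternating = [True, False] * 5     # a branch that flips every time

print(predict_branches(loop_branch), predict_branches(alternating))
# prints "9 5"
```

The loop branch is predicted correctly nine times out of ten, while the alternating branch manages only five out of ten, no better than a coin toss.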
Sources
AMD GCN White Papers
MSDN
GameEngineBook
Extreme Tech