In this second part of our post-mortem, we’ll take a look at how Sucker Punch Productions handles the GPGPU structure of the PS4, dig into a more detailed breakdown of the game’s double buffering system, and get some confirmation on the memory access speeds of the CPU. It’s been a while since the first part of the Infamous: Second Son Engine Post-Mortem, but we’re finally diving right back into the GDC 2014 presentations.
During this second part we’ll assume that you have a fair knowledge of the PS4’s basic inner workings, but if you don’t, you can check out the basics of the PS4’s hardware here to get up to speed.
PS4 Bus Structure, Compute & GPU Performance
The PS4 relies on the GPGPU (General Purpose computing on the Graphics Processing Unit) to do a lot of the heavy lifting with the console. This is achieved via a method known as Asynchronous Compute, the basic premise being that the CPU schedules jobs for the GPU to run (more on this later). These compute tasks will then be combined with the pixel data of a frame and effectively ‘mixed’ into the final image which you’ll see on screen.
Interestingly enough, Sucker Punch highlights the CPU as the major weak point of the next generation Playstation. We’ve previously discussed the CPU of the console at length, but as a reminder, you’re looking at an x86-64 CISC (complex instruction set computing) CPU known as the AMD Jaguar. It’s a low power processor built around a module of four cores, and Sony uses a modified version comprising two modules, each housing four cores (eight in total). Of these eight, two are reserved for OS functionality, which leaves six for game developers.
Sucker Punch said in a note in their own GDC 2014 Post-Mortem (regarding the CPU): “While the CPU has ended up working pretty well, it’s still one of our main bottlenecks. It’s also less easy to optimize after the fact because of the out of order nature.” For clarification, the PS3’s Cell CPU was “In-Order”, meaning that it executed instructions strictly in the order the program listed them. If the data or a branch didn’t arrive as planned, the CPU stalled and lost performance. This meant that it was vital for developers to optimize their code, and there was a heavy reliance on the compiler to ensure things were as they should be. See our PS3 post-mortem here for more info. The Jaguar, meanwhile, executes “Out of Order” and has much better branch prediction, but ironically, according to Sucker Punch, this means there’s less wiggle room to optimize the code.
Strangely, this closely follows what Microsoft said about the Xbox One when they were defending its specs before release. Those with a sharp memory might recall the X1 receiving a clock speed boost on both the CPU and GPU. The PS4 and X1 share the same CPU architecture, and Microsoft had the overhead left (from heat and power) to crank the dial up on the X1’s CPU clock from 1.6 to 1.75GHz. They later commented that the increase in CPU clock speed had provided a larger performance boost than the increase in GPU clock, hinting that the X1’s CPU was a slight bottleneck.
Furthermore, we’re also provided a little more confirmation regarding the Playstation 4’s “Bus Structure” and its GPU performance. The GCN GPU inside the PS4 has 1152 shaders, putting out just under two trillion floating point operations per second (1.84 TFLOPS), but according to Sucker Punch in their particle engine analysis of Second Son: “Theoretical Peak Performance unlikely to achieve in real life, but even if just a fraction, that’s a lot of computation”.
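As a rough check on where the 1.84 TFLOPS figure comes from, the arithmetic can be sketched as below. The 800MHz GPU clock and the two floating point operations per shader per clock (one fused multiply-add) are our assumptions, not numbers taken from the Sucker Punch talk.

```python
# Peak throughput sketch. Only the shader count comes from the article;
# the clock speed and FLOPs-per-clock values are assumptions.
SHADERS = 1152
FLOPS_PER_SHADER_PER_CLOCK = 2   # assumed: one fused multiply-add per clock
GPU_CLOCK_HZ = 800_000_000       # assumed: 800MHz GPU clock


def peak_tflops(shaders, flops_per_clock, clock_hz):
    """Theoretical peak in TFLOPS: every shader busy on every clock."""
    return shaders * flops_per_clock * clock_hz / 1e12


PEAK = peak_tflops(SHADERS, FLOPS_PER_SHADER_PER_CLOCK, GPU_CLOCK_HZ)
print(f"{PEAK:.2f} TFLOPS")  # prints "1.84 TFLOPS"
```

This is the "theoretical peak" Sucker Punch refer to: it assumes every shader does useful work on every single clock, which real workloads won’t manage.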
The GPU-only memory access is very fast at 176GB/s, but it’s long been suspected that the CPU doesn’t have the same type of bus speed access. Inside the PS4 are three buses. We’ve discussed this previously in our Naughty Dog Sinfo Analysis if you’re interested. The main bus people know about is also the fastest: known as Garlic, it provides the GPU’s access to the memory at 176GB/s. But there’s also another bus the CPU uses for memory access, and its bandwidth had been guessed to be between twenty and thirty GB/s. Although the exact number isn’t stated by Sucker Punch, they say “CPU bus bandwidth < 20GB/s, 10x slower [than the GPU bus]”. Taking “10x slower” literally gives 17.6GB/s, which puts the CPU bus of the PS4 around the 17 to 20GB/s mark. It’s good to see that our suspicion regarding the bus structure would appear to be correct.
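To put those two bus figures side by side, here is a small sketch of the arithmetic. The 64MB buffer size is a made-up value chosen purely for illustration, and reading "10x slower" as an exact factor of ten is our assumption.

```python
# Bus bandwidth sketch. 176 GB/s and "10x slower" come from the article;
# the buffer size below is hypothetical.
GPU_BUS_GBPS = 176.0
SLOWDOWN = 10.0                       # assumed to be an exact factor
CPU_BUS_GBPS = GPU_BUS_GBPS / SLOWDOWN


def transfer_ms(megabytes, gb_per_second):
    """Milliseconds needed to move `megabytes` at `gb_per_second` (1 GB = 1000 MB)."""
    return megabytes / (gb_per_second * 1000.0) * 1000.0


BUFFER_MB = 64.0  # hypothetical buffer size

print(f"CPU bus estimate: {CPU_BUS_GBPS:.1f} GB/s")
print(f"GPU bus: {transfer_ms(BUFFER_MB, GPU_BUS_GBPS):.2f} ms for {BUFFER_MB:.0f} MB")
print(f"CPU bus: {transfer_ms(BUFFER_MB, CPU_BUS_GBPS):.2f} ms for {BUFFER_MB:.0f} MB")
```

At 30 frames per second a frame lasts roughly 33ms, so the gap between the two buses matters whenever large buffers are touched from the CPU side.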
The CS (Compute Shader) is dispatched to the GPU much like a Draw Call (a call for the GPU to ‘draw’ an object, be it a texture, a box, or whatever). Setup of the parameters is handled by the CPU and sent in a Dispatch (which is a number of different thread groups), and a thread can read/write arbitrarily. A Thread Group is just a bunch of different threads together, and from the perspective of AMD GCN (at least currently) each Thread Group is made of 64 threads, which matches the size of a GCN wavefront. More about this here.
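To give a feel for what a Dispatch looks like in numbers, the sketch below works out how many 64-thread groups are needed if each thread handles one particle. The one-thread-per-particle mapping is our assumption for illustration; the particle counts are the ones quoted later in this article.

```python
# Thread group sketch. The group size of 64 comes from the article;
# mapping one thread to one particle is an assumption.
THREADS_PER_GROUP = 64


def groups_needed(work_items, threads_per_group=THREADS_PER_GROUP):
    """Thread groups required to cover every work item (rounded up)."""
    return (work_items + threads_per_group - 1) // threads_per_group


def idle_threads(work_items, threads_per_group=THREADS_PER_GROUP):
    """Threads in the last group that have no work item to process."""
    return groups_needed(work_items, threads_per_group) * threads_per_group - work_items


for particles in (30_000, 128_000):
    print(particles, "particles ->", groups_needed(particles), "groups,",
          idle_threads(particles), "idle threads")
```

Because the group count is rounded up, the last group usually carries a few threads with nothing to do, so a real shader would need a bounds check before touching particle data.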
The studio uses a ‘programming’ tool of sorts for the graphics artists known as “Particle Expressions”, which focuses on allowing the visual artists to create particles and other effects easily and without diving into code. They are able to use a bunch of “expressions” which define various functions, including the size, position and color of the various particles. It’s very much like Microsoft’s Excel in the basic way it operates. Expressions aren’t a ‘new’ programming concept, and even in Java there’s what’s known as a Unified Expression Language, which is a specially created language to make it easier for web designers who’ve limited or no experience in Java to implement Java functionality. This same principle applies here. The basic idea is to empower the artists and have them create the visual style of the game without either needing to write code themselves or using up the programmers’ time in coding their visual effects. The whole principle is focused on improving the work flow of the studio and making visual results quicker to see.
This, however, means that there’s a two stage compile of the game’s data: the Expression is converted into PSSL (Playstation Shader Language, the PS4’s shading language, which we’ve covered on the Youtube channel) and the PSSL is then compiled into machine code. The team note that the output of the compiler isn’t well optimized and that trusting in the compiler is misplaced. They also believe that there’s room for improvement in this area.
Once the code from the expressions has been turned into PSSL, it is injected into the remaining code via “#defines” and command-line parameters to the final PSSL compiler. A #define is a preprocessor macro: wherever the macro’s name appears in the source, the preprocessor substitutes the text it was defined with. This is used a lot in C++ and other coding languages for constants and common statements (for example loops), but in this case they’re simply using it for another purpose.
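The substitution idea can be sketched in a few lines. This is not Sucker Punch’s tool or the PSSL compiler: the macro names, the shader-like template and the expression strings are all hypothetical, and the function is only a toy stand-in for what a preprocessor does with a #define.

```python
import re


def expand_defines(source, defines):
    """Replace each whole-word macro name in `source` with its defined text."""
    if not defines:
        return source
    names = "|".join(re.escape(name) for name in defines)
    pattern = re.compile(r"\b(" + names + r")\b")
    # A function is used as the replacement so the defined text is inserted verbatim.
    return pattern.sub(lambda match: defines[match.group(1)], source)


# Hypothetical hand-written shader template with two macro placeholders.
TEMPLATE = (
    "float size = PARTICLE_SIZE_EXPR;\n"
    "float3 color = PARTICLE_COLOR_EXPR;"
)

# Hypothetical output of the expression-to-shader conversion step.
DEFINES = {
    "PARTICLE_SIZE_EXPR": "lerp(1.0, 0.0, age / lifetime)",
    "PARTICLE_COLOR_EXPR": "float3(1.0, 0.5, 0.0) * (1.0 - age / lifetime)",
}

SHADER = expand_defines(TEMPLATE, DEFINES)
print(SHADER)
```

The appeal of this approach is that the hand-written shader code stays fixed, while only the artist-authored expressions change from effect to effect.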
Double Buffering Compute & Pixels
Infamous Second Son is “Double-Buffered”, so in this case there’ll be one frame being computed on the PS4’s CPU while the previous frame is being drawn. The computing which the Jaguar CPU handles spans a variety of different tasks (a lot of which we’ve covered in the first part of this article), and includes basic gameplay, game engine physics and so on. Double (or even triple) buffering has been around for a long time, and in its most basic form (when GPGPU isn’t present) it works by having a front and a back buffer. The back buffer is what’s currently being drawn (in the background), while the second buffer (known poetically as the front buffer) is the one that’s being displayed on screen. These two buffers always exist, but a buffer “swap” constantly happens, so the front buffer becomes the back buffer and the back buffer takes on the role of the front buffer.
So in short, with Infamous Second Son the display on screen is always a frame behind what’s currently being computed on the CPU and GPGPU (in other words, using the GPU’s Compute Shader). That frame is then drawn and eventually pushed on screen.
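The classic front/back buffer arrangement described above can be sketched as follows. It’s a toy model of the general technique, not Sucker Punch’s engine code, and the class and function names are our own.

```python
class DoubleBuffer:
    """Toy model of a front buffer (displayed) and a back buffer (being drawn)."""

    def __init__(self):
        self.front = None  # frame currently shown on screen
        self.back = None   # frame currently being drawn

    def draw(self, frame):
        """Render a new frame into the back buffer."""
        self.back = frame

    def swap(self):
        """Exchange the roles of the two buffers."""
        self.front, self.back = self.back, self.front


def run(frames):
    """Return what the screen shows while each frame in `frames` is being drawn."""
    buffers = DoubleBuffer()
    shown = []
    for frame in frames:
        buffers.draw(frame)          # work on the new frame in the background
        shown.append(buffers.front)  # the display still shows the older frame
        buffers.swap()               # the finished frame becomes visible next
    return shown


print(run([1, 2, 3]))  # prints [None, 1, 2]
```

The output makes the one frame delay visible: while frame 2 is being drawn the screen shows frame 1, and nothing at all is available while the very first frame is in progress.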
Infamous Second Son relies on particles for a massive number of different tasks; everything from Delsin’s smoke power to an exploding car relies on particles being correctly computed and rendered on screen. You’ll notice that when the game engine is forced to calculate a lot of particles (particularly in an open environment with lots of bad guys and explosions going off in all directions), the frame rate will suffer a little. During the GDC 2014 conference Sucker Punch provided a little information regarding the usage of compute for Second Son.
Sucker Punch note there are multiple ways compute can run. The first would be similar to a regular Graphics Pipe, with the compute instructions effectively ‘thrown in’ with things such as vertex and pixel shaders. The second option is “Asynchronous Compute Queues”, which effectively run at the “same time” as the draw tasks. The hardware itself has schedulers (this has been covered in depth here) and allocates the various shader resources of the GPU to do what’s necessary. Remember (as the previous link shows) the PS4 has a beefed up queue structure, and Sucker Punch point out that you can use a “fancy API to prioritize, but we don’t”. Unfortunately there’s still a lot of CPU time eaten up by particles and compute, as it’s down to the CPU to set up and dispatch the compute tasks.
According to Sucker Punch, I:SS runs around 30K particles per frame of animation, although their engine can push up to 128,000. Rain on the other hand is fairly costly, pushing the count up between 30 and 40K.
The rather complicated nature of how compute data for the particles flows around the game engine is seen above. The game engine must keep track of each and every particle on screen, which means their position, color, transparency, emit time and so on. To begin with, a particle is emitted (spawned and created) for a particular frame of animation. An updater then runs over all the different particles on screen, killing particles which have ‘aged-out’ (imagine a spark from a fire: it doesn’t burn forever, it burns out as it flies away from the fire), along with adjusting each particle’s relative position in the X, Y and Z coords.
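The emit and update steps described above can be sketched on the CPU like this. In the game this work runs in compute shaders on the GPU; the field names, the fixed one second lifetime and the upward velocity here are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Particle:
    x: float
    y: float
    z: float
    vx: float
    vy: float
    vz: float
    emit_time: float
    lifetime: float


def emit(now, count):
    """Spawn `count` particles at the origin, drifting upwards (made-up values)."""
    return [Particle(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, now, 1.0) for _ in range(count)]


def update(particles, now, dt):
    """Drop aged-out particles and move the survivors by velocity * dt."""
    alive = []
    for p in particles:
        if now - p.emit_time >= p.lifetime:
            continue  # the particle has 'aged-out', like a spark burning out
        p.x += p.vx * dt
        p.y += p.vy * dt
        p.z += p.vz * dt
        alive.append(p)
    return alive


particles = emit(now=0.0, count=3)
particles = update(particles, now=0.5, dt=0.5)  # all three are still alive
particles += emit(now=0.5, count=2)
particles = update(particles, now=1.0, dt=0.5)  # the first three have aged out
print(len(particles))  # prints 2
```

Every particle is independent of the others in this loop, which is what makes the job such a natural fit for thousands of GPU threads running the same program.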
The Syncing Feeling
The Playstation 4’s GPU contains 1152 shaders, and so compared to a CPU there are significantly more threads being handled simultaneously by the hardware. To make the most of this, developers have to think in parallel and ensure that the data from the CPU and GPU are synced with each other. This is vital to ensure both devices are smoothly passing data backwards and forwards. Remember the PS4 does use hUMA (Heterogeneous Unified Memory Architecture, article here), meaning the GPU and CPU both share the same memory address space. Because the CPU and GPU are both utilizing and addressing the exact same memory space, memory copy operations are minimized. That is, data doesn’t need to be prepared by the CPU and then copied to the GPU’s memory for it to process. Instead, the CPU and GPU can just speak to the same bit of data without the penalty of copying.
That isn’t to say that this system doesn’t have any weaknesses, however, one of the primary ones being syncing the data, particularly when you’re dealing with fine grained computing. Interestingly enough, there’s still the issue of copying data from the cache of either the CPU or the GPU, or even the registers of the processors.
Ensuring that the data is synced becomes vital when thousands of instructions are being run in parallel and you’re dealing with huge buffers. The Playstation 4, as you’d expect, has multiple ways to deal with this, but because the GPU and CPU have separate caches, synchronization can get complicated really fast. Sucker Punch point out that it’s sometimes better to avoid using the cache (the Garlic bus of the PS4 actually bypasses the caches of the GPU). Sometimes the only way around this is to issue a “sync” command to the GPU’s buffer, but it’s not something you wish to constantly issue.
If you issue the command, you’re asking the system to finish all of the previous commands before the next set are issued, which causes a slight blockage. The CPU can’t issue more commands, and the GPU is just processing the commands already issued as you’re, in Sucker Punch’s own words, “blowing the pipelines”, likely due to latency. Parallel computing typically has a slight latency attached to it, but due to the constant flow of data between the CPU and GPU it’s usually not something which affects performance.
In the case Sucker Punch provided in its GDC 2014 discussion of particles, they run the emit programs for each of the emitters (emitters spew forth the particles), then issue a sync command and run an update of all the programs. Another sync occurs, then they ribbonize, sync again and finally issue a sort of the particles.
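The shape of that emit, sync, update, sync, ribbonize, sync, sort sequence can be sketched with CPU threads standing in for GPU work. Everything here is a stand-in: the stage functions are hypothetical placeholders, and the "sync" is modelled simply as waiting for every task in a stage to finish before the next stage starts.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(pool, fn, items):
    """Run `fn` over `items` in parallel, then block until all are done (the 'sync')."""
    return list(pool.map(fn, items))


def emit(emitter_id):
    """Placeholder emit program: each emitter spawns three particles with made-up depths."""
    return [{"emitter": emitter_id, "depth": float((emitter_id * 7 + i * 3) % 10)}
            for i in range(3)]


def update(particle):
    """Placeholder update program: nudge the particle's depth."""
    return {**particle, "depth": particle["depth"] + 0.5}


def ribbonize(particle):
    """Placeholder for the ribbon step: just flag the particle as processed."""
    return {**particle, "ribbon": True}


def simulate(emitter_ids):
    with ThreadPoolExecutor(max_workers=4) as pool:
        batches = run_stage(pool, emit, emitter_ids)         # emit, then sync
        particles = [p for batch in batches for p in batch]
        particles = run_stage(pool, update, particles)       # update, then sync
        particles = run_stage(pool, ribbonize, particles)    # ribbonize, then sync
    # Final step: sort the particles, here from farthest to nearest.
    return sorted(particles, key=lambda p: p["depth"], reverse=True)


result = simulate([0, 1])
print([p["depth"] for p in result])
```

Each call to `run_stage` leaves the workers idle until the slowest task in that stage finishes, which is a small-scale picture of why you don’t want to issue syncs more often than the data dependencies demand.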
Part Three – the visuals coming soon