We knew from the first screenshots and gameplay snippets of Infamous Second Son that the title would help define what the PS4 was capable of. While its release has been met with some criticism concerning the gameplay, few would doubt that overall the game is very technically impressive. Recently a PDF of Sucker Punch’s GDC 2014 presentation has surfaced, providing a rather interesting insight into the engine which powers the title.
When you’re playing as Delsin, jumping across rooftops, firing lasers and beating on bad guys, you might be forgiven for not noticing the rather impressive use of particle systems, the massive draw distances or the huge array of animations that the citizens around you exhibit. But they’re there, putting the Playstation 4 through its paces as we’ll discover.
At its very core, Sucker Punch aimed to run Infamous Second Son at a resolution of 1920×1080, targeting up to 60FPS gameplay. In reality the title of course manages around the thirty FPS mark (usually slightly north of this) more often than not. SMAA x2 is applied as a post process for Anti-Aliasing.
Playstation 4 – 4.5GB Memory available?
The PS4 is most Infamous (sorry, couldn’t resist) for its memory architecture, consisting of 8GB of GDDR5 memory running on a 256 bit bus at 5500MHz. This of course provides the 176GB/s memory bandwidth which is now fairly synonymous with the console. We learned before the system was released that game developers would be working with far less than 8GB. Of course, we’d expected a certain portion of RAM to be locked off for OS functionality (early estimates were around 1GB to 1.5GB), but according to Sucker Punch we’re looking at 4.5GB of the 8GB of GDDR5 memory being available to developers.
One can be forgiven for being curious as to what Sony intends to use this extra memory for. If the OS is currently using 1.5GB, that plus the 4.5GB still leaves 2GB unaccounted for. It’s likely we’re seeing Sony reserve this memory for later use: if the OS turns out to need more memory down the line, they can’t claw it back from games after the fact, so they’re likely taking the cautious route for now. After all, we don’t know what they’ll need for, say, Sony’s Virtual Reality or future applications.
With Infamous Second Son, 2.5GB of memory is taken up by the game’s ‘main’ assets (Loaded Data), which is hardly surprising news. Things such as textures, audio, animation and other basic assets fit into this description.
Flexible Memory (which is the victim of a typo in the slide) likely refers to the PS4’s Flexible Memory system. Flexible Memory is RAM managed by the Playstation 4’s OS on behalf of the title in question, but despite being managed it is the game’s memory to use. This is supposedly in addition to another 512MB of memory that’s paged, apparently acting much like you’d expect a PC’s swap file to.
It’s therefore hard to know from these slides whether the Flexible Memory should be included in the pile, as if we add up all the memory we’re slightly over the 4.5GB limit.
Atlases and Buffers weigh in at ‘only’ 370MB total, but Texture Atlases are specifically stated to be used for many purposes, and typically take up over 200MB of data on their own. Atlases are textures / materials which are packed together for the sake of optimizing Draw Calls. Rather than forcing the system to issue a draw call per texture, the atlas is used to grab the lot. This saves a lot of time – for example, if there’s a single texture atlas which a bunch of foliage has to use, then that foliage can be drawn multiple times in the same draw sequence, rather than drawing one, then another and so on.
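As a rough sketch of why atlasing helps (the texture names and the batching model here are purely illustrative, not Sucker Punch’s actual code), consider a renderer that issues one draw call per distinct texture:

```python
from collections import defaultdict

def count_draw_calls(objects):
    """One draw call per distinct texture: objects that share a
    texture get submitted together as a single batch."""
    batches = defaultdict(list)
    for name, texture in objects:
        batches[texture].append(name)
    return len(batches)

# Without an atlas, each foliage type carries its own texture...
separate = [("bush", "bush.png"), ("fern", "fern.png"), ("grass", "grass.png")]

# ...with an atlas, they all reference one packed texture,
# so the whole lot collapses into a single draw call.
atlased = [("bush", "foliage_atlas.png"), ("fern", "foliage_atlas.png"),
           ("grass", "foliage_atlas.png")]
```

Three separate textures cost three draw calls; the atlased version costs one, and the saving scales with how many objects share the atlas.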
There’s a second benefit to Atlases – texture dimensions (that is, the physical size of the texture) generally have to be powers of two. So for example, a small texture would be 256×256 and a slightly larger one 512×512. This is great until you have a texture that doesn’t neatly fit those dimensions – say the object is a strange shape. If you were to use a larger texture to wrap around it, it’d be excessive for your needs. This means you’re wasting space in memory and memory bandwidth (larger textures take up more memory and cost more bandwidth to move around), and it’s also harder to work with than necessary. Packing such odd-shaped textures together into an atlas reclaims that wasted space.
Render Targets take up around 290MB of RAM, which isn’t unexpected given the size of image we’re dealing with here. Infamous Second Son is rendered at full 1080P (1920×1080), and in the G-Buffers we’re also seeing Shadow Maps, Gloss and other effects being held. Color modes are fairly complex to explain, but as a general rule, if we take RGBA8 as an example, this means 8 bits per channel with 4 channels total (Red, Green, Blue and Alpha) – see wiki here. This means every pixel takes up 32 bits in memory, so for RGBA8 we can calculate 32 * 1920 * 1080 / 8: the color depth multiplied by the width and the height, then divided by 8 to convert bits to bytes.
If we count all of the buffers to calculate their memory we’re left with:
Three RGBA8 = 25MB
Three RGBA16 = 50MB
One D32F = 8MB
One S8 (Stencil Buffer) = 2MB (only a single channel)
If you then add all of these up, 25MB + 50MB + 8MB + 2MB = 85MB.
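The arithmetic above is easy to verify – a quick sketch (using decimal megabytes, which matches the rounding in the slide) reproduces the 85MB total:

```python
W, H = 1920, 1080  # full 1080P render resolution

def buffer_mb(bits_per_pixel, count=1):
    """Size in (decimal) megabytes of `count` full-screen buffers:
    bits-per-pixel * width * height, / 8 for bytes, / 1e6 for MB."""
    return count * bits_per_pixel * W * H / 8 / 1_000_000

sizes = {
    "3x RGBA8":  buffer_mb(32, 3),  # 8 bits x 4 channels
    "3x RGBA16": buffer_mb(64, 3),  # 16 bits x 4 channels
    "1x D32F":   buffer_mb(32),     # 32-bit float depth
    "1x S8":     buffer_mb(8),      # single 8-bit stencil channel
}
total = sum(sizes.values())
```

Each rounded figure matches the list above (25, 50, 8 and 2MB), and the grand total comes out at roughly 85MB.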
Command Buffers, for reference, are very much what the name implies: they hold GPU commands in storage (in other words, instructions issued to the graphics card). Given the heavy use of compute in Infamous Second Son, there’ll likely be a share of GPGPU commands mixed in.
Heap: this is the memory that’s used while a task is being executed – the ‘dynamic’ memory of the program. Each time a variable or new object is created at runtime it’s stored on the heap, which the programmer has to carefully manage. Sucker Punch dedicate 100MB to this in Infamous Second Son.
PS4 CPU Threading
Just so we’re all up to speed, the PS4’s CPU is based on the AMD Jaguar, an x86-64 CPU. Unlike the ‘standard’ Jaguar this is customized: it consists of two modules of four cores each (eight total), with a link between the two modules to share information. We’ve discussed the PS4’s CPU previously in the second page of our analysis of the Naughty Dog presentation – there’s a large hit in performance when you’re accessing data that’s stored in the other cluster’s cache. Two of the Jaguar’s CPU cores are reserved for OS functionality, however, leaving six cores available for game developers to use and abuse.
Unsurprisingly for such a huge open world, Infamous Second Son relies on hundreds of ‘jobs’ being performed by the CPU per cycle, and the engine is built to take full advantage of both the GPU’s and CPU’s parallel processing. Animation, lighting, Command Buffers and particle collision are hungry beasts – having 50+ threads each.
Thread Contention occurs when a thread is left waiting on a particular resource until it’s been freed up. For example, let’s assume you’ve got thread A and thread B. Thread A wishes to perform an operation on a specific object, but currently it’s locked by thread B. Thread A can’t do anything but wait until B finishes and unlocks it. This can be expensive, as can thread kicking.
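A minimal sketch of the A/B scenario above (using Python threads purely for illustration – the engine itself is native code): thread A contends for the lock but can do nothing until B releases it, so A’s work is forced to happen last.

```python
import threading
import time

events = []
lock = threading.Lock()

def thread_b():
    with lock:                 # B grabs the shared resource first
        events.append("B locked")
        time.sleep(0.05)       # B does some work while holding it
        events.append("B done")

def thread_a():
    time.sleep(0.01)           # ensure B wins the race for the lock
    with lock:                 # contention: A blocks here until B unlocks
        events.append("A ran")

b = threading.Thread(target=thread_b)
a = threading.Thread(target=thread_a)
b.start(); a.start()
b.join(); a.join()
```

However quickly A is scheduled, its critical section cannot begin until B’s finishes – that blocked time is the cost of contention.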
Animation submits and kicks threads at higher granularity. To define ‘submit’ and ‘kick’: submitting is much as the name implies – you’re submitting a work item into the thread pool – and that work is then ‘kicked off’ and started on a thread.
Granularity refers to how many threads there’ll be working on things, and comes down to the developer and what they’re trying to achieve with the hardware they’re using. Higher granularity means using more threads which each do a little work, rather than fewer threads which each do a lot. Obviously this can result in higher communication overhead, but it depends on the tasks. Clearly high granularity allows a great degree of freedom and control, but at the expense of complexity.
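To illustrate the trade-off (using a generic thread pool here, not Sucker Punch’s actual job system): both granularities produce the same result, but the fine-grained version hands the scheduler many more, smaller items it can balance across workers – at the cost of more submission overhead.

```python
from concurrent.futures import ThreadPoolExecutor

def process(chunk):
    """A stand-in work item: sum one slice of the data."""
    return sum(chunk)

data = list(range(1000))

# Low granularity: two big work items, little scheduling overhead,
# but a slow item can leave other workers idle.
with ThreadPoolExecutor() as pool:
    coarse = sum(pool.map(process, [data[:500], data[500:]]))

# High granularity: twenty small work items – more submit/kick
# overhead, but much finer load balancing across the pool.
with ThreadPoolExecutor() as pool:
    chunks = [data[i:i + 50] for i in range(0, 1000, 50)]
    fine = sum(pool.map(process, chunks))
```

The answers are identical either way; only the scheduling behavior (and its overhead) changes.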
Physics and Pathing use atomics. To clarify, an operation on shared memory is atomic if it completes in a single step relative to the other threads around it. When accessing shared data, careful control is required to ensure the data doesn’t become corrupt or suffer other issues. This is extremely important when dealing with calculations, where things such as the order of operations can have a huge impact on a variable’s result. Atomics therefore perform a Read-Modify-Write operation, with access to the memory location being ‘worked’ on blocked until that action completes.
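Python has no hardware atomics, so the sketch below emulates an atomic fetch-and-add with a lock – the point is the indivisible Read-Modify-Write: four threads each perform 10,000 increments and, because no thread can interleave with a half-done update, none are lost.

```python
import threading

class AtomicCounter:
    """Emulates an atomic fetch-and-add: the read-modify-write happens
    as one indivisible step under the lock, so no other thread can
    observe or interleave with a partially completed update."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self, n=1):
        with self._lock:
            old = self._value
            self._value = old + n   # read, modify, write as one unit
            return old

    @property
    def value(self):
        with self._lock:
            return self._value

counter = AtomicCounter()

def worker():
    for _ in range(10_000):
        counter.fetch_add()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

On real hardware (the PS4’s Jaguar included) this is a single CPU instruction rather than a lock, which is why atomics are so much cheaper than full mutexes for simple shared counters.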
Workers are not core locked; despite “core swapping” being inefficient, it’s better than ‘preemption’. In this case worker threads aren’t bound to a particular CPU core, and can ‘jump’ between the various CPU cores (based on what’s free and ready to go). There is inherent latency associated with switching between cores, but Sucker Punch found that with the Infamous Second Son engine it’s more efficient (better performance) than the preemption alternative. Preemption is the act of pausing a task and resuming it later, so that all processes get their time slice and fair share of CPU time.
According to Sucker Punch, the PS4’s CPU is capable of handling 30,000 Draw Calls per second, along with animating 50 – 100 characters, each containing 300+ bones. Interestingly enough, the AMD Jaguar’s prefetch isn’t as effective as the PS3 Cell processor’s DMA in some situations. We’ll explore this more in an upcoming analysis.
PS4 GPU Compute In Infamous Second Son
Infamous Second Son, as expected, uses GPU compute for a number of different tasks, including lighting, particles and much more. The PS4, as we’ve previously discussed, uses AMD’s Radeon Graphics Core Next (GCN) architecture. This architecture has been improved somewhat, featuring some of the same improvements that have made their way into the ‘Volcanic Islands’ GPUs from AMD. Specifically, these are beefed up compute queues which allow graphics and compute work to be interleaved more easily. Sony also implemented a ‘Volatile Bit’ in the Level 2 cache of the GPU, which effectively tags the cache lines used by compute so that they can be modified / erased selectively instead of requiring an entire Level 2 cache flush. The Level 2 cache is also fully coherent, so each of the Compute Units (each CU holds 64 shaders) can see and access the same piece of data.
Sucker Punch state that the PS4’s GPU compute has both positives and drawbacks. Due to their nature, GPUs are incredibly parallel. This means the hundreds of processors (shaders) which make up the PS4’s GPU can be used to perform a number of different tasks. Because they operate in a SIMD (Single Instruction, Multiple Data) fashion, several (or more) of these shaders can come together to work on a particular task. The PS4’s GPU is easy to use, and with a large cache it’s well suited to loading up data and having it run without needing to queue it on the more limited number of CPU cores.
The GPU however isn’t without a few problems. The obvious one is that compute is sharing the same resource that’s responsible for graphics – and all the careful queuing in the world doesn’t provide more raw power. In the end the PS4’s GPU provides 1.84TFLOPS of performance which is being split over multiple tasks. Optimization is still extremely important to ensure frame rates don’t take a hit when things start to happen on screen.
The PS4’s GPU and CPU are located on the same physical die (an APU), so data is shared between them quickly. There’s been some disagreement on the speed of the interconnect between the AMD Jaguar CPU and the GPU, but most would agree it is indeed 20GB/s. The PS4’s memory architecture is indeed hUMA (Heterogeneous Unified Memory Architecture), which means the CPU and GPU address the exact same memory space – for more info click here.
It’s important that data is correctly synced between GPU and CPU processing so that both are addressing the right piece of data in the correct ‘state’. Due to issues such as uncertainty over when a piece of data will arrive, launch times and transfer times (from main memory and cache), it’s important to optimize as much as possible – but due to that unpredictable nature it’s hard to simply do so. GPUs are known to give great performance in highly parallel, throughput oriented tasks, and can therefore better hide the latencies associated with this.
Despite the PS4’s hUMA architecture largely removing the need for memory copies thanks to the shared memory space, there are a few other issues. A large part of the overhead and performance loss comes from the ‘coarse granularity’ of the sync. Typically the GPU works on one piece of data while others are being fetched for it, so a constant flow of data is provided.
Long running compute however can be difficult, due to sometimes needing to prioritize graphics work ahead of, say, computing physics. Divergent Branches occur when a thread within a group (Nvidia call these groups Warps, AMD refers to them as Wavefronts) takes a different execution path from its neighbors during a conditional branch, and this can lead to a huge loss in performance. Careful use of labeling and control paths is crucial to minimize it.
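A toy model (entirely illustrative – real GCN wavefronts are 64 lanes executing in lockstep) shows why divergence hurts: when lanes disagree at a branch, the SIMD unit has to execute both sides, masking off the inactive lanes each time.

```python
def simd_branch_cost(lanes, condition):
    """Toy wavefront model: returns how many branch bodies the SIMD
    unit must execute. If every lane agrees, only one side runs; if
    they diverge, BOTH sides run (with inactive lanes masked off)."""
    taken = [condition(x) for x in lanes]
    then_cost = 1 if any(taken) else 0        # some lane takes the 'if'
    else_cost = 1 if not all(taken) else 0    # some lane takes the 'else'
    return then_cost + else_cost

# All 64 lanes take the same path: one pass through the branch.
uniform = simd_branch_cost(range(64), lambda x: True)

# Lanes split down the middle: the unit pays for both paths.
divergent = simd_branch_cost(range(64), lambda x: x % 2 == 0)
```

The uniform wavefront pays for one branch body, the divergent one for two – on real hardware the slowdown compounds with nested branches, which is why keeping threads in a group on the same path matters so much.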
Frontloading compute operations, rather than issuing them with a more ‘one at a time’ approach, seems to reduce the latency of the commands. Reducing the register count (the virtual registers on the GPU) and ensuring occupancy (how much work is resident on a particular GPU processor) is optimized are also essential.
That’s it for part one folks – part two will be up soon and we’ll be diving in to the visuals of Infamous Second Son. Special thanks to The Kid on RGT for the help!