Sony launched the original PlayStation 4 roughly three years ago, and at that time the notion of iterative hardware upgrades for consoles was the furthest thing from customers’ minds. But this all changed by summer 2016, when leaks and rumors indicated that Sony (and Microsoft) had decided a mid-generation upgrade was the best option to keep up with the ever-increasing demands for visual fidelity, higher resolutions and smooth frame rates.
Just before E3 2016, Sony reluctantly admitted that the device was real but wouldn’t be shown off during the show; instead we had to wait until mid-October, at a special PlayStation event where Sony bared all. Those expecting a last-minute panic upgrade in response to Microsoft’s Project Scorpio (which the company confirmed would feature a 6 TFLOP GPU) would be disappointed, as the ‘basic specs’ of the machine were identical to those which had leaked out. But despite the CPU being clocked only 30 percent higher, and a GPU capable of ‘just’ 4.2 TFLOPS, the console’s lead architect, Mark Cerny, hinted that there were secrets lurking inside the hardware which mean the console is more capable than a simple spec-sheet comparison would indicate.
https://www.youtube.com/watch?v=b1UWs8myhZU
During a recent conference held at Sony’s new San Mateo HQ, Mark Cerny was quoted as saying: “When we design hardware, we start with the goals we want to achieve. Power in and of itself is not a goal. The question is, what that power makes possible.”
To Mr. Cerny, the PlayStation 4 Pro isn’t a new console generation; it’s simply the original PlayStation 4, albeit on steroids, and this has a number of bonuses, development time being the most obvious of those. For Sony, the design of the PS4 was a marked departure from the PS3 – which used a combination of a Cell processor, a custom Nvidia graphics chip based on the GeForce 7800 GTX, and two separate memory pools. While the PS3 was no slouch (far from it), the console was notoriously hard to develop for, with the Cell’s SPEs requiring experience and skill to get the most from.
To Sony, a new generation of consoles is a good time for developers, but also a confusing one and a lot of work. By Cerny’s estimation, 25 percent of developers left the game industry when the PS1, Saturn and N64 sprang onto the scene. The PS3 was also tricky thanks to the Cell architecture, and so the PS4 was designed to be simpler for developers to understand. Even so, Cerny says he was forced to write a 434-page document detailing how to get the most out of the console’s more advanced GPU. One can presume much of the bulk of those 434 pages describes how the PlayStation 4’s GPU compute functions, which would be critical to getting the most out of the then-new generation.
Cerny wasn’t keen to repeat the experience with the PS4 Pro: “we showed Days Gone running on PS4 Pro at the New York event. That work was small enough that a single programmer could do it. In general our target was to keep the work needed for PS4 Pro support to a fraction of a per cent of the overall effort needed to create a game, and I believe we have achieved that target.” In an interview with the Japanese website AVWatch, Cerny also pointed out that getting the PS4 Pro version of Days Gone running took only “two months,” and, more astonishing still, the initial part of the work took just one person about three weeks.
The PS4 Pro’s GPU (and other components) seem rather conservative given the performance delta needed to drive a native 4K image versus 1080p. 4K quite literally requires 4x the number of pixels to be pushed on screen, a stark contrast to the GPU’s performance uplift of just 2.28x over the vanilla machine. But that’s where Sony’s various ‘tweaks’ to the PS4’s existing architecture (and a number of new AMD technologies) come into play, so let’s start exploring them, shall we?
“If you want to play a game from 2-3 years ago that hasn’t been patched or tested, we just run that at 1.6 gigahertz,” Cerny explained of Sony’s decision not to be more aggressive with their choice of CPU. “For variable frame-rate games, we were looking to boost the frame-rate. But we also wanted interoperability. We want the 700 existing titles to work flawlessly,” Cerny continued. “That meant staying with eight Jaguar cores for the CPU and pushing the frequency as high as it would go on the new process technology, which turned out to be 2.1GHz. It’s about 30 per cent higher than the 1.6GHz in the existing model.”
In a more recent interview with the Japanese website AVWatch (translated via Google, so I’ve needed to clean up the wording slightly), Cerny confirmed “the number of computing units doubled from 18 to 36” and that the operating clock saw a “14% increase to 911MHz”, so the PS4 Pro now sports 4.2 TFLOPS versus the 1.84 TFLOPS of the launch system (or the PS4 Slim). The operating clock of the GDDR5, the console’s main memory, is also “up 24%, bandwidth is 218 GB/s” (the original memory’s effective clock was 5,500MHz).
Cerny also reaffirmed that the PS4 Pro’s SoC uses a 16nm FinFET process, the same as the PlayStation 4 Slim’s. Because the GPU is 2x the size of the original machine’s, it eats up much of the SoC’s space, and despite the smaller FinFET process (16nm versus the original machine’s 28nm) a larger body was needed to house the PS4 Pro’s innards, which Sony pegs at 19 percent more volume than the original models.
So, given Sony’s comments, compatibility with existing titles – particularly games which run at a variable frame rate – is the reason the team stuck with the Jaguar technology. Switching to a ‘faster’ x86 architecture could have introduced issues thanks to higher IPC (instructions per clock), which means simply lowering the clock speed to match the launch-model PS4 wasn’t possible. A patent filed back in February 2015 (hinting the PS4 Pro’s development was quite the journey) highlights how much of this was achieved. As a game is loaded, the PS4 Pro checks whether the title has a ‘Pro / Neo mode’. If the answer is yes, the CPU’s clock speed increases to 2.1GHz. If the answer is no (the game hasn’t been coded to take advantage of the Pro), the PS4 Pro’s CPU downclocks to the vanilla system’s original 1.6GHz (think of it like how a modern desktop CPU uses SpeedStep to raise or lower the processor’s multiplier based on workload).
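To make the patent’s described behaviour concrete, here’s a minimal sketch of that boot-time check; the function and constant names are hypothetical, since Sony’s actual firmware interfaces aren’t public.

```python
# Hypothetical sketch of the boot-time compatibility check the patent
# describes -- names invented for illustration.

BASE_CPU_CLOCK_GHZ = 1.6  # launch PS4 Jaguar clock
PRO_CPU_CLOCK_GHZ = 2.1   # PS4 Pro Jaguar clock

def select_cpu_clock(title_has_pro_mode: bool) -> float:
    """Return the CPU clock to apply when a game is loaded."""
    if title_has_pro_mode:
        # Title was patched/tested for the Pro: run at full speed.
        return PRO_CPU_CLOCK_GHZ
    # Unpatched legacy title: downclock so timing matches the launch PS4.
    return BASE_CPU_CLOCK_GHZ

print(select_cpu_clock(True))   # 2.1
print(select_cpu_clock(False))  # 1.6
```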
For those unfamiliar with the PlayStation 4’s (and, for that matter, the Xbox One’s) choice of CPU, Jaguar is an x86-64 processor designed for low-power devices. Both console manufacturers opted for an eight-core Jaguar configuration, comprised of two quad-core modules residing in the system’s APU (the same APU housing the GPU and other components too). When the consoles first launched, six of the eight cores were available for game developers to run their code on, but as the generation progressed Sony and Microsoft released OS updates which essentially ‘unlocked’ the seventh core for developers’ use. The amount of processing time on the seventh Jaguar core is dependent upon what else is going on in the system at the time (for example, Kinect voice usage on the Xbox One siphons up much of the seventh core’s performance).
Imagine an updated x86 CPU with, say, a 10 percent higher IPC than Jaguar. Even downclocked to 1.6GHz, its timing wouldn’t accurately match the original console, because of those IPC improvements. And since IPC gains aren’t uniform across all tasks, you couldn’t simply downclock further to offset them either. It’ll be interesting to see how Microsoft counters these issues with their Scorpio system, particularly given the system features a much faster GPU, and Phil Spencer has recently gone on record touting how ‘balanced’ the components (CPU, GPU and RAM) inside Scorpio are.
But surely x86 is a great leveller? Surely upgrading the CPU shouldn’t make a difference – after all, it doesn’t on PC. It simply makes things better, right? Sony disagrees when it comes to a fixed-platform console.
“Moving to a different CPU – even if it’s possible to avoid impact to console cost and form factor – runs the very high risk of many existing titles not working properly,” Cerny explains. “The origin of these problems is that code running on the new CPU runs code at very different timing from the old one, and that can expose bugs in the game that were never encountered before.”
So there’s the confirmation that Jaguar once again takes up the mantle as the PS4 Pro’s CPU; many tech analysts had entertained the possibility Sony would opt for a slightly beefier processor, possibly based on Jaguar’s ‘successor’, the Puma series of cores. While Puma is very similar in terms of architecture, there were a few subtle changes, including lower power consumption and higher potential clock speeds (such as the A6-6310 running at 2.4GHz). For those wondering, 2.1GHz doesn’t quite ‘fully tap out’ Jaguar’s clock speed, even on 28nm. The desktop AMD Athlon 5370 runs at 2.2GHz (a mere 100MHz more than the Pro’s configuration), but a desktop has the benefit of less stringent power consumption and heat output confines compared to a console.
So, what did Sony do to alleviate the memory issue? Well, they simply added DDR3 memory to shunt off applications which aren’t in focus, increasing the reserve of the faster GDDR5 memory available for games. “We felt games needed a little more memory, about ten percent more. So we added about a gigabyte of slow, conventional DRAM to the console,” said Cerny.
“High-resolution graphics do need more memory. Estimates of what would be needed to double the display resolution of games were in the 300 to 400 megabyte range,” said Cerny, justifying the 1 GB addition. “But adding memory is a double-edged sword. With more memory it’s possible to have higher-resolution textures and more detailed models, but that requires developers to create those assets. If we go that route, rather than asking the developers for an increase of a fraction of a percent in their effort, we end up with them needing to spend [much more] on assets.”
“On the standard [PS4], if you’re swapping between an application like Netflix and a game, Netflix is still resident in system memory, even when you’re playing the game. We use that architecture because it allows for very quick swapping between applications. It’s all already in memory,” said Cerny.
“On PS4 Pro, we do things a bit differently. When you stop using Netflix, we move it to the gigabyte of slow, conventional DRAM. Using that sort of strategy frees up almost a gigabyte of our 8 GB of GDDR5. We use 512 megabytes [of that] for games, which is to say that the games can use 5.5 GB rather than 5 GB. And we use most of the rest to make the PS4 Pro interface 4K, rather than the 1080p it’s been to date. So when you hit the PS4 button, that’s a 4K interface.”
This information confirms leaks from developers that the PS4 Pro does indeed offer an additional 512MB of RAM for games. Whether 512MB is sufficient in the grand scheme of things (especially in light of Scorpio, which is rumored to feature up to 12GB) remains to be seen. There have been a few reports bandied about on forums and websites hinting developers had asked Sony for more memory, but how much faith to put in these comments, or whether developers’ fears turn out to be unfounded, remains a mystery.
According to Cerny, the additional 512MB of RAM available to developers inside the PlayStation 4 Pro is theirs to use however they want; but generally developers will use the memory for render targets (in simple terms, a render target is a section of system memory into which the 3D scene is rendered so that pixel shaders can jump in and finalize the image; one can presume this will be pretty key for tiled rendering techniques, but other post-processing can be applied too, such as various edge detection and blurring).
For those used to high-end PCs, you might be wondering how super-high-resolution texture packs fit into the mix, particularly with larger game worlds… the answer is they probably won’t. Judging from Mark Cerny’s comments, such texture packs can cost studios a lot of additional money to produce, and therefore don’t quite sit with Sony’s philosophy of making 4K support easier for the studio. With this said, how well this argument holds up remains to be seen. With the PC already a heavily supported gaming platform, and with Scorpio releasing only a year later (with about 4GB of extra RAM, judging from Microsoft’s own images), studios might well be much more willing to push the boat out for 4K textures and visuals. Will this leave Sony’s machine behind?
Sony’s strategy of using the DDR3 memory as a sort of swap pool is a pretty smart one. As we’ll discuss a little more later in this article, the GDDR5 does sport a higher clock speed (pushing bandwidth from 176GB/s to 218GB/s). Sony’s solution appears focused on elegance, and continues to highlight that the company wasn’t looking to ‘reinvent the wheel’ with the PS4 Pro.
Mark Cerny explains that the GPU doubles the raw number of compute units of the original PS4. “We doubled the GPU size by essentially placing it next to a mirrored image of itself, rather like the wings of a butterfly,” he said. “That gives us an extremely clean way to support the 700 existing titles, because we can turn off half the GPU and just run something that’s very close to the original GPU.”
“We were also able to take advantage of silicon process improvements (highlighted above, where Cerny confirms it’s a 16 nm FinFET process) and boost the frequency by 14 percent, to 911 MHz, which is what gets us from 2x the power to 2.28x the power,” said Cerny. “Additionally, we’ve added in a number of AMD roadmap features and a few custom features [we’ll get to this further in the analysis – Paul]. Some of these give us better efficiency when rendering for high-resolution displays. We also have support for more efficient rendering for PlayStation VR.”
Mark Cerny then points out that the clock speed is boosted from the original PS4’s 800MHz to 911MHz, and that the original GPU had 18 compute units. Each CU contains 64 ‘processors’ (more on this later), so, simply put, Sony doubled the count to 36 compute units across this GPU ‘mirror’, raising the total from 1,152 stream processors in the ‘vanilla’ PS4 to 2,304 in the PlayStation 4 Pro. In essence, when the PS4 Pro runs a game which has NOT been patched, the system simply ‘disables’ the additional resources and lowers the clock speed back to the original 800MHz, essentially emulating the functionality and performance of the original console’s GPU.
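For those wanting to check the arithmetic, the headline TFLOPS figures fall straight out of the GCN layout – each stream processor retires one fused multiply-add (two floating-point operations) per clock:

```python
# Peak FP32 throughput for a GCN GPU: CUs * 64 SPs * 2 FLOPs * clock.

def gcn_tflops(compute_units: int, clock_mhz: float) -> float:
    stream_processors = compute_units * 64   # 64 SPs per CU
    flops_per_cycle = stream_processors * 2  # one FMA = 2 FLOPs
    return flops_per_cycle * clock_mhz * 1e6 / 1e12

print(round(gcn_tflops(18, 800), 2))  # launch PS4: 1.84 TFLOPS
print(round(gcn_tflops(36, 911), 2))  # PS4 Pro:    4.2 TFLOPS
```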
For those wondering about a ‘boost’ in performance for games which don’t support ‘Pro mode’, the answer is – no. As you may know, Microsoft’s Xbox One S does feature a slightly faster GPU (running at 914MHz compared to the launch machine’s 853MHz). Some games, such as Gears of War 4, benefit from this additional clock speed with slightly better frame rates, but Sony has opted not to take this route. Mark Cerny was quoted as saying they needed everything to “work flawlessly” and they didn’t want people to be “conscious of any issues that may arise” should a customer make the jump from the PS4 to the Pro.
https://www.youtube.com/watch?v=zNSq42s9Lps&t=1s
“The old GPU next to a mirror version of itself. We just turn off the second GPU,” said Cerny in an interview with The Verge. “Developers can patch these titles to boost graphics and performance in very subtle ways. But unless you have a 4K television, the difference will not be substantial.”
During the PlayStation 4 Pro’s announcement, Mark Cerny went on record and confirmed that the system is indeed using AMD’s ‘Polaris’ architecture (GCN 4.0), but Cerny also hinted Sony took advantage of technology ‘several generations’ ahead of Polaris too. Let’s not jump too far ahead of ourselves, and go through a list of what Polaris offers first.
The PlayStation 4 Pro’s memory layout was a bit of a head-scratcher for some – not just the amount of RAM (we’ll get to that), but also the memory bandwidth increase over the original unit seemed a little… well, stingy. The launch console offered 176GB/s of bandwidth, thanks to a 256-bit memory bus and GDDR5 RAM at a 5,500MHz effective clock (5,500 × 256 ÷ 8 = 176,000MB/s, or 176GB/s). It was easy to expect Sony to either widen the bus or, more realistically, opt for considerably faster GDDR5 chips, but instead we’re left with a 24 percent increase in bandwidth: 218GB/s compared to the previous 176GB/s (for those wondering, the GDDR5 chips will be running at around a 6,800MHz effective clock – 218,000 × 8 ÷ 256 ≈ 6,812MHz).
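Working those figures both ways in a couple of lines (GDDR5 bandwidth is simply the effective data rate multiplied by the bus width):

```python
# Bandwidth in GB/s = effective clock (MHz) * bus width (bits) / 8 / 1000.

def bandwidth_gbs(effective_mhz: float, bus_bits: int) -> float:
    return effective_mhz * bus_bits / 8 / 1000

print(bandwidth_gbs(5500, 256))  # launch PS4: 176.0 GB/s
print(218_000 * 8 / 256)         # Pro's implied effective clock: 6812.5 MHz
```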
The answer to why the Pro needs only 24 percent more bandwidth lies primarily in the choice of the Polaris architecture. AMD themselves have often touted the benefits of Polaris, and even the RX 480 (which also sports 36 compute units, but runs at almost 1300MHz) makes do with 256GB/s of memory bandwidth. Efficiency is the name of the game – and several technologies come into play.
The original PS4 gave up about 20GB/s of bandwidth to the Jaguar processors, so even if we assume this number increases by the same 30 percent as the raw CPU clocks, the PS4 Pro’s CPU only eats up about 26GB/s – or, to be on the safe side, let’s reserve 30GB/s for the CPU, leaving roughly 190GB/s for the GPU and everything else. Given the PS4 Pro’s GPU is considerably more efficient with bandwidth than the older PS4 models’, we can safely assume Cerny and his team left ample bandwidth to push the frames, but we’ll naturally need to wait and see what developers say on this issue over the coming months.
Delta Color Compression, for example, is explained by Cerny thus: “DCC allows for inflight compression of the data heading towards frame buffers and render targets, which results in the reduction of the bandwidth used to access them.” This compression is lossless, and because the data is made smaller, it requires less bandwidth to move around compared to older GCN versions.
Digging through AMD’s own Polaris technical slides hints at just how much bandwidth may be saved over the older GCN architectures, and the answer depends heavily upon the scene. With ‘compressible data’ the savings were up to 35 percent, thanks to a combination of an improved cache and DCC. For example, once again quoting AMD’s own technical slides, geometry meshes can be stored in the geometry engine rather than constantly farming out to the L2 cache for the data – quite a bonus for heavily replicated meshes (such as organic shapes like trees, grass and hair). Furthermore, the peak compression ratio is 8:1, which can be achieved on objects comprised of ‘blocks’ of the same color. So, for example, a set of clothing or a car, which will have large patches of a single color, can make heavy use of DCC. How many changes (if any) AMD has made to these technologies for the PS4 Pro isn’t clear, but assuming even Polaris-level DCC, there are major improvements over the base PS4.
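To get an intuition for why flat-coloured surfaces compress so well, here’s a toy delta-encoding sketch. Real DCC operates on fixed-size pixel blocks in dedicated hardware, so this is purely illustrative:

```python
# Toy delta encoding: store the first pixel, then per-pixel differences.
# A patch of one colour produces nothing but zero deltas -- which is why
# large same-colour regions (car paint, clothing) compress so heavily.

def delta_encode(pixels):
    return [pixels[0]] + [b - a for a, b in zip(pixels, pixels[1:])]

flat_patch = [0x336699] * 64  # 64 pixels of a single colour
deltas = delta_encode(flat_patch)[1:]
print(sum(1 for d in deltas if d != 0))  # 0 -- almost nothing left to store
```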
Speaking of cache, the amount in the PS4 Pro is currently a bit of a mystery (unless you’re reading this in the future, in which case you might well know)! For the record, the original PS4 featured 512KB of Level 2 (L2) cache. Higher-end AMD desktop parts prior to Polaris featured up to 1MB of L2 cache, but Polaris features up to 2MB. So how much cache the PS4 Pro features isn’t known. We could see the same amount as the original unit (unlikely), double the amount (in keeping with Sony’s philosophy of doubling the PS4’s GPU for the Pro), or finally the ‘full’ 2MB found in the 36-CU RX 480 (remember, there are 36 CUs in the PS4 Pro too).
AMD also added a Primitive Discard Accelerator to Polaris, and Cerny explains its job is to remove triangles which are too small to affect the look of the rendered scene. In a nutshell, this means the GPU will intelligently ‘nuke’ triangles (the building blocks of geometry used to ‘draw’ the in-game worlds) if it deems them unimportant to the look of the scene. The PDA will mostly benefit games running on the PS4 Pro which push a lot of polygons – cloth, water and other complex, dense geometry.
How much of a difference does the Primitive Discard Accelerator make? Once again, it will depend upon the scene and objects, but using the same technical slides linked above, AMD hints that anti-aliased scenes with heavy amounts of tessellation can be up to 3.5x faster on Polaris compared to their older architectures (we’ll get much more into scene reconstruction, anti-aliasing and so on in the next part of this series). To quote AMD: “Polaris geometry engines can detect triangles that have no area, and discard them during the input assembly stage. As vertex indices are read from the input buffer, the Polaris geometry engine will check if two or more vertices have the same coordinates (i.e., degenerate triangles). The degenerate triangles are culled before they are passed to the vertex shaders, which increases throughput by reducing the amount of work done and reducing the energy consumed by eliminating the vertex fetches for degenerate triangles.”
“… It is common that small or very thin triangles do not intersect any pixels on the screen and therefore cannot influence the rendered scene. The new geometry engines will detect such triangles and automatically discard them prior to rasterization, which saves energy by reducing wasted work and freeing up the geometry engines to rasterize triangles which will impact the scene. The new filtering algorithm can improve performance by up to 3.5X.”
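A minimal sketch of the zero-area test AMD describes – if two or more of a triangle’s indexed vertices share the same coordinates, the triangle can be culled before it ever reaches the vertex shaders. The function below is illustrative, not AMD’s actual hardware logic:

```python
# Degenerate-triangle check: a triangle whose indices resolve to two or
# more identical vertex positions has zero area and can be discarded.

def is_degenerate(tri, vertices):
    a, b, c = (vertices[i] for i in tri)
    return a == b or b == c or a == c

vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(is_degenerate((0, 1, 2), vertices))  # True -- v1 and v2 coincide
```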
“The work distributor in PS4 Pro is very advanced. Not only does it have the fairly dramatic tessellation improvements from Polaris, it also has some post-Polaris functionality that accelerates rendering in scenes with many small objects… So the improvement is that a single patch is intelligently distributed between a number of compute units, and that’s trickier than it sounds because the process of sub-dividing and rendering a patch is quite complex.”
Like all modern GPUs, the PS4 Pro’s GPU is comprised of multiple shaders (ALUs) numbering in the hundreds or thousands. Nvidia calls each of these shaders a CUDA core, whereas AMD (the manufacturer responsible for outfitting both Microsoft’s and Sony’s consoles) refers to them as Stream Processors. In the original PS4, each compute unit (CU) contained 64KB of local (on-chip) memory, 16KB of L1 cache, and general-purpose vector and scalar registers (VGPRs/SGPRs). Furthermore, each CU housed 64 Stream Processors, broken down into a set of 4 SIMD (Single Instruction, Multiple Data) units, each housing 16 Stream Processors (16 x 4 = 64).
As you can see from the above slides, Polaris does indeed enjoy some significant tweaks to the shader efficiency of the GPU, and AMD boasts that on average Polaris should deliver about 15% improved performance per CU over the older GCN architecture – which naturally translates rather well to the PlayStation 4 Pro’s own performance.
The important thing about this is ensuring that each CU is doing its own share of work. Mark Cerny explains “Once a GPU gets to a certain size, it’s important for the GPU to have a centralised brain that intelligently distributes and load-balances the geometry rendered. So it’s something that’s very focused on, say, geometry shading and tessellation, though there is some basic vertex work as well that it will distribute.”
Cerny indicates that the PS4 Pro’s GPU is considerably more advanced than the original PS4’s, and indeed even more advanced than AMD’s current Polaris architecture. While it’s outside the scope of this article to go in-depth into the GPU rendering pipeline: vertex work deals with the ‘corners’ of triangles (each triangle has three vertices, each ‘held’ in 3D space using coordinates). Geometry shaders are a little (well, a lot) more complex to understand; in essence they can dramatically alter primitives by adding points to them, removing them, or transforming them into dramatically more complex objects.
Meanwhile, tessellation is the stage in the rendering pipeline where patches of vertex data are subdivided into smaller primitives. One of the primary reasons tessellation was introduced to graphics APIs was to counter the static nature of 3D models and levels of detail (LOD). For example, character models (such as faces) require higher levels of detail when closer to the camera (as you can see more detail, such as subtle ripples in muscles and joints, additional hair details and so on), but at further distances fewer triangles are needed to accomplish a similar aesthetic, and less power is required to process those lower-detail models. Tessellation can subdivide each triangle and add in additional detail based on the distance of the camera, and thus reduce artist workloads – see the sketch below.
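As a rough illustration of that idea, the sketch below picks a subdivision level from camera distance; the thresholds and subdivision counts are invented purely for the example:

```python
# Distance-based tessellation: nearby patches get subdivided more.
# Each subdivision pass splits every triangle into four.

def tessellation_level(distance: float) -> int:
    if distance < 10.0:
        return 4  # close to the camera: heavy subdivision
    if distance < 50.0:
        return 2
    return 0      # far away: render the base mesh as-is

def triangle_count(base_triangles: int, level: int) -> int:
    return base_triangles * 4 ** level

print(triangle_count(500, tessellation_level(5.0)))    # 128000 up close
print(triangle_count(500, tessellation_level(200.0)))  # 500 in the distance
```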
Another mystery is what Sony has done in regards to geometry engines (geometry processors). Both the Xbox One (and S) and the original PlayStation 4 touted two geometry engines each (there was a common misconception early on that the Xbox One featured just a single geometry engine). The Polaris RX 480 features up to four, and given the doubling of the CUs, it’d make a decent amount of sense for the PlayStation 4 Pro to receive similar treatment too.
From Cerny’s comments, it appears the PS4 Pro’s GPU is better at prioritising workloads and optimizing their scheduling across the various compute units; in theory this minimizes stalls in the graphics pipeline and ensures workloads are better distributed across the shaders.
Another small tweak the PS4 Pro will enjoy (technically it isn’t new to Polaris – it was introduced in third-generation GCN) is an improvement to how hardware scheduling is handled. Originally the hardware was designed to handle up to 8 compute queues per ACE (Asynchronous Compute Engine), whereas Fiji and now Polaris make it possible to virtualize these queues. Essentially, tasks ‘wait’ until there’s an ACE free to take up the queue, which should (in theory) reduce pipeline stalls. This allows the hardware to dispatch a larger number of wavefronts simultaneously, and therefore achieve higher Stream Processor occupancy.
The original PS4 GPU uses a customized version of AMD’s GCN 1.1 (Graphics Core Next) architecture; these custom tweaks were specifically designed to push the GPU’s compute efficiency, such as the bumped-up number of Asynchronous Compute Engines (which help organize and prioritize graphics and compute tasks) and the introduction of the ‘Volatile Bit’, whose job is to reduce the overhead associated with running compute tasks (we first learned about this in an interview with Gamasutra). The Volatile Bit accomplishes this by ‘tagging’ a piece of compute data inside the GPU’s L2 cache, so when it’s read the GPU can simply invalidate that specific line rather than ‘flushing’ the entire L2 cache. Some of these custom tweaks later appeared in AMD’s own discrete desktop roadmaps. Well, Sony has again opted to venture into AMD’s upcoming roadmap, this time to take advantage of half-precision floats.
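As a conceptual model of that behaviour, here’s a toy cache that drops only compute-tagged lines instead of flushing everything; the class is purely illustrative and bears no relation to the real silicon:

```python
# Toy model of the 'Volatile Bit': compute data is tagged on write, and
# invalidation removes only tagged lines -- graphics data stays cached.

class L2Cache:
    def __init__(self):
        self.lines = {}  # address -> (data, volatile_flag)

    def write(self, addr, data, volatile=False):
        self.lines[addr] = (data, volatile)

    def invalidate_volatile(self):
        # Drop only the compute-tagged lines, not the whole cache.
        self.lines = {a: (d, v) for a, (d, v) in self.lines.items() if not v}

cache = L2Cache()
cache.write(0x100, "graphics texels")
cache.write(0x200, "compute scratch", volatile=True)
cache.invalidate_volatile()
print([hex(a) for a in cache.lines])  # ['0x100'] -- only graphics survives
```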
I’ll let Mark Cerny have the first word: “To date, with the AMD architectures, a half-float would take the same internal space as a full 32-bit float. There hasn’t been much advantage to using them. With Polaris though, it’s possible to place two half-floats side by side in a register, which means if you’re willing to mark which variables in a shader program are fine with 16-bits of storage, you can use twice as many. Annotate your shader program, say which variables are 16-bit, then you’ll use fewer vector registers.”
“Multiple wavefronts running on a CU are a great thing because as one wavefront is going out to load texture or other memory, the other wavefronts can happily do computation. It means your utilisation of vector ALU goes up.”
“Anything you can do to put more wavefronts on a CU is good, to get more running on a CU. There are a limited number of vector registers so if you use fewer vector registers, you can have more wavefronts and then your performance increases, so that’s what native 16-bit support targets. It allows more wavefronts to run at the same time.”
“One of the features appearing for the first time is the handling of 16-bit variables – it’s possible to perform two 16-bit operations at a time instead of one 32-bit operation. In other words, at full floats, we have 4.2 teraflops. With half-floats, it’s now double that, which is to say, 8.4 teraflops in 16-bit computation. This has the potential to radically increase performance.”
As one can imagine, the idea that we’re seeing 8.4 TFLOPS in a console has boggled a lot of people’s minds, and the quote has been bandied around the internet over the past few days. If you were to take it at face value, it means the PS4 Pro is a massive upgrade over the original machine, right? Well, firstly let’s examine what those statements and terms mean before we attempt to figure out what real-world differences they make to the frames and pixels on screen!
If you’ve been following RedGamingTech for any length of time, you’ve probably heard of the successor to AMD’s Polaris architecture, known as Vega. We don’t know all of Vega’s technical specs by any means, but it’s probably a fair assumption that the additional half-precision support comes from Vega, and this half-precision support closely mirrors what Nvidia has been pushing with their Pascal line of graphics cards (particularly the GP100-based cards, which see use in HPC).
While there are multiple value types, for the purposes of this article we’ll stick to the very basics: integers, floating point (single precision), double precision and half precision. Integers are whole numbers (such as ‘1’), while the floating-point types can represent more accurate values because they offer a decimal component. Unfortunately, fully explaining floating-point numbers is a little outside the scope of this article, so we’ll instead give you enough to understand the concept if you’re unfamiliar.
GPU performance is typically quoted as floating-point performance at SINGLE precision. So in the case of the ‘vanilla’ PS4, there’s 1.84 TFLOPS of single-precision performance. The ‘accuracy’ of a standard (32-bit) float is a bit up in the air and varies with specific circumstances, so for the purposes of this article we’ll call it 7 significant decimal digits. The term ‘significant’ means certain rules must be followed – for instance, leading zeros are never significant.
Anyway, half precision (a 16-bit float) does away with some of the accuracy of a 32-bit float, but in theory offers increased execution speed and uses fewer resources (numbers and data can be stored in memory at a smaller cost). So, in the first part of his comment, Mr. Cerny is simply saying that if a task run on the GPU doesn’t need the additional precision of a ‘regular’ 32-bit float, and will operate just fine on a 16-bit half-precision float, developers can earmark that piece of code, telling the PS4 Pro’s GPU “hey, you’re free to execute this as a half-precision calculation” and essentially save resources.
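You can see where half precision starts to give ground with a couple of lines of NumPy – float16’s 11-bit significand means whole numbers above 2048 can no longer all be represented exactly:

```python
import numpy as np

print(np.float16(2048.0))  # 2048.0 -- stored exactly
print(np.float16(2049.0))  # 2048.0 -- rounded; the odd value is lost
print(np.float32(2049.0))  # 2049.0 -- no problem at single precision
```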
Now, unfortunately we don’t know exactly how AMD is accomplishing this, because Vega (if this technology does stem from Vega) hasn’t been released yet; but we can assume it is achieved in a similar manner to how Nvidia accomplished it with their Pascal range of GPUs. “GPUs have used half-precision for at least a dozen years as a storage mechanism to save space — for textures — but we’ve never built an arithmetic pipeline to implement the 16-bit floating point directly, we’ve always converted it,” Lars Nyland, a senior architect at Nvidia, explained during a presentation. “What we’ve done is left it in its native size and then pair it together and execute an instruction on a pair of values every clock – this is compared to the single-precision where we execute one instruction every clock and the double-precision runs at one every two clocks.”
So, we can presume that if AMD has taken a similar path with Vega as Nvidia did with Pascal, two half-precision instructions on the PS4 Pro will simply be ‘bundled’ together and executed in tandem in the same space of time.
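The packing itself is easy to demonstrate: two IEEE 754 half floats occupy the same four bytes as a single 32-bit float, which is exactly what lets a paired FP16 pipeline double throughput per register:

```python
import struct

a, b = 1.5, -0.25                    # both exactly representable in FP16
packed = struct.pack('<ee', a, b)    # 'e' is the IEEE 754 half format
print(len(packed))                   # 4 -- same footprint as one float32
print(struct.unpack('<ee', packed))  # (1.5, -0.25)
```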
One ‘issue’ with GPUs is latency, which comes with their highly parallel nature. Both Nvidia and AMD have crafted their own solutions for this; for the purposes of this article I’ll focus on AMD, which uses wavefronts. At its most basic, a wavefront is a group of ‘threads’ which can be executed on the GPU, and each wavefront is comprised of up to 64 threads. Modern GPUs can have thousands (tens of thousands) of wavefronts occurring in very short spaces of time. For example, AMD points out that the Radeon HD 7970 (which features 32 compute units) can have up to 81,920 work items in flight simultaneously: in this design, each CU can have ‘40 WaveFronts in flight, each potentially from different work-groups or kernels’.
In a nutshell, this means the GPU doesn’t necessarily need the ‘threads’ to be the same across all of the wavefronts. A very simple real-world analogy: a single wavefront (because it’s a GROUP of instructions) could contain your shopping list and your diary for tomorrow if there’s space, or just your shopping list (there’s latency, thread counts and scheduling to take into account, but that’s the simple way to think of it).
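The 81,920 figure above is simple multiplication:

```python
compute_units = 32            # Radeon HD 7970
wavefronts_per_cu = 40        # wavefronts in flight per CU
threads_per_wavefront = 64
print(compute_units * wavefronts_per_cu * threads_per_wavefront)  # 81920
```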
Okay – so now we understand what half-precision floats are, and how Cerny explains they ‘run’ on the PS4 Pro’s GPU – but does it really make any difference? Well, as much of a cop-out answer as it is – it depends. Clearly for Mantis Burn Racing there has been quite the impact, allowing the developers to push the game to native 4K on the PS4 Pro. But a few obvious questions arise, not least of which is what percentage of an average frame’s workload can be pushed into 16-bit half-precision floats, and in which instances that isn’t good enough.
In massive, open worlds, where large distances are common and projectiles can travel hundreds (or thousands) of in-game units, lower precision can be a problem (though there are methods to work around this, depending very much on the game). Certain types of AI (such as deep learning) do just fine on FP16, along with certain types of image processing and data dealing with high dynamic range, color correction and other such tasks. So this (at least in theory) could mean certain physics, collision detection or image processing can benefit – but ultimately it depends on the game, on a case-by-case basis.
In fact, there’s quite a strong argument for pushing 16-bit floats into HDR rendering (which Cerny and Sony have been promoting for the PS4 Pro). In a nutshell, High Dynamic Range is the ability of a screen to more accurately represent a scene by providing a wider gamut of colors and levels of brightness and darkness. If you need a simple visual example, think of when you’ve tried to take a photo of, say, the moon and stars at night on a regular phone camera, and found the results a little… disappointing. Typically, for non-HDR screens, a process called “tone mapping” takes the values from the system and maps them to what your display is capable of. This tone-mapped output is clamped between 0.0 and 1.0, so in theory a 16-bit RGB float is sufficient to represent the ranges within HDR outputs.
A half-precision float has the following available to it: a sign bit (is it a positive or a negative value?), the exponent (5 bits) and finally the significand (11 bits of precision, 10 explicitly stored). The exponent can be thought of as the range of the number, while the significand is its ‘accuracy’. What this means is that, in theory, bloom, blurs, HDR, depth of field and so on are good candidates on the graphics side – and, as mentioned, possibly AI too.
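Pulling a half float apart into those fields takes only a few lines (Python’s struct module supports the IEEE 754 half format directly):

```python
import struct

def decode_half(value: float):
    """Split a half float into sign, exponent and stored significand."""
    (bits,) = struct.unpack('<H', struct.pack('<e', value))
    sign = bits >> 15               # 1 bit
    exponent = (bits >> 10) & 0x1F  # 5 bits, biased by 15
    significand = bits & 0x3FF      # 10 explicitly stored bits
    return sign, exponent, significand

print(decode_half(1.0))   # (0, 15, 0)
print(decode_half(-2.5))  # (1, 16, 256) -- i.e. -1.25 * 2^(16-15)
```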
Cerny’s claim is simple – where half precision is ‘enough’, two operations can be pushed through the register simultaneously, and thus you can get data through the GPU much faster.
While the figure of 8.4 TFLOPS sounds impressive on paper (or on a computer screen), the real-world performance benefits will come down to the games developers are crafting, whether they’re willing to put in the work (in other words, flagging certain operations for half precision) and, of course, whether half precision is ‘good enough’ for that particular task. Essentially, the lack of precision can round a number up or down, which can lead to accuracy issues.
In the next part of this series we’ll delve into the PS4 Pro’s checkerboard rendering, 4K upscaling and improvements in anti-aliasing, and finish off our initial thoughts and analysis.