The PlayStation 4 and the Radeon R9 290X (part of AMD’s Volcanic Islands family) may share quite a few similarities, particularly in the way their Asynchronous Compute Engines (ACEs) function. Both the PlayStation 4 and AMD’s soon-to-be-released Volcanic Islands GPU feature 8 ACEs, rather than the 2 featured in the vanilla Sea Islands GPU (which is effectively what the PS4 and Xbox One GPUs are built around).
Each ACE in both GPUs can handle 8 compute queues, which gives a total of 64 sources of compute commands (8 compute queues * 8 ACEs = 64). We know that Sony believes compute is going to be incredibly important in the next generation of graphics, the basic premise being that CPUs aren’t well suited to certain tasks. The reason is that, broadly speaking, a CPU core processes one command at a time. Modern GPUs, on the other hand, work in a SIMD (Single Instruction, Multiple Data) fashion. Hundreds (or in some cases thousands) of processors (known as SPs, or Stream Processors, on AMD’s GPUs, and CUDA cores on Nvidia’s GPUs) apply the same instruction to many pieces of data at once.
What are the PS4’s / AMD’s Asynchronous Compute Engines and how do they work?
The original GCN (Graphics Core Next) architecture uses 2 ACE units, each of which can handle 2 compute queues (giving a total of 4: 2 Asynchronous Compute Engines * 2 compute queues = 4). This architecture was used for AMD’s previous generation of cards, including the AMD Radeon HD 7970 and the later Radeon R9 280X. Sony worked very closely with AMD to beef this up for the PlayStation 4, with the intention of offloading much of the work from the CPU to the GPU. Let’s start by explaining how all of this works:
The Asynchronous Compute Engine basically tells data where to go, queuing up commands and then dispatching them when the GPU has the spare cycles to process them.
Imagine yourself driving on a highway with hundreds of lanes, and ahead of you is a row of toll booths. You can’t see which booths are free from where you are (you’re stuck behind large trucks and so on, after all), so you have to rely on signs to tell you which lane to take. The more ACEs there are (in this case, electronic signs) to tell you which lane to go to, the faster you’ll be sent to a lane whose toll booth is empty (or at least as free as possible).
So in other words, the ACE will accept work and then dispatch it to a CU (compute unit) for processing when that CU’s resources are freed up. The task of the ACE is to figure out the priority of the work; in other words, to ensure that processing a bit of compute data won’t negatively affect the frame rate of the title.
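The accept-then-dispatch behaviour described above can be sketched as a toy scheduler. This is only an illustration, not a model of the real hardware's arbitration: the names `Task`, `ToyACE` and `free_cus`, the workload names, and the "lower number is more urgent" priority rule are all assumptions made for the example. Tasks wait in a priority queue and are handed out only when a compute unit is free.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class Task:
    priority: int                      # assumption: lower number = more urgent
    seq: int                           # tie-breaker: submission order
    name: str = field(compare=False)   # not used when ordering tasks


class ToyACE:
    """Toy model of one compute engine: queue work, dispatch when a CU is free."""

    def __init__(self):
        self._queue = []
        self._seq = count()

    def submit(self, name, priority):
        heapq.heappush(self._queue, Task(priority, next(self._seq), name))

    def dispatch(self, free_cus):
        """Hand out at most `free_cus` tasks, most urgent first."""
        issued = []
        while self._queue and len(issued) < free_cus:
            issued.append(heapq.heappop(self._queue).name)
        return issued


ace = ToyACE()
ace.submit("cloth_sim", priority=2)
ace.submit("audio_raycast", priority=1)
ace.submit("particle_update", priority=2)

first_wave = ace.dispatch(free_cus=2)   # only two CUs are idle right now
second_wave = ace.dispatch(free_cus=4)  # more CUs free up later
print(first_wave, second_wave)
```

With only two free CUs, the most urgent task goes first and the remaining work simply stays queued until capacity appears, which is the "sign pointing you to a free toll booth" idea from the analogy.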
Just to put this into some perspective, the PlayStation 4 has 1152 of these lanes (Stream Processors), with 64 ‘signs’ telling the cars where to go. As a comparison, the Xbox One has 768 ‘lanes’ (Stream Processors), with 16 ‘signs’ to tell the traffic which way to flow. For those with a love of math, that’s 18 lanes (Stream Processors) per compute queue for the PS4, versus 48 per compute queue for the Xbox One.
Mark Cerny, the lead architect behind the PlayStation 4, has spoken at length about some of the changes that he and his team made to the PS4’s hardware. Below is a direct quote from Mark Cerny:
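For anyone who wants to check the numbers, here is the same arithmetic as a few lines of Python. The figures are the ones quoted in this article; the helper name `sps_per_queue` is just made up for the example.

```python
def sps_per_queue(stream_processors, compute_queues):
    """Stream Processors ('lanes') available per compute queue ('sign')."""
    return stream_processors / compute_queues


ps4_queues = 8 * 8                       # 8 ACEs x 8 compute queues each
ps4_ratio = sps_per_queue(1152, ps4_queues)
xbox_one_ratio = sps_per_queue(768, 16)  # 16 queues, as quoted above

print(ps4_queues, ps4_ratio, xbox_one_ratio)  # 64 18.0 48.0
```

A lower number means each queue has fewer lanes to look after, that is, more signs per stretch of highway.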
- “First, we added another bus to the GPU that allows it to read directly from system memory or write directly to system memory, bypassing its own L1 and L2 caches. As a result, if the data that’s being passed back and forth between CPU and GPU is small, you don’t have issues with synchronization between them anymore. And by small, I just mean small in next-gen terms. We can pass almost 20 gigabytes a second down that bus. That’s not very small in today’s terms — it’s larger than the PCIe on most PCs!
- “Next, to support the case where you want to use the GPU L2 cache simultaneously for both graphics processing and asynchronous compute, we have added a bit in the tags of the cache lines, we call it the ‘volatile’ bit. You can then selectively mark all accesses by compute as ‘volatile,’ and when it’s time for compute to read from system memory, it can invalidate, selectively, the lines it uses in the L2. When it comes time to write back the results, it can write back selectively the lines that it uses. This innovation allows compute to use the GPU L2 cache and perform the required operations without significantly impacting the graphics operations going on at the same time — in other words, it radically reduces the overhead of running compute and graphics together on the GPU.”
- Thirdly, said Cerny, “The original AMD GCN architecture allowed for one source of graphics commands, and two sources of compute commands. For PS4, we’ve worked with AMD to increase the limit to 64 sources of compute commands — the idea is if you have some asynchronous compute you want to perform, you put commands in one of these 64 queues, and then there are multiple levels of arbitration in the hardware to determine what runs, how it runs, and when it runs, alongside the graphics that’s in the system.”
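To make the ‘volatile’ bit from the second point a little more concrete, here is a toy model of it. It is a sketch only: the real tag layout and cache behaviour are not public in this detail, and the names `CacheLine`, `ToyL2`, `invalidate_volatile` and `write_back_volatile` are invented for the illustration. The point it shows is that compute can drop or flush just its own tagged lines while the lines graphics is using stay in the cache.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    data: int
    volatile: bool = False   # tagged when the access came from compute
    dirty: bool = False      # modified but not yet written to memory


class ToyL2:
    """Toy L2 cache with a per-line 'volatile' tag (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory   # dict: address -> value, stands in for system memory
        self.lines = {}        # dict: address -> CacheLine

    def read(self, addr, volatile=False):
        if addr not in self.lines:   # miss: fetch from system memory
            self.lines[addr] = CacheLine(self.memory[addr], volatile)
        return self.lines[addr].data

    def write(self, addr, value, volatile=False):
        self.lines[addr] = CacheLine(value, volatile, dirty=True)

    def invalidate_volatile(self):
        """Drop only the volatile lines (toy assumes they were written back first)."""
        self.lines = {a: line for a, line in self.lines.items() if not line.volatile}

    def write_back_volatile(self):
        """Flush only the dirty volatile lines to system memory."""
        for addr, line in self.lines.items():
            if line.volatile and line.dirty:
                self.memory[addr] = line.data
                line.dirty = False


memory = {0x10: 1, 0x20: 2}
l2 = ToyL2(memory)
l2.read(0x10)                          # graphics access, not tagged
l2.read(0x20, volatile=True)           # compute access, tagged volatile
memory[0x20] = 99                      # the CPU updates system memory
l2.invalidate_volatile()               # only the compute line is dropped
fresh = l2.read(0x20, volatile=True)   # refetched, so compute sees 99
l2.write(0x20, fresh + 1, volatile=True)
l2.write_back_volatile()               # only compute's result is flushed
print(fresh, memory[0x20], 0x10 in l2.lines)
```

The graphics line at `0x10` survives both operations, which is the "without significantly impacting the graphics operations" part of the quote.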
Volcanic Islands Compute Enhancements – Thanks to Sony?
Sony’s specific requirements for the modifications made to the PlayStation 4 have driven major improvements in the discrete graphics card market too. One of the members of the Beyond3D forum, 3dcgi, works for AMD and has spoken candidly on this subject.
“I worked on modifications to the core graphics blocks. So there were features implemented because of console customers. Some will live on in PC products, some won’t. At least one change was very invasive, some moderately so. That’s all I can say.
Since Cerny mentioned it I’ll comment on the volatile flag. I didn’t know about it until he mentioned it, but looked it up and it really is new and driven by Sony. The extended ACE’s are one of the custom features I’m referring to. Also, just because something exists in the GCN doc that was public for a bit doesn’t mean it wasn’t influenced by the console customers. That’s one positive about working with consoles. Their engineers have specific requirements and really bang on the hardware and suggest new things that result in big or small improvements.”
So according to this member, it was Sony’s vision that changed the development of GPUs for PC as well. If this is true, then it’s certainly very… hell, let’s use the word, cool. It also suggests that the volatile bit was purely a Sony creation: although there were ways to purge the Level 2 cache previously, they weren’t as pinpoint as the method that Mark Cerny and his team created.
“If you look at the portion of the GPU available to compute throughout the frame, it varies dramatically from instant to instant. For example, something like opaque shadow map rendering doesn’t even use a pixel shader, it’s entirely done by vertex shaders and the rasterization hardware — so graphics aren’t using most of the 1.8 teraflops of ALU available in the CUs. Times like that during the game frame are an opportunity to say, ‘Okay, all that compute you wanted to do, turn it up to 11 now.'” – Mark Cerny
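Cerny's point about compute headroom varying through the frame can be illustrated with some made-up numbers. Everything in the `frame` list below (phase names, durations and graphics ALU percentages) is hypothetical and chosen purely for illustration; nothing here is measured PS4 data.

```python
# Hypothetical frame phases: (name, duration in ms, % of ALU used by graphics).
frame = [
    ("shadow_maps", 2, 10),   # mostly vertex and rasterizer work, little ALU
    ("g_buffer", 4, 60),
    ("lighting", 6, 90),
    ("post_fx", 4, 70),
]


def compute_headroom(phases):
    """Return {phase: % of ALU left over for asynchronous compute}."""
    return {name: 100 - gfx_pct for name, _ms, gfx_pct in phases}


headroom = compute_headroom(frame)
best_phase = max(headroom, key=headroom.get)

# Idle ALU time across the frame, in "full-GPU milliseconds".
idle_alu_ms = sum(ms * (100 - gfx_pct) for _name, ms, gfx_pct in frame) / 100

print(headroom, best_phase, idle_alu_ms)
```

In this invented frame, the shadow map pass leaves most of the ALU idle, so that is where queued compute work would be turned "up to 11".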
Sony wanted to facilitate advanced physics and fluid dynamics (smoke, fog, cloth, water and the like), AI and more, and realized that pushing this work to the GPU made sense. The PS4’s GPU has a total of 1.84 TFLOPS of computing power available. The PS4’s CPU has 102 GFLOPS of performance, roughly 1/18th of the GPU. Furthermore, we have to remember that 2 of the 8 cores in the PS4’s AMD Jaguar CPU are allocated to the operating system. We know this from the Killzone: Shadow Fall post-mortem. This means that in reality, only about 75 percent of the PS4’s CPU resources are available to games. Looking ahead, more and more of these workloads will be pushed over to compute. We’ll be seeing much of this with AMD’s TressFX and, of course, the new AMD Mantle API.
AMD have a vision for the future, which they call Fusion: the idea of a CPU and GPU integrated on a single chip. To this end, Sony’s push to increase the number of compute command sources in the PlayStation 4’s GPU must have been just the ticket: improved compute functionality for AMD’s next generation of APUs. Currently, AMD are planning for all future PC processors to eventually be APUs, no doubt with a vastly beefed-up Asynchronous Compute Engine to boot.
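The GPU and CPU figures above can be roughly reproduced from peak-throughput arithmetic. Note that the clock speeds and per-cycle FLOP counts below are not stated in this article; they are assumptions based on widely reported PS4 specifications, so treat this as a back-of-the-envelope check rather than an official breakdown.

```python
GPU_STREAM_PROCESSORS = 1152    # from the article
GPU_CLOCK_MHZ = 800             # assumption: widely reported PS4 GPU clock
FLOPS_PER_SP_PER_CYCLE = 2      # assumption: one fused multiply-add per cycle

CPU_CORES = 8                   # from the article
CPU_CLOCK_MHZ = 1600            # assumption: widely reported Jaguar clock
FLOPS_PER_CORE_PER_CYCLE = 8    # assumption: single-precision peak per core
OS_RESERVED_CORES = 2           # from the article

gpu_mflops = GPU_STREAM_PROCESSORS * FLOPS_PER_SP_PER_CYCLE * GPU_CLOCK_MHZ
cpu_mflops = CPU_CORES * FLOPS_PER_CORE_PER_CYCLE * CPU_CLOCK_MHZ

gpu_to_cpu_ratio = gpu_mflops / cpu_mflops
game_cpu_share = (CPU_CORES - OS_RESERVED_CORES) / CPU_CORES

print(gpu_mflops / 1_000_000, "TFLOPS GPU")   # 1.8432
print(cpu_mflops / 1_000, "GFLOPS CPU")       # 102.4
print(gpu_to_cpu_ratio, game_cpu_share)       # 18.0 0.75
```

Under those assumptions the ratio comes out at 18, which lines up with the "roughly 1/18th" figure, and 6 of 8 cores gives the 75 percent available to games.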