For quite some time, GPUs have done much more than simply push pixels around the screen. The fixed-function pipelines of the late '90s are long behind us, and GPUs are now used for a variety of non-graphics tasks, such as physics, collision detection and even AI. But there is still a long way to go, with the older APIs hampering just what GPUs can do, according to AMD.
More console-like, low-level APIs aren't too far away, though. Gamers should be getting their hands on the first wave of DirectX 12 games by the end of the year, and who can forget Vulkan? Vulkan is being authored by the Khronos Group (who are responsible for the various iterations of OpenGL) and is based on AMD's Mantle API.
The problem with DirectX 11 (as you probably know) is that it's a heavily abstracted API, and its multi-threading support is pretty awful (from both a GPU and a CPU point of view). Multi-core CPU performance under DX11 (for draw calls, for example) often has very little advantage over a single core. When we tested DX11 against DX12 with Futuremark's API Overhead test, Direct3D 12 showed itself to be about 17 times faster at issuing draw calls. DX11 is a serial API: it handles one thing at a time, in a pre-determined order.
Because compute in gaming is becoming increasingly important, AMD believes that the next-generation APIs will have a drastic impact on how GPUs are used; they've been built to allow developers to schedule compute work more easily. AMD believes this will let developers take advantage of the Asynchronous Compute Engines (ACEs) which are part of the GCN architecture.
For a bit of history, the original HD 7900 series debuted back in 2012 with 2 ACEs per GPU (for example, the Radeon HD 7970), while the more modern R9 290X (review and architecture analysis here) sports 8 ACEs, each managing up to 8 queues. For a point of reference, the Xbox One has 2 ACEs (8 queues each) and the PS4 has 8 ACEs (once again, 8 queues each – we'll talk about why that's important in a minute).
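A quick bit of arithmetic on the queue counts above (the HD 7970's per-ACE queue depth isn't stated in AMD's material, so it's left out):

```python
# Tallying the compute queues cited above: ACEs per GPU times 8 queues per ACE.
aces = {"R9 290X": 8, "Xbox One": 2, "PS4": 8}
queues_per_ace = 8

totals = {gpu: n * queues_per_ace for gpu, n in aces.items()}
print(totals)  # the R9 290X and PS4 each expose 64 compute queues; the Xbox One 16
```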
The ACE system was designed to run tasks in parallel, with a limited ability to run tasks out of order (similar to how a CPU operates) – at least in hardware. Unfortunately, the ACEs weren't very useful under the DX11 API (because the API was incapable of running compute and graphics simultaneously), but despite this AMD decided to add the extra ACEs anyway, believing they'd be useful in the future.
Typically, the GCP (Graphics Command Processor) took on the responsibility of handling graphics tasks, and had free rein over the GPU. It accesses the shaders (the hundreds of little processors on the GPU – in the GCN architecture, each Compute Unit houses 64, and you might find, say, 32 CUs in a GPU), the ROPs and the geometry hardware. It's responsible for taking instructions from the CPU and assigning tasks to the GPU.
The ACEs, on the other hand, are different: they only get access to the shaders. Things are changing now, though, and they can be used to execute compute shaders too. If you're unfamiliar with the term, compute shaders can be used for a large variety of tasks – for example, calculating lighting, particle systems and more. They also allow rendered materials to be deformed or altered – for example, clothing, or realistic water surfaces that move believably if you, say, shoot an arrow into a pond or step into a puddle. Ubisoft did a presentation on compute shaders for both the PS4 and the X1, which we analyzed here if you want more info. Furthermore, not only can the ACEs process multiple command streams in parallel, they also don't have to wait for one task to complete before issuing a new command.
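To make the compute-shader idea concrete, here's a toy, CPU-side sketch of what one conceptually does: the same small kernel runs once per element (here, per particle), with no dependence between elements – which is exactly what lets the GPU spread the work across its hundreds of shader cores. The numbers and kernel are invented for illustration, not taken from any real engine:

```python
# One "thread" of a compute dispatch: advance a single particle by dt seconds.
def particle_kernel(p, dt, gravity=-9.8):
    x, y, vx, vy = p
    vy += gravity * dt                      # gravity pulls the vertical velocity down
    return (x + vx * dt, y + vy * dt, vx, vy)

dt = 1.0 / 60.0                             # one frame at 60 FPS
particles = [(0.0, 10.0, 1.0, 0.0), (5.0, 2.0, -1.0, 3.0)]

# On the GPU, every particle would be processed in parallel; on the CPU we just map.
particles = [particle_kernel(p, dt) for p in particles]
```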
As a slight side note, you might remember from Microsoft's own DirectX 12 presentation that the GCP can now be overwhelmed by the CPU's command instructions too. In other words, with DX12 you can run into one of two scenarios – CPU bound (the CPU simply can't issue any more draw calls) or GCP bound (the GCP is working flat out and cannot accept any more instructions).
As many know, Microsoft and Sony both opted to use AMD's architecture as the basis of their consoles (the APUs inside both machines use the same Jaguar CPU and a very similar GCN graphics architecture, with a few differences).
These features are already more widely used on the consoles – AMD states the PlayStation 4 is running asynchronous shaders in a variety of its games, including the rather impressive-looking Infamous: Second Son. When we conducted our analysis of the Second Son engine (Part 1 – CPU & RAM | Part 2 – Graphics & Compute Scheduling), it was clear that Sucker Punch were indeed interleaving compute tasks with pixel tasks.
This allowed Sucker Punch to push a staggering number of particles – 30K particles per frame of animation on average, though their engine can handle up to 120K. Considering the main character (Delsin) uses smoke and various neon powers throughout the game, this is vital.
Thief, when using the Mantle API, is currently the only PC title to use asynchronous shaders, according to AMD (though to what degree is unknown). The Xbox One wasn't mentioned in the slides – so it's possible that the DX11-based API inside the system doesn't allow it, but things could certainly change when the console is updated to DX12.
Now the GPU will be able to run multiple tasks simultaneously: while it's drawing textures (graphics work), it might also be calculating lighting or applying a post-process effect.
Tasks are sent to the GPU in a command stream (created by merging a set of individual commands, each with its own queue), which the shaders then need to execute. With DirectX 12, things are a bit different – it offers a new merging method (asynchronous shaders) which, as the name implies, is asynchronous: it supports multiple graphics threads simultaneously, along with both pre-emption and prioritization. Because tasks are better arranged, you'll have fewer 'gaps' in the queue – less time where the GPU isn't fully utilized (so, in theory, fewer instances where the GPU's shaders are only, say, 80 percent busy). There's also a priority system in place – so if an unscheduled task pops up that is super important and will only take a moment to execute, it'll jump ahead of longer, less important work.
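The prioritization idea can be modeled with an ordinary priority queue – a deliberately simplified sketch, not how the driver or hardware actually implements it, and the task names are invented:

```python
import heapq

# Lower number = higher priority. Two long, low-priority jobs are queued first;
# a short urgent task submitted last still pops off the queue ahead of them.
queue = []
heapq.heappush(queue, (2, "long shadow-map pass"))
heapq.heappush(queue, (2, "post-process chain"))
heapq.heappush(queue, (0, "urgent VR time-warp"))  # arrives last, runs first

order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
print(order)  # the urgent task jumps the queue
```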
In theory, this means asynchronous shaders can be used with minimal performance impact compared to traditional (DX11) methods. For evidence, AMD would likely point you to its LiquidVR SDK demo, where it says asynchronous shaders improved performance by 46 percent.
The demo scene hit 245 frames per second with both asynchronous shaders and post-processing disabled. The frame rate plummeted to just 158 frames per second when post-processing was enabled, but with post-processing and asynchronous shaders together, it climbed back up to 230 FPS – just a 6 percent loss. For reference, post-processing is a blanket catch-all term that can mean anything from various forms of anti-aliasing to motion blur, plus various lighting and other shading effects.
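Those quoted figures check out with a quick bit of arithmetic:

```python
# Sanity-checking the numbers from AMD's LiquidVR demo.
baseline = 245   # FPS: post-processing and async shaders both off
pp_only  = 158   # FPS: post-processing on, async shaders off
pp_async = 230   # FPS: post-processing on, async shaders on

loss_sync  = (baseline - pp_only) / baseline    # ~36% slower without async
loss_async = (baseline - pp_async) / baseline   # ~6% slower with async
gain = pp_async / pp_only                       # ~1.46, matching AMD's quoted 46% uplift
print(round(loss_sync * 100), round(loss_async * 100), round(gain, 2))
```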
The longer a frame takes to render, the fewer frames can be shown in any one second. For example, 60 FPS (the figure most gamers feel is acceptable) leaves the system just 16.67 milliseconds to display each and every frame of animation. Obviously, if a gamer wants to target, say, 120 FPS (to better take advantage of newer display technologies), the system has just half that time – about 8.33 ms.
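The frame-time budget is just the reciprocal of the target frame rate:

```python
# Milliseconds available per frame at a given target frame rate.
def frame_budget_ms(fps):
    return 1000.0 / fps

print(round(frame_budget_ms(60), 2))   # 16.67 ms at 60 FPS
print(round(frame_budget_ms(120), 2))  # 8.33 ms at 120 FPS
```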
Technically, GPUs do already work on multiple threads at once – it's the key to their massive parallel throughput. But if you read any GPU computing whitepaper you'll spot one thing: a lot of references to latency, and to masking latency. So yes, while a lot of threads can run at once, they take quite a bit of time to actually complete – they must be set up and carefully managed.
This is the beauty of having multiple work queues – the GPU has additional pools of work it can pick and choose from. It's still not perfect – but it's a heck of a lot better than what we currently have. This will help mask and hide the overhead that comes from context switching.
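A crude model of that gap-filling, with made-up timings: the graphics queue leaves idle slots (say, waiting on memory), and an independent compute queue's work can occupy them, raising overall utilization:

```python
# Each tuple is (task, time units). "stall" slots are idle time in the graphics queue.
graphics = [("draw", 4), ("stall", 2), ("draw", 3), ("stall", 1)]
compute  = [("particles", 2), ("lighting", 1)]

busy = sum(t for name, t in graphics if name != "stall")
idle = sum(t for name, t in graphics if name == "stall")
util_single = busy / (busy + idle)              # graphics queue alone

# Async case: independent compute work is slotted into the stalls.
filled = min(idle, sum(t for _, t in compute))
util_async = (busy + filled) / (busy + idle)

print(util_single, util_async)  # utilization rises once the gaps are filled
```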
GCN supports one graphics context in parallel with multiple compute contexts, the number of which is determined by the number of ACEs. It's worth noting that the Xbox One, for instance, can handle up to 48 contexts (we know this from the SDK leak – check it out if you want a deeper understanding of what a context is and how it works on consoles). A graphics or compute context holds the basic attributes of a drawing (or compute) job – such as how big something is, which colors to use, and so on.
Multiple popular game engines already support asynchronous shaders, including Unreal Engine and CryEngine – a very good thing when you consider the number of developers who license these engines instead of building their own. It should mean a lot less work for those developers, and (in theory) greater adoption of asynchronous shaders, which would obviously make AMD rather happy.
Keeping things PC-focused for a moment – it's tricky to predict the future, but honestly both DX12 and Vulkan are going to have a massive impact on the PC gaming scene. And it isn't just the new APIs we've got to think of – there's a whole lot of new hardware too. 4K monitors are getting cheaper, 1440p screens with high refresh rates are becoming increasingly common – oh, and one must also consider FreeSync and G-Sync (and higher-refresh-rate screen technology in general).
Furthermore, there's VR – and AMD, Valve and Oculus are just a few of the companies pushing the technology. The LiquidVR SDK is designed to reduce latency for VR, as latency is VR's mortal enemy (as anyone fortunate enough to have tried VR will tell you).
In terms of graphics hardware, DX12 is also rumored to allow you to better mix and match GPU configurations (possibly even running a GeForce and a Radeon GPU together). And that's not counting technology such as High Bandwidth Memory (HBM), which is (supposedly) set to be introduced with AMD's R9 390X cards.
Switching to the console side of things – the PS4 is already taking advantage of asynchronous shaders (well, a handful of games are, at least), but the Xbox One seemingly isn't. It's possible that frame rates for console games that use a lot of post-process effects could therefore rise in the future. It'll likely benefit the PC too – once DX12 asynchronous shaders work on the X1, porting them to the PC won't be too much of a chore.
It'll be an interesting 12 months in the gaming industry – and it'll be fascinating to see how developers choose to leverage these options on the PC, the Xbox One and the PS4.