DX12 Asynchronous Shader & Command Buffer Analysis | AMD Exclusive Interview & Details


With Windows 10’s release drawing ever closer, DirectX 12 news continues to arrive thick and fast. A few days ago, AMD revealed that Asynchronous Shaders (part of the DirectX 12 spec), together with the Asynchronous Compute Engines of the Graphics Core Next (GCN) architecture powering Radeon graphics cards, can drastically improve the performance of DX12 games by 46%. Since then, team red have revealed further information on how this works, along with additional details on how multi-threading functions in D3D12.

Asynchronous Shaders Explained

Essentially, a modern GPU is a collection of hundreds of processors working together to perform a task. In the case of AMD’s GCN architecture, 64 shaders (along with cache and other components) form a single Compute Unit (CU), and many CUs form the basis of a single GPU. The R9 290X, for example, contains 44 Compute Units, for a total of 2,816 shaders on the GPU. These shaders must be fed data, and this is handled either by the single Graphics Command Processor (GCP) or by the Asynchronous Compute Engines (ACEs).

There are some subtle variances between GPU families – for example, some of AMD’s GPUs have a greater number of ACEs, and the Xbox One is confirmed to contain two GCPs (the second most likely being reserved for OS usage rather than for games).

The Graphics Command Processor handles work from the graphics queue (in other words, what the game wants drawn on screen), while the ACEs handle compute queues. A greater number of ACEs allows the GPU to process more compute tasks simultaneously, increasing the efficiency of the graphics card and therefore its performance.
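To make the split between the two queue types a little more concrete, here is a minimal sketch of how a D3D12 application might create a graphics (direct) queue alongside a dedicated compute queue. The `device` pointer and error handling are assumptions for the sake of illustration; this isn’t taken from AMD’s materials, it simply shows the API surface involved.

```cpp
#include <d3d12.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// Sketch: one direct (graphics) queue and one compute queue on the same device.
// On GCN hardware, work submitted to the compute queue can be picked up by the
// Asynchronous Compute Engines while the GCP works through the graphics queue.
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& graphicsQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;      // graphics, compute and copy work
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&graphicsQueue));

    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE; // compute and copy work only
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));
}
```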

AMD R9 290X “Hawaii” block diagram

The biggest enemy of GPUs is latency – in other words, ‘gaps’ in the shader pipeline where the GPU is not doing anything. Developers have struggled with GPU latency for some time now, but really the blame wasn’t on the GPU so much as on the API (in this case, DirectX 11), which wasn’t good at handling tasks in parallel (at the same time).

Just as D3D11 wasn’t very good at multi-threading on CPUs, it also wasn’t very good at thinking in parallel for GPUs, which is pretty insane considering parallelism is the GPU’s strongest point. Instead, commands would be processed in a serial fashion, one at a time. The ACEs could attempt to schedule work even while the GPU was still handling a rendering task, and eventually that work would be carried out.

Problems arise when you consider that certain tasks have a higher ‘priority’ than others, and work already in progress couldn’t be interrupted. Developers therefore started to push towards pre-emption, which allows tasks with a higher priority (set manually or automatically) to be processed ‘first’, while less time-sensitive tasks are forced to wait until that work is completed. GPUs handle this by means of context switching. Pre-emption is often better than an ‘every task for itself’ approach, but that’s not to say it doesn’t have inherent problems.

Because DirectX 11 is essentially serial in its thinking, pre-emption can cause a lot of idle time while context switching occurs (a context switch, at its most basic, is the processor saving the results of one task and switching to another to begin processing it). This time when the GPU isn’t processing data is essentially wasted performance, and it can also create stutter (frame rate or frame time variance).
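For reference, DirectX 12 exposes this idea of priority directly on the command queue itself. The snippet below is a small, hedged sketch (reusing the hypothetical `device` from the earlier example) showing how an application could request a higher-priority compute queue rather than relying purely on driver-side pre-emption.

```cpp
// Sketch: requesting a higher scheduling priority for a compute queue.
// How this is honoured is ultimately down to the driver and GPU.
D3D12_COMMAND_QUEUE_DESC hiPriDesc = {};
hiPriDesc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
hiPriDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH; // time-sensitive work

ComPtr<ID3D12CommandQueue> highPriorityQueue;
device->CreateCommandQueue(&hiPriDesc, IID_PPV_ARGS(&highPriorityQueue));
```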

Async Scheduling on AMD hardware using DirectX 12. Little to no latency.

AMD therefore believes that DirectX 12 and Async Shaders counter this “by interleaving these tasks across multiple threads to shorten overall render time.” Async Shaders are part of DX12’s spec, and AMD’s ACEs are excellent at leveraging them, segmenting the workload efficiently across the GPU.

According to AMD’s Robert Hallock: “A developer doesn’t need to change the way they write their shaders to use AS [Asynchronous Shaders], so it’s relatively easy to extract gains on AMD hardware. It’s part of the core DX12 spec, so it’s not even something that needs to be specifically added to an engine. You support DX12, you have it.”

If you remember the first part of our analysis, we discussed how console games such as Infamous Second Son were already taking advantage of this, and PC titles such as Thief were slowly starting to as well. Robert points out: “This is one of many cases where the consoles are improving the performance and flexibility of the PC.”

Robert continues his explanation: “Graphics rendering tasks naturally have gaps or ‘bubbles’ in the pipeline. AS fills those bubbles with compute. Instead of having to wait for a graphics task to end to do the compute work, or having to pause graphics to work on compute. This reduces scene rendering time, and increases GPU utilization by activating resources that would have been dormant in DX11.”
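As a rough illustration of what ‘filling the bubbles’ looks like from the application side, the sketch below submits a compute workload to the compute queue while the graphics queue is still busy, and only synchronises at the point where the graphics work actually depends on the compute results. The command lists, fence and fence value are hypothetical names assumed to be set up elsewhere.

```cpp
// Sketch: graphics and compute submitted to separate queues so the GPU can
// overlap them, with a fence for the one dependency we actually care about.
ID3D12CommandList* gfxLists[]     = { gfxList.Get() };
ID3D12CommandList* computeLists[] = { computeList.Get() };

graphicsQueue->ExecuteCommandLists(1, gfxLists);     // rendering work
computeQueue->ExecuteCommandLists(1, computeLists);  // can fill idle shader time

// Synchronise only where graphics consumes the compute output,
// instead of forcing a full stop and context switch.
computeQueue->Signal(fence.Get(), ++fenceValue);
graphicsQueue->Wait(fence.Get(), fenceValue);
```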



DX12 Multi-Threaded Command Buffer Recording Explained

Before we get into the benefits of DirectX 12’s Command Buffer Recording, let’s first establish what a Command Buffer is. In a nutshell, it holds the commands that the game asks the CPU to prepare before they are sent off to the GPU as a series of instructions and draw calls. These commands must be executed in a specific order and must be synchronized so that the CPU and GPU are seeing the correct data.

This is especially true when the GPU is being asked to handle a lot of different processes or applications, where each application has its own private memory (and with that, its own data) that’s ring-fenced from the other applications. Commands can be a lot of different things, everything from memory functions (such as copying data) to something like “draw a box, now make it yellow”.
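To make that “draw a box, now make it yellow” idea concrete, here’s a minimal, hypothetical D3D12 command list being recorded on the CPU. The allocator, pipeline state, render-target handle and vertex data are assumed to exist already; the point is simply that a command buffer is a recorded list of instructions handed to a queue.

```cpp
// Sketch: recording a tiny command buffer, then submitting it to the GPU.
const float yellow[4] = { 1.0f, 1.0f, 0.0f, 1.0f };

commandList->Reset(allocator.Get(), pipelineState.Get());
commandList->OMSetRenderTargets(1, &rtvHandle, FALSE, nullptr);
commandList->ClearRenderTargetView(rtvHandle, yellow, 0, nullptr); // clear the target to yellow
commandList->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
commandList->DrawInstanced(36, 1, 0, 0);                           // draw a 36-vertex box (12 triangles)
commandList->Close();

ID3D12CommandList* lists[] = { commandList.Get() };
graphicsQueue->ExecuteCommandLists(1, lists);
```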

If you’ve been following along with D3D12’s development for some time, you’ll know one of the biggest improvements the API brings is improved multi-threading. Back in the late 90s and early 2000s, processor improvements were largely a result of better IPC (Instructions Per Clock, or how much a CPU could do each clock cycle), increased clock speed (how fast the CPU ran) and other architectural improvements such as more cache. This is no longer the case, and for a multitude of reasons (such as power and heat) it’s no longer practical to crank out CPUs with ever higher clocks (Intel tried and failed miserably with the Pentium 4, which had a long pipeline to accommodate the higher clocks; Intel eventually made the best of a bad situation by leveraging that long execution pipeline and making Hyper-Threading a ‘thing’).

DX11 Command Buffer, notice high usage of only a single core

So, modern CPUs come with an increased number of cores; how many depends on the manufacturer and SKU. For instance, Intel’s Skylake processors are all but confirmed to retain four CPU cores (with Hyper-Threading) for the higher-end parts such as the 6700K, the same count that was present back in the Nehalem architecture (the Core i7 920, for example). There are CPUs which break this mold, such as the 5820K, but they’re not the bulk of Intel’s focus currently. AMD’s CPUs, such as the FX range, have firmly embraced multi-core designs, and Zen is no different, with leaked roadmaps pointing to eight cores for AMD’s next-generation desktop parts (and up to 32 cores for servers).

These additional cores allow multi-threading in creative or ‘work’ applications such as Photoshop, WinRAR or, dare I say, Excel, and performance goes through the roof compared to fewer cores (if you’ve a decent BIOS and fancy experimenting, restart your computer – after you’ve shared this article with your friends, of course – disable a few CPU cores and run a benchmark, then re-enable the cores and run it again, and you’ll see what I mean). Windows (and other operating systems) have vastly improved their multi-threading capability version over version, and while it’s out of the scope of this article, it’s improved a lot faster than DirectX.

DX12 Command Buffer behavior

DirectX 11’s multi-threading is considerably poorer, and has a few main issues. The first is that it doesn’t distribute work across multiple CPU cores, so you’ll often have a single core (typically the first core, core 0) carrying a massive amount of the workload (often at or close to 100 percent) while the other cores aren’t doing much. This is because D3D11 isn’t able to think in parallel and can’t break the game’s Command Buffer down into ‘chunks’ to run across numerous cores. Thus, the other CPU cores are often performing considerably less work, and you’re leaving performance on the table.

Compounding the problem, the driver and API are extremely ‘heavy’ on the CPU (known as overhead) because of a large amount of abstraction. So not only can the game (application) only really run effectively on one core, but a lot of that core’s performance is being spent on the API and driver, so performance tanks.

DirectX 12 changes this – considerably. DirectX 12’s Command Buffers were built with parallelism in mind. The first change is the driver overhead: because the API and drivers are now lighter, less CPU time is spent processing them, increasing performance.

Command Buffer Recording, according to AMD’s Robert Hallock (in an exclusive interview), is “a specific feature of DX12/Mantle/Vulkan that allows the CPU to accept & dispatch command buffer submissions on all cores.”

Legacy Command Buffer flow

“DX11 is pretty single-threaded, and there’s high overhead at that, so it makes the CPU a bottleneck for no good reason. DX12’s major advancements are to widen CPU parallelism, and to keep the GPU busy more often (async shaders) for higher FPS overall.”

To put it simply, each CPU core is now offered the ability to speak to the GPU, and this scales with the number of logical cores available in your machine. Latency is the key here – the longer the CPU takes to send instructions, or the longer the GPU takes to process them, the fewer frames per second you can enjoy as a gamer. We can get an idea of how this comes together by looking at the number of draw calls being issued in our API overhead testing results.
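In practice, ‘every core speaking to the GPU’ means each worker thread records its own command list with its own command allocator, and the results are submitted together. The sketch below is a simplified, hypothetical example of that pattern; `RecordChunk` is a placeholder for whatever slice of the scene a given thread is responsible for.

```cpp
#include <thread>
#include <vector>

// Hypothetical per-thread recording function: fills one command list with
// that thread's slice of the frame.
void RecordChunk(ID3D12GraphicsCommandList* list, size_t slice);

// Sketch: multi-threaded command buffer recording. Each thread owns one
// allocator + command list pair, records its slice of the frame, and the
// main thread submits everything in a single call.
void BuildFrame(ID3D12CommandQueue* graphicsQueue,
                std::vector<ComPtr<ID3D12GraphicsCommandList>>& lists,
                std::vector<ComPtr<ID3D12CommandAllocator>>& allocators)
{
    std::vector<std::thread> workers;
    for (size_t i = 0; i < lists.size(); ++i)
    {
        workers.emplace_back([&, i]
        {
            lists[i]->Reset(allocators[i].Get(), nullptr);
            RecordChunk(lists[i].Get(), i);
            lists[i]->Close();
        });
    }
    for (auto& w : workers) w.join();

    // One submission containing every recorded list.
    std::vector<ID3D12CommandList*> raw;
    for (auto& l : lists) raw.push_back(l.Get());
    graphicsQueue->ExecuteCommandLists(static_cast<UINT>(raw.size()), raw.data());
}
```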

With virtual reality extremely sensitive to latency, the move to eliminate it isn’t surprising – AMD have also developed LiquidVR to help, and of course at Build 2015 we learned that developers can leverage multi-vendor GPUs to process different graphics tasks, including having, say, an Intel IGP handle VR warping while an Nvidia GPU handles complex 3D graphics tasks. Check out our analysis of DX12 at Build here.