About a week ago, we took a rather in-depth look at the Xbox One’s SDK leak. The leak, for those unfamiliar with it, provided a glimpse of what lies inside the Xbox One’s silicon, and how the machine has evolved over the years Microsoft spent designing it and since its release back in November 2013. It was evident Microsoft had not only changed the focus of the machine (by lessening the emphasis on Kinect and pushing more of those resources to game developers) but had also made rather large improvements to the machine’s drivers and SDK (completely rewriting the console’s graphics driver to be more efficient).
In this article we’ll be focusing on GPU performance, tackling things such as the two Graphics Command Processors present inside the Xbox One and analyzing their potential use and purpose. To do this we’ll also need to dive a little into how the various components function, and of course we’ll break things down to keep it as simple as possible. Without wasting further word count, let’s begin, eh?
We’ve discussed the basics of the hardware inside Microsoft’s Xbox One several times by now, and if you need a refresher I’d suggest you read the first part of our SDK leak analysis. I’d also suggest reading it if you haven’t and you want a more complete understanding of the graphics APIs, software, buses and bandwidth of the system (which we’ll touch on in this article, but which won’t be the focus).
The Xbox One’s GPU is based on AMD’s GCN (Graphics Core Next) architecture and supports DirectX 11.1; the closest desktop variant would be Bonaire (more specifically, the Radeon HD 7790 – albeit with some customization). Its API is a heavily modified version of DirectX, known as the Monolithic Driver (more info), that’s been designed to be more console-like and to do away with (among other things) layers of abstraction, thanks to the fixed hardware of the Xbox One.
To understand Microsoft’s GPU modifications, it’s vital we establish a basic understanding of AMD’s GCN architecture, which forms the basis not only of the Xbox One’s GPU, but also the PS4’s. At the highest level, the Xbox One’s GPU is composed of 12 Compute Units, or CUs for short (be aware that Microsoft doesn’t stick with AMD’s official naming convention, and instead refers to the Compute Units as Shader Cores). Technically, there are 14 CUs on the Xbox One’s die (and 20 CUs on the PS4), but two have been disabled on both consoles to increase yields, leaving 12 and 18 CUs respectively.
A Compute Unit is partitioned into four separate SIMD units (each of these can run between 1 and 10 wavefronts, which we’ll discuss in another article). Each SIMD contains 16 ALUs (Arithmetic Logic Units, sometimes referred to as shaders). In addition to the four SIMD units, you’ll find an L1 cache, an LDS (Local Data Share – Microsoft changed the convention to LSM, or Local Shared Memory, in its documentation), four texture units and a scalar unit. The scalar unit handles operations the ALUs can’t or won’t handle efficiently (for example, conditional branching and other flow control).
In addition to the CUs, the Xbox One’s GPU features 512KB of L2 cache. The L1 cache is faster, but is private to its own CU, whereas the L2 cache is a shared resource for the entire GPU. Finally, there are 16 ROPs (Raster Operation units – the final stage of rendering, which ‘assemble’ the scene). All of this means the Xbox One’s GPU contains 768 ALUs, which put out a combined performance of about 1.31 TFLOPS of computing power (not accounting for any performance that’s been allocated to the OS).
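If you’re curious where those figures come from, here’s a quick back-of-the-envelope check. The 853MHz clock is the figure Microsoft has stated publicly; everything else follows from the CU breakdown above.

```cpp
#include <cstdio>

int main()
{
    // Back-of-the-envelope check of the Xbox One GPU's peak throughput.
    const int    computeUnits = 12;
    const int    alusPerCU    = 4 * 16;   // 4 SIMDs x 16 ALUs per CU
    const double clockGHz     = 0.853;    // Microsoft's publicly stated GPU clock
    const int    flopsPerALU  = 2;        // a fused multiply-add counts as two ops

    const int    totalALUs = computeUnits * alusPerCU;                 // 768
    const double tflops    = totalALUs * flopsPerALU * clockGHz / 1000.0;

    std::printf("%d ALUs, %.2f TFLOPS peak\n", totalALUs, tflops);     // 768, ~1.31
    return 0;
}
```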
So what’s different about the Xbox One’s GPU?
The Xbox One features two ACEs (Asynchronous Compute Engines) and two Graphics Command Processors, which, along with the eight graphics contexts, has spurred a flurry of gaming websites to report the news – but it raises several questions, which we’ll attempt to answer in this very article. The first question: what resources are actually available to developers? After all, we know there’s 8GB of RAM in the Xbox One, but 3GB is allocated purely to the OS, leaving just 5GB for games. The second question: while it might sound impressive, how ‘different’ is it from either the PS4 or the plain, regular GCN architecture? The third question (and the one I suspect most of you will care about): what does it do for games? Will they have better detail, run at a higher resolution… a higher frame rate?
First, though, we need to discuss what graphics contexts, ACEs and Graphics Command Processors (GCPs) actually are. If you’ve a good understanding of these, you can skip the next few paragraphs! If you’re unfamiliar, then consider the information below a very basic primer to give you an understanding of what’s going on.
Graphics Context: Think of a graphics context as a bundle of GPU state that can be saved and then resumed at a later date. It defines the basic drawing attributes of a scene, such as the colors to use, the basic layout of the scene and so on. The next logical thing to discuss is context switching, which is business as usual for a GPU; its purpose is to keep GPU utilization as high as possible. The reason this matters is that GCN is an in-order processor (instructions are fetched, executed and completed in the order they are issued – if an instruction stalls, it causes the instructions ‘behind it’ to stall too), and so keeping the pipeline running smoothly is critical for best performance.
GCP / Graphics Command Processor: Its job (remember, there are two in the Xbox One) is to communicate with the host CPU (the AMD Jaguar CPU, in the case of the Xbox One), keep track of the various graphics states and read commands from the command buffer. In other words, its job is to tell the GPU to ‘draw stuff’ and to keep track of the various bits of data. The Graphics Command Processor tries to run ahead of the GPU as much as possible so that it’s better able to know what’s coming up and delegate work accordingly. For a very simplified example, imagine that while the GPU’s shaders are processing instruction 25, the GCP would like to be at instruction 30.
ACE / Asynchronous Compute Engine: Its job is similar to the GCP’s, but for compute work – in other words, when a developer is using the GPU for physics or other such tasks. It dispatches compute tasks to the CUs, manages resources and interprets instructions (the sketch just below shows how that work gets queued up in the first place).
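To make that slightly more concrete, here’s a minimal sketch of a developer queuing a compute job using the stock desktop D3D11 API (the Xbox One’s Monolithic Driver exposes its own variant of this, so treat it only as an approximation). The shader, buffer view and job names are purely illustrative; once the work is submitted, it’s the ACEs that schedule the resulting wavefronts onto the CUs.

```cpp
#include <d3d11.h>

// Queue a compute job on the GPU. The compute shader and UAV are assumed to
// have been created elsewhere; the names here are purely illustrative.
void DispatchPhysicsJob(ID3D11DeviceContext* ctx,
                        ID3D11ComputeShader* physicsCS,
                        ID3D11UnorderedAccessView* particleUAV,
                        UINT particleCount)
{
    // Bind the compute shader and the buffer it writes to.
    ctx->CSSetShader(physicsCS, nullptr, 0);
    ctx->CSSetUnorderedAccessViews(0, 1, &particleUAV, nullptr);

    // Launch one thread group per 64 particles (64 threads = one GCN wavefront).
    const UINT groupCount = (particleCount + 63) / 64;
    ctx->Dispatch(groupCount, 1, 1);
}
```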
Now that we’ve got that out of the way, let’s start with the GCPs, of which there are two inside the Xbox One. Unfortunately, because we’re still missing certain documentation from the SDK leak, we can only make a few educated guesses as to the second GCP’s usage, but there are a couple of leading theories. Desktop GPUs (such as the Radeon 7970) feature only one GCP, but according to rumors and leaks, the PlayStation 4 does indeed feature two GCPs as well. The leading theory is that the second GCP is for OS tasks – for example, running Snap (and other OS displays). While the Xbox One does have a percentage of its CPU reserved purely for the OS (one full core and 20 percent of another in extended mode, or two cores in regular mode – depending on whether developers wish to make use of the custom Kinect voice commands), the OS’s visuals still need to be rendered on screen.
It’s also possible that the second GCP is responsible for helping issue additional graphics contexts (which we’ll discuss soon), but the first explanation is more likely. Throughout the SDK documentation there’s only reference to a single GCP, which likely means developers simply do not have access to the second one (otherwise one would assume it’d be mentioned, along with how best to leverage the performance of both of them). We can speculate that there’s a good chance Microsoft did indeed customize the command processor(s) to an extent, thanks to a quote from one of the Xbox One’s architects, Andrew Goossen, speaking with Eurogamer: “We also took the opportunity to go and highly customise the command processor on the GPU. Again concentrating on CPU performance… The command processor block’s interface is a very key component in making the CPU overhead of graphics quite efficient. We know the AMD architecture pretty well – we had AMD graphics on the Xbox 360 and there were a number of features we used there. We had features like pre-compiled command buffers where developers would go and pre-build a lot of their states at the object level where they would [simply] say, “run this”. We implemented it on Xbox 360 and had a whole lot of ideas on how to make that more efficient [and with] a cleaner API, so we took that opportunity with Xbox One and with our customised command processor we’ve created extensions on top of D3D which fit very nicely into the D3D model and this is something that we’d like to integrate back into mainline 3D on the PC too – this small, very low-level, very efficient object-orientated submission of your draw [and state] commands.”
Just how Microsoft “highly customised” the GCP is, of course, up for debate. It’s possible we’ll see some DirectX 12 functionality, such as draw bundles – but it’s too difficult to know for certain. Many zeroed in on the phrase “In particular, compute tasks can leapfrog past pending rendering tasks, enabling low-latency handoffs between CPU and GPU” – but in reality that’s pretty much business as usual for the GCN architecture.
It’s always possible that the customizations were blown out of proportion by Microsoft, but the rumor behind the scenes is that they did implement some changes… it’d make sense, given that the Xbox One’s Monolithic Driver supposedly helped inspire DX12 (well, that and a nice dose of AMD’s Mantle technology – once again, if the rumors are accurate).
Regarding the number of Graphic Contexts, let’s first read over what Microsoft says on the matter: “The Xbox One GPU has eight graphics contexts, seven of which are available for games. Loosely speaking, a sequence of draw calls that share the same render state are said to share the same context. Dispatches don’t require graphics contexts, and they can run in parallel with graphics work.” In a different part of the leaked document it says: “The number of deferred contexts you want to create during initialization time depends on the maximum number of parallel rendering tasks the engine needs to perform anytime during rendering. Although the system allows a maximum of 48 deferred contexts to exist at any one time, in general you shouldn’t create more than six deferred contexts at once, because that’s how many cores you have in the game OS. Of course, it is up to you to precisely tailor your thread usage for maximum efficiency. For example, if one deferred-context thread is waiting for a direct memory access (DMA) operation, you can swap in another deferred context thread to use the otherwise-wasted CPU time on the same core. In this case, having more than one deferred context and deferred context threads per core prevents a CPU bubble.”
So what is a ‘deferred context’? The keyword is ‘deferred’ – it means that calls (think instructions) aren’t executed straight away; instead, they are recorded into a command list to be executed later.
You’ll notice that 48 divides rather neatly by 6 (the number of CPU cores developers have by default, unless they opt to free up some of the seventh core from its Kinect and OS masters). Because of this, only six (or possibly seven) cores are able to submit work at any one time. But remember, some operations take longer than others to complete. If a thread only ever works on its ‘current’ context, its CPU time goes idle while it waits, which is clearly far from ideal – you want to keep usage as close to 100 percent across all available cores as you can. Thus, if one deferred-context thread is waiting on a slow operation (in Microsoft’s example, a DMA transfer), another deferred context can be swapped in on the same core that’s currently waiting, as sketched below.
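For reference, here’s roughly what that pattern looks like with the standard desktop D3D11 deferred-context API – the Xbox One’s Monolithic Driver has its own flavour of this, so consider it an approximation rather than the console’s actual API. A worker thread records draw calls into a command list, which the main thread later replays on the immediate context:

```cpp
#include <d3d11.h>

// Worker thread: record rendering work on a deferred context.
// 'device' is assumed to have been created with D3D11CreateDevice elsewhere.
ID3D11CommandList* RecordWork(ID3D11Device* device)
{
    ID3D11DeviceContext* deferred = nullptr;
    device->CreateDeferredContext(0, &deferred);

    // ... issue state changes and Draw() calls on 'deferred' here;
    // nothing runs on the GPU yet, the calls are only recorded.

    ID3D11CommandList* commandList = nullptr;
    deferred->FinishCommandList(FALSE, &commandList);
    deferred->Release();
    return commandList;
}

// Main thread: submit the recorded work on the immediate context.
void SubmitWork(ID3D11DeviceContext* immediate, ID3D11CommandList* commandList)
{
    immediate->ExecuteCommandList(commandList, FALSE);
    commandList->Release();
}
```

In an engine you’d typically run several RecordWork-style jobs across the available cores, then execute the resulting command lists in whatever order the frame requires.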
We don’t have enough information to make a guess about the PlayStation 4, but we can make one about the PC thanks to a variety of documents that have been released, including the Southern Islands programming guide. On page 14 it states: “MAX_CONTEXT; Maximum Context in Chip. Values are 1 to 7. Max context of 0 is not valid since that context is now used for the clear state context. For example, 3 means the GPU uses contexts 0-3, i.e., it utilizes 4 contexts.”
According to that documentation, the typical desktop GCN architecture processes a single graphics context at a time. Naturally (as we’ve just seen above), it’s possible to operate on multiple contexts, but to do so they must run serially, with the GPU context switching between them. GCN also handles compute, and the number of compute contexts it can process depends on the number of ACEs available. On CPUs, the length of time a process is allowed to run before being switched out is known as a ‘time slice’. The length of each time slice is critical to balancing system performance against process responsiveness – if the time slice is too short, the scheduler consumes too much processing time; if it’s too long, processes take longer to respond to input.
A lot will obviously change in the future with DirectX 12 – though how this integrates with the Xbox One isn’t yet known. With the current DirectX 11 model, the CPU talks to the GPU from a single core at a time (in other words, not in parallel). Under DX12 this changes, because each core can record and issue instructions to the GPU simultaneously. How much of a difference this makes in GPU-bound scenarios (particularly for the Xbox One) remains to be seen, but for CPU-bound scenarios (particularly on Windows) it’ll be a nice performance increase.
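Based on what Microsoft has shown of D3D12 publicly so far (the API could still change before release, so treat this as a rough sketch rather than final code), the per-thread submission model looks something like this – each worker thread records its own command list, and the main thread hands the whole batch to the GPU’s command queue in one go:

```cpp
#include <d3d12.h>

// Each worker thread records into its own command list (the allocator and
// list are per-thread; both are assumed to have been created elsewhere).
void RecordOnWorkerThread(ID3D12GraphicsCommandList* cmdList)
{
    // ... record draw/state commands here, fully in parallel with other threads.
    cmdList->Close();   // finish recording
}

// Main thread: hand all the recorded lists to the GPU in one call.
void Submit(ID3D12CommandQueue* queue,
            ID3D12CommandList* const* lists, UINT count)
{
    queue->ExecuteCommandLists(count, lists);
}
```

Contrast that with the D3D11 deferred-context sketch earlier, where everything still has to funnel through the single immediate context.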
Compute on the Xbox One runs in parallel with graphics workloads, and a set of ‘fence’ APIs synchronizes execution between these contexts and the CPU. The L1 cache is shared between both compute and graphics data. We also know that Microsoft didn’t implement a ‘volatile bit’ inside the GPU’s L2 cache – a feature that lets an application selectively invalidate (for instance) individual cache lines. Mark Cerny discussed this with Gamasutra, where he said: “…to support the case where you want to use the GPU L2 cache simultaneously for both graphics processing and asynchronous compute, we have added a bit in the tags of the cache lines, we call it the ‘volatile’ bit. You can then selectively mark all accesses by compute as ‘volatile,’ and when it’s time for compute to read from system memory, it can invalidate, selectively, the lines it uses in the L2. When it comes time to write back the results, it can write back selectively the lines that it uses.”
We know this isn’t implemented in the Xbox One because the SDK documentation specifically says: “Note that cache flushes affect the entire cache; range-based cache flushes are not supported by the hardware. This may affect any GPU work executing on the graphics context at the same time.” This means that full cache flushes are the only way to resolve certain situations, which obviously carries a performance penalty (as Microsoft’s own note describes).
Another slight weakness of the Xbox One’s GPU (compared to, say, the PlayStation 4’s, or indeed a more modern PC GPU such as the R9 290) is that the total number of ACEs is lower. While the two ACEs on the Xbox One can handle eight queues each (for 16 compute queues in total), the PlayStation 4 (and modern desktop GPUs) supports eight ACEs. In the case of the PS4, this means the GPU can handle a total of 64 compute queues, which, combined with the L2 volatile bit, certainly gives the PS4 a bit of a helping hand in certain situations.
Stay tuned for the next part, where we’ll tackle the Xbox One’s audio processor and other bits and bobs inside the system!
Further References:
In-Order vs Out of Order Processing
GCN Architecture White Paper
Pre-emption / Time Slicing
Nvidia Deferred Context
AMD Southern Islands Programming Guide