With GDC 2015 fast approaching, it’s high time we got through the last major hardware parts of the Xbox One’s SDK leak. In this particular article, we’ll cover the DMA Move Engines and the remaining questions around the Xbox One’s memory bandwidth and copy operations. In the fourth part, we’ll tackle the SHAPE audio processor and the remaining hardware components!
There are four DMA Move Engines built right onto the GPU block, and they are ‘fixed function’ (meaning you can’t tell them to do something completely different; they’ve a job to do and that’s all they’re happy performing. This is the opposite of, say, a CPU, which can run a vast variety of code and instructions). So we’re on the same page, DMA stands for “Direct Memory Access” and, as you’ve probably gathered by now, their job is to move data around the Xbox One’s memory system. The idea behind their implementation is that they can move data around the machine while reducing the overhead on either the console’s CPU or GPU, and they can compress data as they move it about (and perform a few other functions). Move Engines aren’t a completely new technology, and consoles and other systems have used similar technology before, but in some cases it wasn’t fixed function. Jumping back a generation, Sony’s PlayStation 3 used the Cell Processor, and its SPUs could also do these types of operations (for more on the Cell processor, see here). It’s important to note, though, that the SPUs were quite crucial for graphics (and other operations), and thus using them for moving data about wasn’t ideal.
The four Move Engines can perform copy operations to and from any combination of system memory (the DDR3 RAM) and ESRAM. While this data is being moved about, it’s possible for the Move Engine to perform a “Texture Swizzle” or “un-swizzle” (which means you’re effectively rearranging how the texture’s data is laid out in memory. If you’re so inclined to read more, check this link, as an in-depth explanation would eat up a lot of word count. Microsoft say that this can handle any of the Xbox One’s various tiling patterns for textures).
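To give a feel for what a swizzle does, here is a minimal Python sketch using Morton (Z-order) layout, one widely used swizzle pattern. This is purely an illustration: we are not claiming it matches any of the Xbox One’s actual tiling modes, and the function names are our own.

```python
def morton_index(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y (x lands in the even bit positions)."""
    index = 0
    for i in range(bits):
        index |= ((x >> i) & 1) << (2 * i)
        index |= ((y >> i) & 1) << (2 * i + 1)
    return index


def swizzle(linear, width, height):
    """Reorder a row-major texel list into Morton order.

    Assumes a square texture whose side is a power of two.
    """
    out = [None] * (width * height)
    for y in range(height):
        for x in range(width):
            out[morton_index(x, y)] = linear[y * width + x]
    return out


def unswizzle(swizzled, width, height):
    """Inverse of swizzle(): back to plain row-major order."""
    out = [None] * (width * height)
    for y in range(height):
        for x in range(width):
            out[y * width + x] = swizzled[morton_index(x, y)]
    return out


texels = list(range(16))          # a 4x4 texture stored row by row
tiled = swizzle(texels, 4, 4)     # neighbouring texels now sit closer in memory
restored = unswizzle(tiled, 4, 4)
```

The point of the rearrangement is that texels which are close together on screen end up close together in memory, which tends to suit the way GPUs fetch texture data.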
Developers can access three of the DMA engines by using the Xbox One’s System Direct Memory Access. Microsoft suggests the data being moved is typically something like textures (or other such image data); the engines aren’t really so well suited to audio or video. The fourth engine is shared between both system and title (in other words, game) usage.
- Move engine 1: Plain copy, swizzle/unswizzle (title exclusive use)
- Move engine 2: Plain copy, swizzle/unswizzle (title exclusive use)
- Move engine 3: Plain copy, swizzle/unswizzle, Lempel-Ziv (LZ) lossless encode/decode (title exclusive use)
- Move engine 4: Plain copy, swizzle/unswizzle, JPEG decode (title/system shared use)
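The breakdown above can be written down as a small lookup table. This is only a sketch in Python for illustration; the dictionary layout, the operation names and the helper function are our own invention and have nothing to do with the real SDK’s API.

```python
# Capabilities per Move Engine, as listed in the breakdown above.
BASE_OPS = {"copy", "swizzle", "unswizzle"}

MOVE_ENGINES = {
    1: {"ops": BASE_OPS, "shared_with_system": False},
    2: {"ops": BASE_OPS, "shared_with_system": False},
    3: {"ops": BASE_OPS | {"lz_encode", "lz_decode"}, "shared_with_system": False},
    4: {"ops": BASE_OPS | {"jpeg_decode"}, "shared_with_system": True},
}


def engines_supporting(op, title_exclusive_only=False):
    """Return the engine numbers that can perform `op`.

    With title_exclusive_only=True, engines shared with the system are skipped.
    """
    return sorted(
        number
        for number, engine in MOVE_ENGINES.items()
        if op in engine["ops"]
        and not (title_exclusive_only and engine["shared_with_system"])
    )


print(engines_supporting("copy", title_exclusive_only=True))
print(engines_supporting("jpeg_decode"))
```

In other words, a plain copy can go to any engine, while the compression and JPEG work each have only one place to go.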
According to the Xbox One’s official SDK, Move Engines operate at 256 bit (the same as the bus for the rest of the X1) and the system has 25.6 GB/s of read and 25.6 GB/s of write memory bandwidth shared between the DMA engines as follows:
- DMA engine 1, DMA engine 3, display scan out and write-back, and video decoding.
- DMA engine 2, DMA engine 4, video encoding, and vertex indices DMA.
When the GPU is busy writing data and a DMA engine is told to copy data from one type of memory to another (DDR3, ESRAM), the GPU shares the bandwidth evenly between the source and the destination. In other words, you’ll not be reading faster than you’re writing, for example.
Microsoft points out that “Display Scan Out” (in a nutshell, this means the ‘finished’ image, or frame of animation, that’s currently held in the frame buffer is going to be sent on its merry way out of the console’s HDMI connection) consumes about 3.9GB/s of read bandwidth. Their maths for this works with 3 display planes x 4 bytes per pixel x the HDMI limit of 300 megapixels per second, and a second term of 30 bits per pixel x 300 megapixels per second.
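As a rough check of that arithmetic, here is a small Python sketch that simply multiplies out the terms quoted above. The constant names are our own, and we are assuming decimal units (1GB = 1,000,000,000 bytes). Multiplied out this way, the first product comes to 3.6GB/s, a little under the 3.9GB/s quoted, so treat this as a sanity check rather than a reconstruction of Microsoft’s exact sum.

```python
DISPLAY_PLANES = 3
BYTES_PER_PIXEL = 4
HDMI_PIXELS_PER_SECOND = 300_000_000  # the 300 megapixels per second HDMI limit

# First term: 3 planes x 4 bytes per pixel x 300 megapixels per second.
scan_out_bytes_per_second = DISPLAY_PLANES * BYTES_PER_PIXEL * HDMI_PIXELS_PER_SECOND

# Second term: 30 bits per pixel x 300 megapixels per second, converted to bytes.
second_term_bytes_per_second = 30 * HDMI_PIXELS_PER_SECOND // 8

print(scan_out_bytes_per_second / 1e9, "GB/s")
print(second_term_bytes_per_second / 1e9, "GB/s")
```

Either way, both figures are small change next to the 25.6GB/s buses these clients sit on.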
As mentioned above, the Move Engines are capable of performing LZ77 encoding or decoding, which is a lossless form of compression. If you save an image in, say, the .jpg format (particularly if you shrink the image’s file size down significantly) you’ll spot obvious quality loss. This is known as lossy compression: true, the file size is shrinking, but you’re also ‘throwing out’ data in the process. Compare that to, say, zipping a file into an archive; you’re not ruining the file, you’re simply compressing it.
This is great for textures (or some other piece of data) that won’t be used for a while (several frames) but must remain in RAM. The data can be ‘moved out’ of, say, ESRAM into DDR3 and compressed along the way. LZ77 isn’t a new format; it dates back to 1977 and is used by Zlib, Glib and other similar libraries. The very same Move Engine that supports LZ decoding supports JPEG decoding too.
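To show what ‘lossless’ means in practice, here is a short Python sketch using the standard zlib module, whose DEFLATE format is built on LZ77 (plus Huffman coding). This runs in software on a made-up, highly repetitive buffer; it is only a stand-in for the idea and says nothing about how the Move Engine hardware itself behaves.

```python
import zlib

# A made-up, very repetitive 16KB buffer standing in for texture data.
texture = bytes([0x10, 0x20, 0x30, 0x40]) * 4096

compressed = zlib.compress(texture, level=6)
restored = zlib.decompress(compressed)

# Lossless: every byte comes back exactly as it went in.
assert restored == texture

print(len(texture), "bytes before,", len(compressed), "bytes after")
```

Real texture data won’t shrink anywhere near as well as this artificial buffer, but the round trip is always exact, which is the whole point compared with JPEG.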
Xbox One Memory Bandwidth
The peak of the Xbox One’s instantaneous read bandwidth is technically 204GB/s, which for a console is a huge number. But Microsoft are quick to point out that this speed isn’t sustainable in the real world. The maximum combined read bandwidth from the X1’s DRAM (in other words, the 8GB of DDR3 inside the system) and the ESRAM can only be 170GB/s, and so that’s the figure Microsoft quotes as the max read speed of the GPU. Microsoft’s figure of 133GB/s of ESRAM bandwidth was attained by ‘alpha blending an FP16 x 4 render target’.
On the same subject, Brad Wardell from StarDock recently discussed the Xbox One’s future with DirectX 12, and indeed said Microsoft’s console has “crumby” bandwidth due to the DDR3 RAM. He isn’t sure what long term impact ESRAM will have, or whether it will resolve all of the issues the console faces. The good news is that he did say it’s very likely the CPU and GPU will have enough grunt to deal with the increased number of draws (and other goodies) DX12 brings to the table.
Not all functions of the GPU share the same memory bandwidth, and indeed several functions share a lower bandwidth of only 51.2 GB/s bidirectional. Not all of these functions were listed inside the leaked SDK documents, but here are those which were listed: Video Encoding / Decoding Engines, Front Buffer Scan Out, DMA Engines, Command Buffer and Vertex Index Fetch DMA.
Internally the hub clients are connected to one of two 25.6 GB/s internal buses. If two hub clients (in other words, two sets of components) aren’t sharing the same single bus, they’re able to use up to 25.6 GB/s of memory bandwidth each. On the other hand, if they do share this bus, then the bandwidth of that bus is split between them.
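The bus arrangement described above can be sketched as a tiny Python model. The bus names and client labels are our own, and the even split between active clients on the same bus is an assumption on our part; the documentation only says the bandwidth is split, not how.

```python
BUS_BANDWIDTH_GBPS = 25.6

# Which hub clients sit on which internal bus, per the two groups listed earlier.
# "bus_a" and "bus_b" are made-up labels.
HUB_BUSES = {
    "bus_a": ["dma_engine_1", "dma_engine_3", "display_scan_out_writeback", "video_decode"],
    "bus_b": ["dma_engine_2", "dma_engine_4", "video_encode", "vertex_index_dma"],
}


def per_client_bandwidth(active_clients):
    """Estimate GB/s per active client, assuming an even split on each bus."""
    result = {}
    for clients in HUB_BUSES.values():
        active = [c for c in clients if c in active_clients]
        for client in active:
            result[client] = BUS_BANDWIDTH_GBPS / len(active)
    return result


# Engines 1 and 2 are on different buses, so each can reach the full 25.6.
print(per_client_bandwidth({"dma_engine_1", "dma_engine_2"}))
# Engines 1 and 3 share a bus, so they split it.
print(per_client_bandwidth({"dma_engine_1", "dma_engine_3"}))
```

The practical upshot is that it matters which pair of Move Engines a game leans on at the same time.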
|Source memory|Destination memory|Maximum read bandwidth (GB/s)|Maximum write bandwidth (GB/s)|Maximum total bandwidth (GB/s)|
In the above example, you’ll spot that the maximum transfer rates available from the Xbox One’s 8GB of DDR3 act as the slowest ship in the convoy. The ESRAM can’t have data copied into it faster than that data is read. Think of it like you’re a really, really fast typist, copying text from a box that pops up on screen. Let’s say you can type 100 words a minute, but the box only puts up 50 words a minute; you can’t enter the words any faster than 50, because that’s all you’ve got to work with.
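The ‘slowest ship in the convoy’ rule boils down to taking the smaller of the two rates. Here is a minimal Python sketch using the typing analogy’s numbers; the function names are our own and the figures are the ones from the analogy, not real memory speeds.

```python
def effective_copy_rate(source_read_rate, destination_write_rate):
    """A copy can only go as fast as the slower of its two ends."""
    return min(source_read_rate, destination_write_rate)


def time_to_copy(amount, source_read_rate, destination_write_rate):
    """Time taken to move `amount` units at the effective rate."""
    return amount / effective_copy_rate(source_read_rate, destination_write_rate)


# The box shows 50 words a minute; the typist could manage 100.
rate = effective_copy_rate(source_read_rate=50, destination_write_rate=100)
minutes = time_to_copy(500, source_read_rate=50, destination_write_rate=100)
print(rate, "words per minute;", minutes, "minutes for 500 words")
```

Swap the words for gigabytes and the typist for ESRAM, and you have the DDR3 to ESRAM copy case.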
In part two of our SDK analysis, when we were taking an in-depth look at the Xbox One’s GPU architecture, we’d pointed out a major weakness in the cache system of the GPU block compared to Sony’s PlayStation 4. Mark Cerny, the lead architect behind the PS4, had revealed that one modification Sony implemented specifically in the console’s L2 cache was the so-called ‘Volatile Bit’. This allows developers (or rather the code they run) to be super accurate and invalidate individual cache lines inside the system’s L2 cache, which is particularly helpful while dealing with compute functionality. The other option is a costly cache flush.
Microsoft made references to this fact several times in the documentation, some of which we’d discussed (once again) in part two. But in another reference (which I’d missed initially) they also added “Because the GPU is I/O coherent, data in the GPU caches must be flushed before that data is visible to other components of the system”.
In the image above, you’ll spot that 3GB/s per direction of bandwidth is eaten up by audio, camera, HDD (hard drive access) and PCIe components. Microsoft do point out that the Xbox One’s sensor is the one which gobbles up most of this memory bandwidth; the HDD will only eat about 50MB/s (which is pretty much nothing). This is also an example of how the later SDKs have benefited developers: effectively, they free up more of the North Bridge’s coherent memory bandwidth. Well, we can assume so anyway, based on what Microsoft have said about disabling Kinect freeing up bandwidth for the console’s RAM. Once again, to reiterate, it’s important to remember that you as a customer simply unplugging your Kinect won’t work. It requires the developers to specifically ask for this via functions, similar to the additional GPU reserves or seventh CPU core.
On the subject of the CPU for a moment, we can see how titchy the bandwidth used by the CPU is in comparison to the hungry GPU. Each module hits 4GB/s read, or 3.5GB/s WriteCombine. There’s also bandwidth available to the CPU that’s non-coherent. So what does all of this mean? In most operations, Microsoft asserts that the North Bridge’s 30GB/s of total bandwidth is fine and dandy, but there are scenarios (when a lot of coherent data comes in) where things can get a bit slower, and thus you might experience increased access and latency times.
How Much DDR3 Bandwidth Does The Xbox One’s GPU Have?
We can make a few typical assumptions using the SDK: that the bandwidth eaten up by data from the North Bridge is around 25GB/s, that a portion of the GPU coherent data misses the caches, and that non-coherent CPU traffic is 3GB/s. This means that the Xbox One’s GPU is left with a theoretical 42GB/s of bandwidth from the console’s main memory pool. Once again, this is assuming ‘normal’ operation and typical workloads, and likely with the Kinect reserves in place, so this might have been improved a little.
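For anyone who wants to play with those assumptions, here is a back-of-envelope version in Python. The constant names are our own, and the 68GB/s peak is the DDR3 figure mentioned elsewhere in this article. Plain subtraction from that peak lands at about 40GB/s, in the same region as the 42GB/s quoted; the exact inputs behind the 42GB/s figure aren’t spelled out here, so treat this as a rough sketch only.

```python
DDR3_PEAK_GBPS = 68              # peak DDR3 bandwidth quoted in this article
NORTH_BRIDGE_COHERENT_GBPS = 25  # assumed coherent traffic via the North Bridge
NON_COHERENT_CPU_GBPS = 3        # assumed non-coherent CPU traffic


def gpu_ddr3_budget(peak, coherent, non_coherent_cpu):
    """Whatever DDR3 bandwidth is left once the other consumers take their cut."""
    return peak - coherent - non_coherent_cpu


remaining = gpu_ddr3_budget(
    DDR3_PEAK_GBPS, NORTH_BRIDGE_COHERENT_GBPS, NON_COHERENT_CPU_GBPS
)
print(remaining, "GB/s left for the GPU from main memory")
```

Change any of the three inputs (for instance, a smaller coherent load once the Kinect reserve is released) and you can see how quickly the GPU’s share moves.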
If you take a look at any mid range GPU (particularly prior to the introduction of Nvidia’s Maxwell or AMD’s Tonga architecture, which made significant improvements to compression and thus technically ‘can do more, with less bandwidth’) you’ll doubtlessly notice 40GB/s isn’t going to cut it. PC GPUs have mostly abandoned DDR3 (though a few low end cards still use it, and remember they have the full 68GB/s bandwidth of DDR3 too), and those DDR3 GPUs get trounced by their GDDR5 variants.
Thus, the Xbox One uses the ESRAM to help offset the bandwidth – we’ve discussed ESRAM a few dozen times by now, including in part one of our SDK breakdown; but there’s one area we’re yet to mention. The Xbox One is capable of running Split Render Targets. What does that mean? Well, quite simply sky textures, regions of the screen with overdraw, ground textures and so on are best rendered from slower DDR3. You can also have say the Z-Buffer (also known as the depth buffer) in ESRAM while keeping the color targets in say DRAM. In the example above, 70 percent of the image (roughly) is held in ESRAM, with the ESRAM hitting 11455K (11MB) vs the DRAM 4864K (just under 5MB).
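The ‘roughly 70 percent’ figure is easy to check from the two sizes quoted. A quick Python sketch, assuming the ‘K’ in those figures means kibibytes (1024 bytes); the names are our own.

```python
ESRAM_KIB = 11455  # portion of the render targets held in ESRAM
DRAM_KIB = 4864    # portion held in DDR3


def esram_share(esram_kib, dram_kib):
    """Fraction of the total render target memory that lives in ESRAM."""
    return esram_kib / (esram_kib + dram_kib)


share = esram_share(ESRAM_KIB, DRAM_KIB)
print(f"{share:.1%} in ESRAM")
print(f"{ESRAM_KIB / 1024:.2f}MB in ESRAM vs {DRAM_KIB / 1024:.2f}MB in DRAM")
```

So the split really is about 70/30, with the bandwidth-hungry portion kept in the faster memory.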
It’s not a perfect scenario, and optimizing to reduce memory bandwidth usage is ‘a key strategy for Xbox One’ according to Microsoft.
With all of that said – what does that mean for the system? The Xbox One’s a more complicated beast in terms of the distribution of memory than Sony’s Playstation 4, and this isn’t news; we’ve discussed the same thing a few times now. But Microsoft have provided a few tools and hardware components to ease the performance penalties.
While GPU performance improvements (and freeing up GPU resources from Kinect) have grabbed a lot of gaming news headlines, it’s possibly the additional bandwidth (both from buses and from RAM itself) that will prove to be the more flexible and helpful tool for developers.
Reading through the SDK documentation, there’s another point which is abundantly clear: there have been major improvements to the bug finding and performance analysis tools available for developers (for example, PIX, the Performance Investigator for Xbox), allowing developers to better understand what’s gobbling up bandwidth and better optimize, say, the code or assets which are causing the bottleneck.
Stories have floated about (and of course, none of this is backed by Microsoft, so it’s theories, smoke and mirrors) that ESRAM was selected because Microsoft weren’t comfortable about the pricing or abundance of GDDR5 (in the 8GB quantity they knew they’d need to pack into the Xbox One). With their early focus on Kinect and media, they were forced to have a lot of RAM in the system, and thus couldn’t run the risk that GDDR5 prices wouldn’t be stable, or that there’d be shortages. Thus, the Move Engines and DDR3 architecture were used.
Apparently, many developers (including some first party devs) hadn’t known the final retail PS4s would be running 8GB of RAM, and it had been a very last minute decision by Sony. Supposedly, many developers had been running under the assumption the system would have 4GB total (which begs the question of how much RAM would have been reserved by the system and how much would have been available to developers). Reading through the documentation, there are hints Microsoft wished they’d opted for the GDDR5 route. This would have given extra die space (since there’d be no ESRAM or Move Engines), and thus extra shaders could have been implemented, giving a more powerful GPU.
But of course, all of this is just theories and what ifs – and while it’s an interesting thought exercise, it doesn’t alter the here and now. It’s rather fascinating how the console has evolved over time though, isn’t it?