A drilldown of new Xbox hardware has become something of a tradition for Microsoft, and after the rather extensive rundown of the Xbox One console before its 2013 launch, I was very excited to see what Microsoft would reveal of the Xbox Series X architecture.
There are still some mysteries surrounding the machine, but overall Microsoft has provided pretty extensive insight into the inner workings of the console, particularly the SoC and memory configuration of the Xbox Series X.
As there’s so much information provided here, I want to go through things as a two-parter, and of course, there’ll be an accompanying video too.
Let’s start things out with an overview of the SoC itself, the so-called ‘physical view’ of what Microsoft has managed to jam into the confines of the silicon. The chip is made using TSMC’s N7 Enhanced process, though when probed with questions after the event, Microsoft declined to state exactly what the enhancements were.
According to the AnandTech write-up, the answer was: “It’s not base 7nm, it’s progressed over time. Lots of work between AMD and TSMC to hit our targets and what we needed.”
The die measures 360.4mm2, which is virtually identical to the size of the Xbox One X die. The difference, of course, is the 7nm process versus the 16nm process used in the Xbox One X, and the 15.3 billion transistors in the Xbox Series X, which more than double the 7 billion found in the Xbox One X.
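For a sense of just how much the process shrink buys, here's a quick back-of-the-envelope density calculation using the stated die sizes and transistor counts (my own illustrative arithmetic, not figures Microsoft quoted):

```python
# Back-of-the-envelope density figures from the publicly stated numbers:
# 15.3B transistors at 7nm (Series X) versus ~7B at 16nm (One X), both
# on a roughly 360mm2 die.

def density(transistors_billion, die_mm2):
    """Millions of transistors per square millimetre."""
    return transistors_billion * 1000 / die_mm2

series_x = density(15.3, 360.4)   # the new 7nm die
one_x = density(7.0, 360.4)       # the old 16nm die, same footprint
print(f"Series X: {series_x:.1f}M/mm2, One X: {one_x:.1f}M/mm2")
print(f"Density ratio: {series_x / one_x:.2f}x")
```

Roughly twice the logic in the same footprint, which is exactly what makes the GPU budget we discuss below possible.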
Here we can also see the first ‘hint’ of how the APU itself is divided up, with the GPU taking up a very significant portion of the overall die space. The GPU of this console is, quite frankly, massive. At the top and sides are the GDDR6 memory controllers (10 total), and of course, we also see the various CPU cores too, which are arranged in two clusters (four cores in each cluster). Yep, the Xbox Series X definitely isn’t a small APU.
Microsoft doubles down on the specs of the machine, and so far it doesn’t confirm any upgrades over what we had already known. There is, however, confirmation of several new things which I’d guessed at several times in videos (for example, the size of the cache of the various CPU clusters).
Speaking broadly, the GPU runs at 1.825GHz with 52 Compute Units (which works out to 3328 Streaming Processors, or shaders if you prefer), and this comes to about 12.1 TFLOPS of raw FP32 performance (that’s full-precision operations).
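Those three numbers are all tied together. As a quick sanity check, here's how the ~12.1 TFLOPS figure falls out of the CU count and clock, assuming the standard RDNA layout of 64 streaming processors per CU, each retiring one fused multiply-add (two FP32 ops) per clock:

```python
# Deriving the ~12.1 TFLOPS figure from the stated specs.
cus = 52                     # active compute units
sp_per_cu = 64               # streaming processors per RDNA CU
clock_ghz = 1.825
flops_per_sp_per_clock = 2   # one FMA = a multiply plus an add

shaders = cus * sp_per_cu                                  # 3328 SPs
tflops = shaders * flops_per_sp_per_clock * clock_ghz / 1000
print(f"{shaders} shaders -> {tflops:.3f} TFLOPS FP32")    # ~12.147
```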
Thoughtfully, Microsoft also provides a SoC Block Diagram, which shows the basics of how the APU communicates with different parts of itself, and of course how it also talks to other parts of the system.
The Zen 2 slide proudly states the cores’ cache, and once again pushes that the two clusters are separate, but still, the Zen 2 cores positively crush the Jaguar CPUs in either of the previous generation consoles, as seen in these benchmarks. Yes, these are PS4 Pro Geekbench results, but the PS4 Pro and Xbox One X use virtually the same CPU, with the Sony machine running about 200MHz slower than the Xbox.
You can see that the Ryzen 7 3700X (also powered by 8 of AMD’s Zen 2 cores) comfortably dominates the Jaguar processors, giving you perhaps some idea of what to expect in terms of a generational leap in CPU performance. As a note, I manually held the CPU here at 3.5GHz, which was the rumored clock for both next-generation consoles at the time I made the graphs, but given this is only 100MHz slower than the Xbox Series X CPU with SMT enabled, and the Xbox Series X has a smaller cache, I think the results are still fairly accurate.
The CPU has two different modes, SMT on and off (so either 16 or 8 threads), and this affects the clocks: 3.6GHz or 3.8GHz, respectively. Also, Microsoft confirmed in the Eurogamer interview that 1 CPU core is allocated to the operating system. To be clear, this is 1 CPU core and both of its threads (SMT). These threads are likely locked to a single core, as you don’t want some of the L2 cache (for example) of a core to be filled by OS work while also sharing it with game data. It’s better if Microsoft separates these as best it can and allocates a single CPU core.
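To illustrate the idea (and only the idea: the Xbox OS does this at the system level, not through any API a game sees), here's a rough Linux sketch of pinning a process away from a 'reserved' core using `os.sched_setaffinity`. Treating logical CPU 0 as the reserved core is an arbitrary choice for the example:

```python
import os

# Hypothetical illustration only: the Xbox OS reserves one core (both of
# its SMT threads) for itself. On Linux, the same idea can be sketched by
# pinning a process away from a "reserved" core with sched_setaffinity.
# CPU 0 as the reserved core is an arbitrary choice for this sketch.

if hasattr(os, "sched_setaffinity") and os.cpu_count() > 1:
    all_cpus = set(range(os.cpu_count()))
    reserved = {0}                       # pretend CPU 0 belongs to the OS
    os.sched_setaffinity(0, all_cpus - reserved)  # pid 0 = this process
    # From here on, this process's threads can't land on CPU 0, so its
    # L2 isn't shared between "OS" housekeeping and "game" data.
    print(sorted(os.sched_getaffinity(0)))
```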
It’s not quite clear yet though how this functions when developers run with SMT disabled on the Xbox Series X, and how the OS allocation changes accordingly.
Again, 16GB of GDDR6 memory, with the fastest pool (the 10GB one) giving up to 560GB/s of bandwidth, while the lower-interleave pool is only 6GB.
Microsoft also doubles down on the other features of the console, with the Xbox Series X supporting Sampler Feedback Streaming (SFS, which I’ve covered in a video previously but I’ll also go over further in part two of this coverage), DirectX Ray Tracing (which is mostly just Microsoft’s branding and feature set for Ray Tracing; for example, you can run ray tracing on an Nvidia graphics card using the Vulkan API, and Sony uses its own API for largely the same effects, we’ll talk about that more later), Variable Rate Shading support, and finally, Mesh Shading.
So let’s start things out with the CPU, which is again a Zen 2 based processor with 8 cores. The Xbox Series X has retained the 512KB of L2 cache per core of other implementations of Zen 2 (such as the Ryzen 3000 series for desktop), but the XSX cuts the amount of L3 cache by a significant amount (it’s 1/4 the size of the 8 core Ryzen 7 3700X). This is perfectly in line with AMD’s Renoir APU, and honestly makes sense.
The Zen 2 processor die for 8 CPU cores and its accompanying Level 3 cache is 74mm2, with a very large portion of that space taken up by the L3 cache. Quite simply, Microsoft had to consider whether the die space trade-off for L3 cache was worth it, and clearly opted to save the die space for ‘other things’ such as more CUs.
This was honestly a wise cut from Microsoft’s perspective. Microsoft also confirmed that the L3 cache and the two CPU clusters are essentially ‘separate’ on the die, and not unified.
I did report in a video that the PlayStation 5 might have a unified L3 CCX as found in AMD’s Zen 3 architecture (although the rest of the CPU functionality is only Zen 2), but I am unsure about this. The purpose of a unified cache would be to reduce the latency of accessing data held in the other cluster’s L3 cache. For example, if the L3 cache of cores 0-3 holds a piece of data that core 5 (from the other cluster) needs, there’s a larger latency penalty than simply pulling the data from the local cache.
The same could be said for a core on the same cluster, depending on where the data is held. Of course, Microsoft also states that its Zen 2 cores are ‘Server Class’, which is a rather interesting statement.
Once again, a question from a journalist, highlighted over at AnandTech, tried to dig into this: “[it] says Zen 2 is server class, but you use L3 mobile class?”
They received this answer: “Yeah our caches are different, but I won’t say any more, that’s more AMD.”
Unfortunately, this statement isn’t very enlightening: it’s clear that the clusters and caches are separate, and once again, the L3 cache is smaller, but Microsoft doesn’t specify what the changes are, and it seems to be something AMD is working on for specific products. Perhaps it’s something to do with tweaks for improved latency, but it could be something entirely different, which leaves us with one of the first big questions remaining for the Xbox Series X.
Okay – and now let’s scoot our way over to the GPU, which again is easily the largest component on the Xbox Series X’s APU. The GPU portion takes up just under 50 percent of the total die space, and contains 56 Compute Units, with 4 of these disabled for yield purposes (the same approach Microsoft and Sony used in the current generation, i.e., disabling CUs for yields). Microsoft is clocking these CUs at 1825MHz, which provides just over 12.1 TFLOPS of FP32 performance for the Xbox Series X.
Again, you cannot compare TFLOPS from one generation of consoles (or graphics processors in general) to the next, so please don’t take this as ‘just’ about double the performance from, say, the Xbox One X to the Series X. RDNA 2 offers several generations’ worth of IPC (Instructions Per Clock) increases in multiple areas, including functions which FP32 throughput cannot give any insight into, such as geometry performance, which is incredibly important in rendering games.
Interestingly, Microsoft has also indirectly confirmed that there are only 64 ROPS in the Xbox Series X too. I had long guessed this based on the leaked Arden test data from GitHub, but obviously it was possible that the final production hardware would differ.
But with the GPU specs under “GPU Evolution”, Microsoft confirms a figure of 116 GigaPixels/second. This figure can easily be reached by taking 64 ROPS and multiplying by the clock frequency (64 x 1.825GHz), which comes to 116.8. This ROP count would then be the same as Sony has for the PlayStation 5 GPU.
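You can run that arithmetic in either direction; here's a tiny sanity check that recovers the ROP count from the quoted fill rate:

```python
# The quoted fill rate divided by the clock gives back the ROP count.
clock_ghz = 1.825
gigapixels_per_sec = 116.8   # the "GPU Evolution" figure

rops = gigapixels_per_sec / clock_ghz
print(f"Implied ROPs: {rops:.0f}")                       # -> 64
print(f"Check: 64 x {clock_ghz} = {64 * clock_ghz:.1f}") # -> 116.8
```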
The very interesting thing right off the bat is that Microsoft managed to get AMD’s go-ahead to reveal so much of the RDNA 2 architecture ahead of the official reveal. Then again, we still don’t officially know how the desktop implementations of these GPUs will be specced, Nvidia can’t exactly respond by changing its designs, and it’s likely also AMD’s way of generating some hype for RDNA 2, given that PC enthusiasts, as of the time of writing, have all their eyes focused on Nvidia’s GeForce Ampere. Also, there are still a number of questions yet to be answered in this drilldown.
Starting things out, Microsoft shows that we still see a dual compute unit design; again, in this example you can see the two inactive CUs (which are darker), disabled for yield purposes. Each of these ‘Dual CUs’ for RDNA 2 basically contains two compute units, so you take the 26 active and multiply by two, which gives you the 52 active CUs Microsoft discussed earlier.
You will also spot a significant amount (5MB) of L2 cache which is accessible via the various Shader Engines (and in turn the CUs) of the GPU. This L2 cache is designed as a staging area for regularly accessed or in-flight data which doesn’t fit in the GPU’s various registers or local CU caches. While 5MB might not seem like a lot compared to, say, the 16GB of GDDR6 memory, it’s a significant amount of space to hold instructions and other data in. The purpose of cache is to stop the processor from constantly needing to hit main system memory, which is not only slower in terms of raw access and latency, but also thrashes system bandwidth.
At a more local level, each dual CU in an RDNA 2 based GPU contains its own Local Data Share. The Xbox Series X contains two Shader Engines, and in turn, each of these is sub-divided into two Shader Arrays.
You can see in the diagram that each of these “arrays” contains up to 7 Dual CUs (so 14 CUs, assuming none are disabled in that array), and each RDNA 2 Shader Array is also given its own Level 1 cache. This forms the three-level cache hierarchy which Microsoft describes in the Xbox Series X GPU diagram (L2 cache for the whole GPU, Level 1 cache for the Shader Array, and the Local Data Share for the Dual CU).
Looking deeper into the Dual Compute Unit itself, we can clearly see that each dual CU contains 4 SIMD units and 4 Scalar ALUs.
Notice that now that we’re looking inside the Dual Compute Unit, the two separate CUs are clearly visible, each with its own SIMD32 ALUs and registers and various other mirrored elements. Also notice that the Ray Acceleration and Texture Mapping Units are ‘shared blocks’ on each of the CUs. We’ll get more into that in a moment.
According to Microsoft during the lecture, RDNA 2 features about a 25 percent performance-per-clock increase over the previous generation (RDNA 1), which is quite frankly very impressive and a testament to the engineering efforts over at the Radeon Technology Group.
Okay, so now back to the hardware-based ray tracing, which of course is one of the biggest features of this next generation of console architecture. Microsoft previously confirmed that Ray Tracing was part of the Xbox Series X in an interview with Eurogamer, and confirmed that performing the BVH calculations (that is, the ray/triangle intersection testing) without RDNA 2’s Ray Tracing acceleration would require the equivalent of an extra GPU of about 13 TFLOPS.
To be very clear here, these statements are very similar to Nvidia’s comments about Pascal, the predecessor to the Turing architecture. Nvidia was keen to point out how Pascal would get hammered running these calculations in software, without dedicated RT hardware to accelerate them.
Intersection testing isn’t the only step in hardware-based ray tracing, though. With BVH (Bounding Volume Hierarchy) or another method, the scene basically has rays shot into it, with those rays calculating whether a light source (for example) interacts with a visible portion of an object on screen, and in turn whether this then bounces to another surface.
Microsoft very clearly points out 380G/sec for ray-box intersections, but only 95G/sec for ray-triangle. In other words, the figures are considerably lower for a scene which you could say is more representative of a game.
Please also realize that the 380 figure is not ‘dwarfing’ the 10 GigaRay figure from Nvidia and, say, the RTX 2080 Ti. They are measuring two totally different things. AMD’s figure (AKA what Microsoft is providing) measures AABB (axis-aligned bounding box) traversals and intersections.
It takes a ton of these box intersections combined with triangle intersections to actually resolve a meaningful ray, which is basically what Nvidia is counting with its GigaRay figure. It isn’t just that the measurements are different; they also likely ‘change’ on a per-scene basis. It would be like trying to say that Fahrenheit is Celsius, except you didn’t know the conversion from F to C, and any data point someone gave you would change a moment later because the scene changed.
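For concreteness, the ‘ray-box’ operation being counted here is essentially a ray/AABB slab test. Below is a textbook sketch of that test (my own illustrative code, not the hardware algorithm):

```python
def ray_aabb_hit(origin, inv_dir, box_min, box_max):
    """Slab test: does the ray hit the axis-aligned box?
    inv_dir holds 1/direction per axis (precomputed to avoid divides)."""
    t_near, t_far = 0.0, float("inf")
    for axis in range(3):
        # Entry/exit distances for this axis's pair of slab planes.
        t1 = (box_min[axis] - origin[axis]) * inv_dir[axis]
        t2 = (box_max[axis] - origin[axis]) * inv_dir[axis]
        t_near = max(t_near, min(t1, t2))
        t_far = min(t_far, max(t1, t2))
    # The ray hits the box if the intervals overlap.
    return t_near <= t_far

inf = float("inf")
# Ray from the origin along +x; box straddles x in [1, 2]: hit.
print(ray_aabb_hit((0, 0, 0), (1, inf, inf), (1, -0.5, -0.5), (2, 0.5, 0.5)))
# Same ray, box behind the origin at x in [-2, -1]: miss.
print(ray_aabb_hit((0, 0, 0), (1, inf, inf), (-2, -0.5, -0.5), (-1, 0.5, 0.5)))
```

A traversal runs many of these cheap box tests to walk down the BVH before any (more expensive) triangle tests happen, which is why the ray-box and ray-triangle rates are quoted separately.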
To really understand how well RDNA 2 compares to Turing or Nvidia’s soon to be released Ampere architecture, we need to run the games on various PC desktop GPUs at approximate performance tiers and see how the performance scales.
The Xbox implementation of RDNA 2 also essentially mirrors, at a basic level, what we saw from AMD’s so-called ‘hybrid Ray Tracing’ patent which had surfaced (I had previously been the first to leak that AMD would indeed include hardware-based RT in the Navi 2x series of GPUs back in March of 2019, and subsequently covered the patent fully when it was discovered).
To keep things brief here, the slide “Dual Compute Unit” makes it clear that we’re looking at 4 Texture OR Ray Operations per CU.
To put this a different way, it confirms that there are 4 Texture Mapping Units per CU for RDNA 2, so 8 total per Dual CU. This means we’re looking at 208 Texture Mapping Units for the Xbox Series X GPU (this is with the 52 active CUs, so I am not including the 4 deactivated CUs).
You can basically get either the number of texels per second or the 380G intersections listed by simply multiplying this 208 TMU figure by the clock frequency of the GPU.
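Worked through, that multiplication looks like this:

```python
# The 380G/sec peak ray-box figure falls straight out of the unit count
# and the clock: 208 texture/ray units times 1.825 GHz.
active_cus = 52
ops_per_cu = 4          # 4 texture OR ray ops per CU per clock
clock_ghz = 1.825

units = active_cus * ops_per_cu          # 208 TMUs / ray units
peak_g_per_sec = units * clock_ghz       # ~379.6, i.e. the quoted 380G
print(units, round(peak_g_per_sec, 1))
```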
Microsoft clearly states that the 380G/sec figure is the peak, with the number falling to ‘just’ 95G/sec for a more complex scene. The net performance also depends on total bandwidth and the number of nodes/tris visited per ray: in other words, how much ‘stuff’ the ray has to interact with. For example, a designer might choose to reduce the number of bounces a ray can have, or we might have a simpler scene (say, one with fewer reflective surfaces), and so on.
Notice though that shaders can still run simultaneously with BVH calculations, so the GPU isn’t just sitting idle while Ray Tracing is occurring. This is also similar to what Cerny said during the Road to PS5 event at about the 30-minute mark. It seems though that both Sony and Microsoft did tweak the customization of their GPUs somewhat.
Microsoft does state that the Ray-Triangle units have some level of customization, but doesn’t provide any real insight into how they’ve been tweaked compared to the standard desktop implementation of RDNA 2.
Essentially (and this explanation still isn’t totally complete, because we’re basing it on the AMD patent, comments from the Hot Chips conference, and the block diagrams, but we can get a pretty good high-level overview), the GPU’s shaders issue the TMU (Texture Mapping Unit) an instruction, which is in a texture format. This provides both the ray data and a pointer to the node in the BVH tree (or, whereabouts in the BVH tree this data belongs).
This information is processed by the TMU, and the ray intersection engine (which again, you can see in the block diagram) then gets fed this raw data so it can decide how best to continue the operation, by performing various tests.
Eventually, the shader units themselves receive the results of what was calculated and then work out the next node to tackle in the BVH tree.
Also, I wanted to touch briefly on compute. During the Eurogamer / Digital Foundry interview, Microsoft specifically stated that upsampling via machine learning was possible on the Xbox Series X, using lower precision operations (4 or 8-bit), and of course, the GPU can also handle half-precision (16-bit) operations too.
16-bit operations aren’t anything new (Sony pushed them heavily with the PS4 Pro, and they subsequently arrived on the desktop with the Vega architecture) and RDNA 1 already supports them, but the lower precision 4 and 8-bit operations are new.
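As a hedged illustration of why dropping to 8-bit helps at all, here's a generic symmetric int8 quantization sketch. This is a standard machine-learning technique, not anything Microsoft or AMD has specifically described:

```python
# Toy symmetric int8 quantization: store weights as small integers plus
# one scale factor, trading a little precision for much cheaper math
# and a quarter of the memory traffic of FP32.

def quantize_int8(values):
    """Map floats into [-127, 127] integers with a shared scale."""
    scale = max(abs(v) for v in values) / 127.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.82, -0.41, 0.05, -1.27]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
print(q)                                # int8-range integers
print([round(a, 3) for a in approx])    # close to the originals
```

The training step needs the full-precision values; inference can get away with the cheap integer versions, which is exactly the trade the "ML Inference Acceleration" bullet is pointing at.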
Microsoft references this with “ML Inference Acceleration” under the “other tricks” slide, and of course, this can be used for more than just upsampling the resolution of games. AI and physics are also great examples of what can run here, essentially machine learning running on the GPU as inference.
Inference is the act of running a ‘trained’ model on a GPU, which is far less computationally expensive than training the model in the first place. Microsoft specifically states there’s a very small area cost associated with this, but doesn’t detail what custom hardware is in place.
Indeed, a lot of the RDNA 2 compute functionality isn’t touched on in this specific drilldown, and I believe the lower precision operations are part of RDNA 2’s base functionality (one of my earliest sources confirmed that RDNA 2 did improve compute performance significantly over RDNA 1 too).
This makes sense from a business perspective for AMD too: given that Nvidia is increasingly using DLSS 2 as a marketing tool (though Nvidia’s technology uses the GeForce cards’ Tensor Cores to run, not taking up shader time), AMD pushing a similar upsampling technology is not just logical, but needed to stay competitive.
On the PC, 1440P 120Hz monitors are super affordable, and honestly, even mid-range and budget screens go up to 165Hz at 1440P. 4K screens are coming down in price, and this will only continue in the future. Nvidia’s DLSS 2 tech allows a game to upsample from 720P to 1440P (4x the pixels) or 1080P to 4K (again, 4x the pixels). This benefits not only the flagship cards in pushing out higher frame rates, but also the lower to mid-tier products.
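The "4x the pixels" arithmetic is easy to verify:

```python
# Each doubling of width and height quadruples the pixel count.
def pixels(width, height):
    return width * height

print(pixels(2560, 1440) / pixels(1280, 720))    # 1440P vs 720P -> 4.0
print(pixels(3840, 2160) / pixels(1920, 1080))   # 4K vs 1080P  -> 4.0
```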
Given that AMD is pushing Ray Tracing, that consoles need upsampling, and that its competitor already offers it on PC, business logic alone (even setting aside rumors and my fairly solid information) points to AMD supporting both 4 and 8-bit operations for upsampling.
There’s a lot more to learn from the Hot Chips conference, but this is already a pretty lengthy read, so I’d like to draw a few conclusions. In the next part we’ll be discussing features such as Variable Rate Shading, the memory configuration of the Xbox Series X, audio processing and, of course, features such as SFS.
The Hot Chips conference really shows that the Xbox Series X is a very powerful and well-thought-out machine. Microsoft has done extremely well in the design of the APU, and the sheer density of this thing (a 7nm process with over 15 billion transistors in about 360mm2) makes it an impressive feat of engineering. Microsoft’s decisions here closely match what I had expected prior to the event: throw a lot of parallel computing performance at the problem, and largely allow developers to choose how and where to allocate the resources of the CPU and GPU.
Microsoft freely admits that technology such as Ray Tracing will be just a tool in the arsenal of developers, and it will accompany traditional rendering technology (AKA hybrid rendering). I do suspect that some games with certain artistic visions or specific looks may dabble in full path tracing for both next-generation consoles, but they’ll likely be indie affairs. The DXR Minecraft demo shows that there’s a lot of scope here though, and obviously this is still a pretty early example of the technology.
I will likely get a lot of messages about how the Xbox compares to the PlayStation 5, and while we’re still waiting for the full breakdown of the PS5 architecture, from what Cerny and his team have revealed so far, and given what I’ve been told under the table, my opinion of both machines hasn’t really changed. The Xbox Series X is a very powerful, monstrous piece of kit designed for raw parallel performance, while Sony’s machine is designed for raw throughput. These two approaches might sound similar, but they’re rather different.
Ultimately, for the ‘average’ game which doesn’t push either console to its limits, which console performs best will probably depend on what the lead development system was, which game engine was used and, of course, developer talent. But if a game fully leverages the platforms, each console will essentially have its own positives and negatives.
I’ll go deeper into this in a future article and video, hopefully after Sony finishes revealing everything. But in the meanwhile, I hope you’ll join me for part two.