In a recent article, I covered how the Durango’s Move Engines are going to function, and in several other articles I’ve gone into depth on how the CPU’s performance compares with desktop parts. In this article we’ll be looking at the CPU itself and how it performs. As mentioned in those previous articles, the next generation consoles put an emphasis on parallel computing. The AMD Jaguar’s CPU architecture brings a host of benefits, including familiar instruction sets.
Basic Hardware Analysis of the Xbox 720’s CPU
The Durango’s CPU is similar to the Playstation 4’s: two modules, each containing four X86 cores. Unlike the Xbox 360, whose three cores handle 2 threads each (giving 6 threads total), the Jaguar inside the Xbox 720 handles only a single thread on each of its 8 cores (giving 8 threads total). Each of the Jaguar’s cores contains 32KB of instruction cache and 32KB of data cache. There is 2MB of level 2 cache per module, shared between the four cores of that module. This is different to the AMD Jaguar’s predecessor, which had 512KB of cache per core, but that cache wasn’t shared. The cores communicate with this cache via the Level 2 Cache Interface. Each CPU module uses the Core Communication Interface (CCI) to speak to its brother, and the CCI is also used, along with the North Bridge, to speak to the other components of the system (for example, the RAM).
As we’ve just covered, there is 4MB of level 2 cache in total, split into two 2MB chunks (one per module of 4 cores). If the CPU requests data that isn’t actually in the module’s own L2 cache (called an L2 miss), it will then go ahead and check the other module’s cache and the level 1 caches of the cores too. There is no level 3 cache on the AMD Jaguar powering the Durango.
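To make that lookup order a bit more concrete, here is a toy sketch in Python. It is only a model of the sequence described above, not how the hardware is built: the `Module` class, the `lookup` function and the returned strings are all made up for illustration, and I’m assuming that “the cores” means the cores of the other module.

```python
# Toy model of the L2 miss path described above. All names are hypothetical.

class Module:
    def __init__(self, name):
        self.name = name
        self.l2 = set()                       # addresses held in this module's shared L2
        self.l1 = [set() for _ in range(4)]   # one L1 data cache per core

    def holds(self, addr):
        """True if the address is in this module's L2 or in any of its L1 caches."""
        return addr in self.l2 or any(addr in l1 for l1 in self.l1)


def lookup(addr, local, remote):
    """Return where a request issued from the `local` module would be satisfied."""
    if addr in local.l2:
        return "local L2 hit"
    if remote.holds(addr):                    # L2 miss: check the other module
        return "served from other module"
    return "main memory"                      # no L3 to fall back on


module_a, module_b = Module("A"), Module("B")
module_a.l2.add(0x100)
module_b.l1[2].add(0x300)
print(lookup(0x100, module_a, module_b))      # local L2 hit
print(lookup(0x300, module_a, module_b))      # served from other module
print(lookup(0x400, module_a, module_b))      # main memory
```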
The CPU itself, as we’ve mentioned already, is an X64 design, which should be familiar to those of you with desktop PCs. The design is pretty much the same, although the Jaguar is a lower power solution. The purpose here is low heat, low power requirement, and as high performance as possible. Part of the reason for this is simply the size of the APU. There are other factors to consider too, such as the price of the console. X64 is an extension of 32-bit X86. It is a Complex Instruction Set Computer (CISC) design, and it can still handle legacy 16-bit code too; indeed some of the earliest code is still ‘around’ from the days of the very early Intel CPUs.
Intel and AMD have introduced various instructions into the CPU architecture through the years. Some, like MMX and 3DNow!, are no longer used as they once were, and more recent instructions have taken their place. X64 requires SSE2 as a bare minimum, and the AMD Jaguar for the Durango features a number of other instructions on top of that:
- SIMD/vector instructions: SSE up to SSE4.2 (including SSSE3 for packing and SSE4a), and AVX
- F16C: half-precision float conversion
- BMI: bit shifting and manipulation
- AES+PCLMULQDQ: cryptographic function support
- XSAVE: extended processor state save
- MOVBE: byte swapping/permutation
- VEX prefixing: Permits use of 256-bit operands in support of AVX instructions
- LOCK prefix: modifies selected integer instructions to be system-wide atomic
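If a couple of those entries feel abstract, the short Python sketch below mimics what two of them compute. It does not execute the actual instructions; it only reproduces the results in software, and the function names are my own. F16C converts between single and half precision floats, and MOVBE reverses the byte order of a value as it is loaded or stored.

```python
import struct

def to_half_and_back(x):
    """Round-trip a float through IEEE 754 half precision (the format F16C converts to and from)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def byte_swap32(x):
    """Reverse the byte order of a 32-bit value (the effect MOVBE has on a load or store)."""
    return int.from_bytes(x.to_bytes(4, "big"), "little")

print(to_half_and_back(1.5))          # 1.5 survives, it is exactly representable
print(to_half_and_back(0.1) == 0.1)   # False, half precision loses some accuracy
print(hex(byte_swap32(0x11223344)))   # 0x44332211
```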
Xbox 360 CPU vs Xbox 720 Durango CPU Performance
There are some who’ve been confused by the performance of the Durango’s CPU, mostly because of the clock speed. Why did Microsoft and AMD choose to run the cores at only 1.6GHz, when the Xbox 360 runs at double that speed? It’s easy to do some (wrong) math like this: 3 cores at 2 threads each = 6 threads. Multiply that by 3.2GHz and you get 19,200MHz. Then take 1.6GHz, multiply it by 8 (cores) and you come up with 12,800MHz. But that’s not how CPUs work in reality.
If you want a very simple example of this, take a look at Intel’s Pentium 4 CPU. A 3GHz Pentium 4 would get killed in performance by a modern CPU running at the same clock speed. You might argue that a P4 has fewer cores, and you’d be right. But the same holds even for single threaded performance: run an application on a single core of each CPU at the same clock speed, and the modern CPU still wins. That’s simply because of improvements in the core and the overall design of the CPU.
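For anyone who wants to see the naive sum written out, here it is as a few lines of Python. The only inputs are the thread counts and clock speeds quoted above; the function name is mine. The last line works out how much more work per clock each Durango thread would need to do just to draw level on this (flawed) measure.

```python
def naive_mhz(hardware_threads, clock_mhz):
    """The tempting (and wrong) sum: hardware threads multiplied by clock speed."""
    return hardware_threads * clock_mhz

xbox_360 = naive_mhz(6, 3200)   # 3 cores x 2 threads at 3.2GHz
durango = naive_mhz(8, 1600)    # 8 cores x 1 thread at 1.6GHz
print(xbox_360, durango)        # 19200 12800

# Per-clock advantage each Durango thread would need to match the naive 360 figure.
break_even = xbox_360 / durango
print(break_even)               # 1.5
```

In other words, even by this crude yardstick the gap closes once each thread gets about 1.5 times as much done per clock, and per-clock work is exactly what changes between CPU generations.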
Cache you Later, Xbox 360
Cache performance for the Durango’s Jaguar is improved over the previous Xbox 360 CPU. The CPU uses 64-byte cache lines. This is smaller than the Xbox 360’s (which uses 128-byte lines), but it is designed to be a lot easier to use and less likely to waste precious bandwidth inside the Durango. The L1 caches inside the Xbox 720’s CPU are the same size as the Xbox 360’s, but there is a major difference: this L1 cache isn’t shared between two ‘hyper threads’. This means there’s not only more available to each thread, but it can be associated better.
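Why would a smaller cache line waste less bandwidth? A rough way to picture it is to count how many bytes get dragged in from memory when code touches small, scattered pieces of data. The Python sketch below does just that. It is a simplified model (every touched line is fetched exactly once, nothing is evicted) and the workload is hypothetical, not taken from a real game.

```python
def bytes_transferred(addresses, access_size, line_size):
    """Bytes moved from memory if every touched cache line is fetched exactly once."""
    lines = set()
    for addr in addresses:
        first = addr // line_size
        last = (addr + access_size - 1) // line_size
        lines.update(range(first, last + 1))
    return len(lines) * line_size

# Hypothetical workload: 1000 scattered 4-byte reads, each 4096 bytes apart.
scattered = [i * 4096 for i in range(1000)]
print(bytes_transferred(scattered, 4, 64))    # 64000
print(bytes_transferred(scattered, 4, 128))   # 128000
```

Only 4,000 bytes are actually wanted in both cases, so for this access pattern the 128-byte lines move twice as much unused data as the 64-byte ones. For neatly sequential data the difference mostly disappears.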
Level 2 cache meanwhile is, for all intents and purposes, 3 times larger than that of the Xbox 360. Remember that the Xbox 720 Durango has 2MB of level 2 cache per module (or 512KB per core, effectively). That’s a far cry from the 170KB-ish figure the Xbox 360 can muster per hardware thread. There are improvements to the way the bandwidth is used too, compared to the old Xbox.
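The “3 times” figure is easy to check with a quick sum, sketched here in Python. One assumption to flag: the 1MB total for the Xbox 360’s shared L2 isn’t stated above, but it is what the 170KB-ish per-thread figure works out to across 6 hardware threads.

```python
XBOX_360_L2_KB = 1024            # assumption: 1MB shared L2 (matches ~170KB x 6 threads)
XBOX_360_THREADS = 6             # 3 cores x 2 threads
DURANGO_L2_PER_MODULE_KB = 2048  # 2MB per module
DURANGO_CORES_PER_MODULE = 4     # one thread per core

per_thread_360 = XBOX_360_L2_KB / XBOX_360_THREADS
per_core_durango = DURANGO_L2_PER_MODULE_KB / DURANGO_CORES_PER_MODULE
ratio = per_core_durango / per_thread_360

print(f"{per_thread_360:.1f}KB per Xbox 360 thread")   # 170.7KB per Xbox 360 thread
print(f"{per_core_durango:.1f}KB per Durango core")    # 512.0KB per Durango core
print(f"{ratio:.1f}x")                                 # 3.0x
```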
Speculative Execution – the good kind
No, we are not talking about innocent data that is about to be put to the magnet death. Instead, we are talking about predicting the flow of the program and requesting data ahead of time. When it hits a conditional branch, a CPU can simply stall until the conditional is determined, which means lower performance. The CPU would simply be waiting, trying to figure out what it has to do next and not actually doing anything of ‘value’.
The Durango has a trick up its sleeve for this: it fetches ahead of time, which reduces the wait times where the CPU is just hanging about, effectively twiddling its fingers. It fetches ahead and predicts through multiple conditional branches, holding multiple basic blocks within its own buffer. It basically figures out where these ‘branches’ are headed and predicts what it needs to grab hold of next, so that when the guess is right it already has a valid result. Okay, sounds good, but what if it guesses wrongly? In that case the results are ignored, and the CPU fetches from the correct address location instead.
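To give a feel for how a CPU can ‘guess’ a branch at all, here is a small Python simulation of a classic textbook scheme, the 2-bit saturating counter. To be clear, this is not the Jaguar’s actual predictor (which isn’t described here); it is a generic technique picked for illustration, and the loop pattern fed into it is hypothetical.

```python
def simulate_two_bit_predictor(outcomes):
    """Count correct predictions for one branch using a 2-bit saturating counter.

    Counter values 0 and 1 predict 'not taken'; 2 and 3 predict 'taken'.
    """
    counter = 2   # start in the 'weakly taken' state
    correct = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken == taken:
            correct += 1
        # Nudge the counter towards the real outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct

# Hypothetical loop branch: taken 9 times, then falls through once, repeated 10 times.
loop_branch = ([True] * 9 + [False]) * 10
hits = simulate_two_bit_predictor(loop_branch)
print(hits, "of", len(loop_branch))   # 90 of 100
```

Every correct guess is a stall avoided, and the single miss per loop exit is the case where the speculative results get thrown away.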
Out of Order Execution
The Xbox 360 CPU cores execute in order (also called program order), with instructions following the exact order the compiler laid them out. Therefore, there is no room for the Xbox 360’s CPU to anticipate or try to prevent stalls. This means that, effectively, the system is reliant on how effective the compiler is, and there is no compiler on earth that can be perfect and eliminate all of the possible issues that can occur in the CPU pipeline.
In rather stark contrast lies the Durango CPU. These cores execute fully out of order (known as OOO, and also called data order). While executing a sequence of instructions, the processor is able to re-order the micro-operations (not the x64 instructions) via an internal 64-entry re-order buffer (ROB). This improves performance by:
- Starting loads and stores as early as possible to avoid stalls.
- Executing instructions in data-dependency order.
- Fetching instructions from branch destination as soon as the branch address is resolved.
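The difference between the two approaches can be sketched with a toy scheduler in Python. This is a heavy simplification (one instruction issued per cycle, no branches, no register renaming), the instruction names and latencies are invented for the example, and the `window` parameter is only a stand-in for the 64-entry re-order buffer mentioned above.

```python
def run_in_order(program):
    """Issue one instruction per cycle in program order, stalling until operands are ready."""
    done = {}    # instruction name -> cycle at which its result is available
    cycle = 0
    for name, deps, latency in program:
        start = max([cycle] + [done[d] for d in deps])
        done[name] = start + latency
        cycle = start + 1
    return max(done.values())


def run_out_of_order(program, window=64):
    """Each cycle, issue the oldest instruction in the window whose operands are ready."""
    done = {}
    pending = list(program)
    cycle = 0
    while pending:
        for entry in pending[:window]:
            name, deps, latency = entry
            if all(d in done and done[d] <= cycle for d in deps):
                done[name] = cycle + latency
                pending.remove(entry)
                break
        cycle += 1
    return max(done.values())


# Hypothetical program: two slow loads (10 cycles each), each followed by a use.
program = [
    ("load_a", [], 10),
    ("use_a", ["load_a"], 1),
    ("load_b", [], 10),
    ("use_b", ["load_b"], 1),
]
print(run_in_order(program))       # 22
print(run_out_of_order(program))   # 12
```

The in-order version sits idle waiting on `load_a` before it even starts `load_b`, while the out-of-order version kicks off the second load straight away, which is the first bullet point above in action.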
It’s worth noting that there will be a number of other benefits that aren’t so easy to quantify. These are mostly to do with how familiar developers are with the X86/X64 architecture, which will allow them to ‘hit the ground running’. That said, the Xbox 360’s CPU was certainly easier to get the most out of than the Playstation 3’s Cell. The PS3 required full use of the SPUs (which, by the way, could also be used to perform some of the duties that the Xbox 720 Durango’s Move Engines do).