Given Nvidia’s decision before Gamescom to show off ray tracing on the Volta architecture, it was clear that the company were betting big on the new technology. During the show, Nvidia CEO Jensen Huang was keen to stress that Turing was the biggest leap in GPUs since their Tesla architecture, which debuted back in late 2006 and, for PC gamers at least, was perhaps best known as the heart of the GeForce 8 series of graphics cards.
The GeForce 8 series turned graphics card design on its head – no longer were there separate vertex and pixel shaders. Instead, there were ‘Unified Shaders’ designed to accomplish a myriad of tasks, dictated by the programmers and the game engine. Render workloads can change on a dime, but now neither Nvidia’s own designers nor game developers had to worry about how each frame split its time between vertex work and pure pixel shading – the GPU could allocate its shaders as necessary. It also opened the door to GPU computing and CUDA – running a wide variety of tasks on the GPU itself.
GPUs are great at parallel computation – think of a modern CPU. How many processor cores do you have right now? Are you the owner of a Ryzen 7 CPU? If so, that’s 8 CPU cores and 16 processor threads. Perhaps you even splashed the cash on a Ryzen Threadripper 2990WX, and just seeing its 32 cores and 64 threads show up in Windows Task Manager provides you profound glee. Well, imagine having about 100x more processors than that. That’s what a modern-day GPU gives you. Sure, its shaders aren’t as good at complex decision-making as a CPU core, but when you’re looking to simply crunch through data, there’s little that can stand up to a modern-day GPU.
But with Turing there are two distinct new components inside the GPU – the first is the Tensor Cores, and the second is the Ray Tracing (RT) Cores. In a nutshell, the Tensor Cores are there to run AI and neural networks, designed to process vast amounts of AI calculations. The RT Cores are there to accelerate ray tracing, by figuring out where rays of light cast into a scene would intersect with what’s visible at an on-screen pixel.
Ultimately, these two new components power many of the much-talked-about features inside Nvidia’s new cards (a technology suite known as RTX). RTX isn’t just ray tracing, but also technology such as DLSS (Deep Learning Super Sampling). For months there was conjecture over whether we’d see RT Cores inside Nvidia’s GeForce 20 cards – and an even bigger question was whether we’d see Tensor Cores. Tensor Cores had been the preserve of GPUs such as Volta – cards worth several thousand dollars (the cheapest ‘consumer’ card was the Titan V, which retailed for a bargain price of 3K USD).
So let’s start out with the very basics of ray tracing; I’ve already done a video on this (which you can check out in the video description), but this video also covers things from a slightly different angle (so to speak).
Nvidia aren’t looking to replace traditional rasterized rendering anytime soon; to paraphrase Jensen Huang during his Gamescom presentation – GPUs are really good at it. You can have a lot of ‘stuff’ going on in parallel and it’s fast. While everything looks 3D to us, because everything has perspective and depth, in reality the scene is rendered and then converted into a 2D image to be displayed on our screens. There’s a lot of stuff which goes on (and this isn’t a video going into how that happens), but just know that the GPU draws objects on screen using geometry, figures out what objects are being clipped by others, discards anything that is hidden, and then applies various textures and post-processing effects to render the game; finally, after the image is constructed, it is sent to the display.
When targeting a frame rate of 30 FPS, the GPU needs to average 33.33 ms per frame (it has to do this 30 times per second). If you’re targeting 60 FPS, the GPU has half the time to accomplish the same task – 16.67 ms – and so on. Higher resolutions increase the number of pixels which need to be processed (for example, going from 1080P to 1440P means about 1.78x the number of pixels for the GPU to render). This increases the work on the GPU shaders, the amount of data being held in VRAM, the amount of data being shunted around the card and so on. So larger numbers of pixels, and more detailed pixels (i.e., increasing the texture resolution or other levels of detail), therefore impact frame rate, as you’re asking the GPU to do more ‘stuff’ while trying to maintain a stable frame rate. The more complex a scene, the more taxing it is – so a forest scene is gonna eat up more GPU power than if you’re playing the same game but with your face pressed up against a wall. If you want more information on this, we put out a rather in-depth analysis back around the time of the release of the then next-generation consoles.
| Resolution | Total pixel count | Size difference vs the previous resolution |
| --- | --- | --- |
| 1280 x 720 | 921,600 pixels | |
| 1408 x 792 | 1,115,136 pixels | 1.21x more pixels than 720P |
| 1600 x 900 | 1,440,000 pixels | 1.29x more pixels than 792P |
| 1920 x 1080 | 2,073,600 pixels | 1.44x more pixels than 900P |
| 2560 x 1440 | 3,686,400 pixels | 1.78x more pixels than 1080P |
| 3840 x 2160 | 8,294,400 pixels | 2.25x more pixels than 1440P; 4x more than 1080P! |
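To put rough numbers on the frame-time budget and the pixel scaling in the table above, here’s a trivial back-of-the-envelope calculation (the script is purely illustrative – it’s just arithmetic, not anything from Nvidia):

```python
# Frame-time budget: how long the GPU has to produce each frame
for fps in (30, 60, 144):
    print(f"{fps} FPS -> {1000 / fps:.2f} ms per frame")
# 30 FPS -> 33.33 ms, 60 FPS -> 16.67 ms, 144 FPS -> 6.94 ms

# Pixel-count scaling between common resolutions
resolutions = {"1080P": (1920, 1080), "1440P": (2560, 1440), "4K": (3840, 2160)}
pixels = {name: w * h for name, (w, h) in resolutions.items()}
print(f"1440P is {pixels['1440P'] / pixels['1080P']:.2f}x the pixels of 1080P")  # 1.78x
print(f"4K is {pixels['4K'] / pixels['1080P']:.2f}x the pixels of 1080P")        # 4.00x
```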
Okay – great, now I’ve told you the basics of how a GPU ‘used’ to work. So how does ray tracing (or specifically Nvidia’s RTX technology) come into this? Well, remember how I just told you that rasterized images are essentially 2D, that the 3D world is projected into them, and that ‘things’ the camera can’t see are thrown away or simply not rendered? Okay, great.
But in the ‘real world’ that’s not how physics and light work. If you can’t see an object directly (because it’s behind you) but you’re facing a mirror or another reflective surface, you can still see that object. In a game world, that’s where things get really tricky – Microsoft demonstrated exactly this in its GDC ray tracing demo back in 2018, with a ship and SSR (Screen Space Reflections).
You have seen reflections before in games, of course, and modern-day reflections are likely using a technique like SSR. It takes what is already rendered in a scene and reflects it back onto whatever surface needs it. That’s great – IF that object is within the field of view of your camera. So with the example here, Microsoft used a ship and pointed out the sails would indeed reflect just fine in the water… but anything not in the ‘cone’ of the camera’s view wouldn’t. And that’s not right – the flag is just totally missing from the reflection.
Ray tracing is the act of trying to calculate how light (and therefore shadows) would behave in the real world, by casting thousands of rays into a scene and then figuring out ‘where’ and ‘when’ a pixel would be interacted with by a given ray of light. The problem with this approach (and you’ll know this if you’ve ever run certain PC benchmarks) is that this technique is very expensive in terms of time. Sure, if you’re rendering a movie for Hollywood, or even creating a 3D animation in your own home, it doesn’t matter if it takes a minute to produce a single frame – but to a gamer… well, yeah.
https://www.youtube.com/watch?v=KJRZTkttgLw
So that’s what the Ray Tracing Cores ‘do’ inside the Turing architecture. Let’s take the TU102 GPU as an example – oh, and this is the full ‘fat’ chip too, not the slightly watered-down GPU that’s found inside the GeForce RTX 2080 Ti.
You’ll see there are 4,608 CUDA cores, 72 SMs (Streaming Multiprocessors) and finally 72 Ray Tracing Cores (we’ll get to the Tensor Cores in a few). That means each SM contains 64 CUDA cores (those SMs have their own caches and so on – we’ve done an article on that if you want more info) and one RT Core per SM too. For those who aren’t going to be purchasing the full-blown TU102 GPU (in other words, say, a Quadro RTX 8000) and are instead picking up the GeForce RTX 2080 Ti, you’ll see a cut down to 68 SMs and 68 RT Cores.
According to the official specs from Nvidia, the RT Cores of the TU102 are capable of throwing out 10 Giga Rays per second. If you look at the on-screen presentation from Jensen Huang, we see a lot of info regarding the GPU.
Keep our focus on the RT Core, though: there’s RTI (Ray Triangle Intersection) and BVH (Bounding Volume Hierarchy). Starting out with RTI, which is based on the Möller–Trumbore intersection algorithm – a method of calculating whether, and where, a ray of light intersects a triangle of a mesh in 3D space.
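For the curious, here’s a minimal sketch of the Möller–Trumbore test in Python. The function name and example values are our own illustration – on Turing this runs as fixed-function hardware, not code you write – but the maths is the published algorithm:

```python
import numpy as np

def ray_triangle_intersect(origin, direction, v0, v1, v2, eps=1e-8):
    """Möller–Trumbore: return the hit distance t along the ray, or None for a miss."""
    edge1, edge2 = v1 - v0, v2 - v0
    pvec = np.cross(direction, edge2)
    det = np.dot(edge1, pvec)
    if abs(det) < eps:                 # ray is parallel to the triangle's plane
        return None
    inv_det = 1.0 / det
    tvec = origin - v0
    u = np.dot(tvec, pvec) * inv_det
    if u < 0.0 or u > 1.0:             # misses the triangle (barycentric u out of range)
        return None
    qvec = np.cross(tvec, edge1)
    v = np.dot(direction, qvec) * inv_det
    if v < 0.0 or u + v > 1.0:         # misses the triangle (barycentric v out of range)
        return None
    t = np.dot(edge2, qvec) * inv_det
    return t if t > eps else None      # distance to the hit, if it's in front of the ray

# A ray fired straight down the Z axis at a triangle sitting in the Z = 0 plane
hit = ray_triangle_intersect(np.array([0.2, 0.2, 5.0]), np.array([0.0, 0.0, -1.0]),
                             np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]),
                             np.array([0.0, 1.0, 0.0]))
print(hit)  # ~5.0 – the triangle is hit 5 units away
```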
This is combined with the Bounding Volume Hierarchy. You can think of this as a ‘box’ within a box, with the boxes forming a ‘tree’. These ‘boxes’ are a way to contain certain objects within a scene. If you have 3 boxes (for example) and you shoot a ray of light at the scene, but only box 2 says ‘yep, that’s me’, you can discard the other two boxes. Likewise, that box will contain further sub-boxes, and the test repeats until you reach the small patch of geometry where the ray actually lands, at which point RTI can begin.
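Here’s a minimal sketch of that traversal, assuming a hand-rolled node structure with axis-aligned bounding boxes (again, purely illustrative – the real hierarchies and the hardware’s fixed-function traversal are far more sophisticated):

```python
from dataclasses import dataclass, field

@dataclass
class BVHNode:
    box_min: tuple                                  # one corner of the axis-aligned bounding box
    box_max: tuple                                  # the opposite corner
    children: list = field(default_factory=list)    # sub-boxes inside this box
    triangles: list = field(default_factory=list)   # only leaf boxes hold actual geometry

def ray_hits_box(origin, direction, node, eps=1e-8):
    """Standard 'slab' test: does the ray pass through this bounding box at all?"""
    t_near, t_far = -float("inf"), float("inf")
    for axis in range(3):
        if abs(direction[axis]) < eps:              # ray runs parallel to this pair of faces
            if not (node.box_min[axis] <= origin[axis] <= node.box_max[axis]):
                return False
            continue
        t0 = (node.box_min[axis] - origin[axis]) / direction[axis]
        t1 = (node.box_max[axis] - origin[axis]) / direction[axis]
        t_near, t_far = max(t_near, min(t0, t1)), min(t_far, max(t0, t1))
    return t_near <= t_far and t_far >= 0

def traverse(origin, direction, node):
    """Walk the tree, discarding whole boxes (and everything inside them) the ray never touches."""
    if not ray_hits_box(origin, direction, node):
        return []                                   # 'not me' – skip this entire branch
    if not node.children:
        return node.triangles                       # leaf: hand these triangles to the RTI test
    hits = []
    for child in node.children:
        hits.extend(traverse(origin, direction, child))
    return hits
```

Only the triangles that survive the traversal get handed to the ray-triangle test above, which is the entire point – the vast majority of the scene’s geometry is rejected a whole box at a time.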
Although details haven’t been fully released for Turing yet, it’s highly likely that when the GPU does this, it can then use logic to know that additional rays will follow the same path, and that similar calculations can be performed. According to developers, the RT Cores are largely programmable, with the BVH and triangle intersection being fixed-function. When the RT Cores ‘have the results’, the appropriate workloads are created as warps which then run on the general-purpose CUDA cores of the RTX graphics cards.
In a nutshell, the RT Cores are pretty much a dedicated pipeline which calculates the ray and triangle intersections and feeds that information to the rest of the GPU. While this might change, there’s a lot of discussion right now regarding the performance of ray tracing in games (Shadow of the Tomb Raider, Battlefield…) and how developers are targeting 60 FPS at 1080P.
The fact of the matter is, this technology is still in its infancy when we’re talking about real-time graphics. Movie render times are generally measured in hours for a single frame of animation, so the very fact we’re seeing Turing push games at real-time performance levels is impressive. Doubtless this is something that will improve over time.
Nvidia have also added Tensor Cores with the Turing architecture, and one of the more surprising announcements of the show was that they remain largely intact in all of the currently announced GeForce RTX 20 SKUs. The 576 Tensor Cores of the Quadro RTX 8000 received a slight cut, down to 544 in the GeForce RTX 2080 Ti, but still. It’s a clear demonstration that Nvidia are planning to do a lot of work with deep learning and neural networks on these cards.
This cut reduces performance from the Quadro RTX 8000 and RTX 6000’s 125 TFLOPS FP16, 250 TOPS INT8 and 500 TOPS INT4 down to 110 TFLOPS FP16, 220 TOPS INT8 and 440 TOPS INT4 on the RTX 2080 Ti.
One of the first areas where we’re seeing this is DLSS (Deep Learning Super Sampling), which Nvidia demoed at Gamescom. Nvidia have also shown off the now-notorious benchmark claiming that the RTX 2080 will put out about double the performance of its predecessor, the GTX 1080, when DLSS is being used. Without DLSS, Nvidia are currently claiming we’ll see about a 50 percent improvement in games.
So, what is DLSS then? Well, Deep Learning Super Sampling leverages the performance of Nvidia’s Tensor Cores to run a neural network that improves image quality using lower-resolution samples. Nvidia have been doing a lot of work on denoising and upsampling images over the past few years, so it isn’t totally surprising to see that work make its way into their gaming products.
Deep learning and neural networks are a pretty complex topic (and yes, I’m making the understatement of the century), so the finer points of their inner workings aren’t something I’m going to tackle here. But know that neural networks operate in one of two modes: training or inference.
Training is how the AI actually ‘learns’ to do a task – it does so by churning through a large set of data and figuring out how something should be, what it should look like, what pattern it’s looking for, and so on. So let’s say you show it 10,000 photos of cats: the AI will get really good at saying “okay, so these are the characteristics of a cat” and will start to recognize different breeds, shapes, sizes and colors. And of course, if it gets it wrong you give it a “NO!” and it continues on and on. You can read more about this at Nvidia’s official page here.
But eventually, if you show it a cat sitting next to a dog, or a cat sitting on a sofa, the neural network will no longer mistake your recliner for your cat, Spot, and you’re good.
Training requires an awful lot of compute power – an awful lot. High-performance supercomputers crunch through the data far faster, and of course you’re dealing with huge amounts of data – lots of RAM, lots of processing power.
But once you’ve trained the network, you can leverage it on smaller devices by means of inference. And that is really what DLSS is doing: using the Tensor Cores of your home Turing card to run a network that was trained on the big supercomputers at Nvidia.
Unfortunately, some of the finer details of how DLSS works have yet to be confirmed. But it appears that the GPU renders a frame of animation at a lower resolution (for example, 1080P), then the Tensor Cores run a neural network which upsamples that frame to a higher resolution.
We can assume that this isn’t done as a simple post-process (where the CUDA cores and other parts of the GPU finish rendering the frame of animation, and only then do the Tensor Cores get hold of it); more likely, these events are being done in parallel.
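As a purely conceptual sketch of that assumed pipeline – none of the names below are real Nvidia APIs, and the ‘model’ is a naive stand-in for whatever network actually ships in the drivers:

```python
import numpy as np

def render_frame(height, width):
    """Stand-in for the normal rasterization work done on the CUDA cores."""
    return np.zeros((height, width, 3), dtype=np.float32)   # an RGB frame

def upscale_model(frame, target_height, target_width):
    """Stand-in for the trained network run on the Tensor Cores.
    Here it's just a nearest-neighbour resize; the real network is meant to
    infer plausible detail rather than simply repeating existing pixels."""
    src_h, src_w, _ = frame.shape
    rows = np.arange(target_height) * src_h // target_height
    cols = np.arange(target_width) * src_w // target_width
    return frame[rows][:, cols]

# Render at 1080P, then 'infer' up to 4K before the frame is presented
low_res = render_frame(1080, 1920)
output = upscale_model(low_res, 2160, 3840)
print(output.shape)   # (2160, 3840, 3)
```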
This does mean that Nvidia will need to ‘train’ the neural network specifically for each game. There are also a lot of questions about its performance, and how it will be impacted by other tasks which use the Tensor Cores.
When upsampling an image with traditional techniques (so if you were to blow up an image that’s natively 1080P to a much higher resolution), you start introducing ‘noise’ into the picture. This is because you’re not really ‘adding’ additional detail; you’re simply increasing the pixel count of what’s already there, so noise and imperfections in the original image get magnified (plus other issues). Nvidia’s technology, though, can identify the problems created by that ‘noise’ and realize it shouldn’t be there, because it has been trained on both noiseless and noisy data sets.
We can therefore assume that Nvidia has likely trained the AI in the drivers at ultra-high resolutions (say, 4K or higher) and then tested it against lower resolutions (say, 1080P). So the AI becomes good at understanding the subtle nuances of a game’s artistic styling, how character models are supposed to appear, and so on.
According to what we’ve seen of the Unreal Engine 4 Infiltrator demo (which first came to light back in 2013, just as the then-new generation of consoles was released – ironically, a big deal was made at the time because the demo was missing the advanced SVOGI lighting technique; the PS4 and Xbox One weren’t capable of running it, so Epic removed it from UE4), performance essentially doubles: from an average of 35 FPS on an Nvidia GeForce GTX 1080 Ti to about 70 FPS on the RTX 2080 Ti. Impressive indeed, and of course it feeds into Nvidia’s claims of double the performance of Pascal – if DLSS is being used.
You might also look at the ‘die shots’ Nvidia showed off during Gamescom as they attempted to illustrate how data is passed around the GPU, and you would be forgiven for thinking “well, this part has the RT Cores, the Tensor Cores go here” – but this isn’t true. Below is a single Volta SM and what it contains:
64 FP32 cores
64 INT32 cores
32 FP64 cores
8 Tensor Cores
4 texture units
It wasn’t fully confirmed that these details would carry over from Volta, but since the slide from the press deck leaked early, we now have confirmation that the number of Tensor Cores is 544 – so if we take 544 and divide it by 68 (the number of SMs we know make up the RTX 2080 Ti’s 4,352 CUDA cores) we get 8 Tensor Cores per SM, and we can therefore conclude that the Turing layout is pretty darn similar to that of Volta. That includes confirmation from Nvidia themselves that, just like Volta, Turing has separate FP32 and integer cores.
So rather than thinking of the Tensor Cores, RT Cores and CUDA cores as separate ‘bits’ on the GPU, instead understand that each SM contains all of these components. Given this knowledge, we can make the following deductions for the RTX 2080 Ti (per each of the 68 SMs – there’s a quick sanity check of the math after the list):
64 FP32 cores
64 INT32 cores
1 RT Core
8 Tensor Cores
4 texture units
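And that back-of-the-envelope math checks out against the headline specs quoted above (the little script is just our own arithmetic, not anything official):

```python
# Per-SM breakdown we've deduced for the RTX 2080 Ti, mirroring the Volta SM layout
per_sm = {"FP32 (CUDA) cores": 64, "INT32 cores": 64, "RT Cores": 1, "Tensor Cores": 8}
sm_count = 68   # SMs enabled on the GeForce RTX 2080 Ti

for unit, count in per_sm.items():
    print(f"{unit}: {count} x {sm_count} = {count * sm_count}")
# FP32 (CUDA) cores: 64 x 68 = 4352  <- matches the 4,352 CUDA cores
# Tensor Cores: 8 x 68 = 544         <- matches the 544 Tensor Cores from the leaked slide
```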
We’ll do a deeper dive into the actual architecture of the SMs and other components of Turing soon, but from what we can gather from the leaks and available information, Turing is a tweaked and improved version of Volta. Concessions have been made on the cache system (particularly L1 cache and shared data cache sizes), but we still get the major cache improvements of Volta. We also see the larger memory bandwidth compared to Pascal, the separation of FP and INT cores in the SMs – and so on.
Well, hopefully you’ve found this rather informative – do stick with us and we’ll continue to delve into the Turing architecture, and of course provide benchmarks when it’s released.