With the advent of the new Kepler architecture, and the release of some details about the changes Nvidia's architects have made in this iteration, I realized how limited my knowledge is of the details of the earlier architectures, be it Tesla or Fermi.
The programming guide and the other guides describe the CUDA programming model well, and they do touch on the architecture at some points, but apparently not often enough and clearly not at the lower levels.
I know some details are Nvidia secrets, but I assume most of those concern manufacturing techniques (I could be horribly wrong, and I have the feeling I am). In any case, some architecture information is open to the public, or at least it is now.
So, the question: where can I find a reliable (official?) detailed description of the Tesla and Fermi architectures, the differences between the models (say GF100, GF104, GF110, GF114), their relation to compute capability, and the differences between compute-oriented and gaming-oriented cards?
I know there's the Fermi whitepaper, but it raised more questions than it answered. Thanks.
I mostly want to know the specific hardware differences of each update in the GF1xx family, some of which are used in GeForce cards and some in Tesla cards. For example, the GTX 480 has a GF100 and the Tesla C2050 also has a GF100 GPU, so are these two cards the same? And again, what changed between GF100 and GF104 (the GTX 460), and how does that affect the double precision capabilities of these cards? Speaking of which, how exactly is double precision done in the GPU, and how did it evolve from compute capability 1.3 to 2.0 and 2.1?
There's also the issue of how warps are actually mapped to CUDA cores, which is not entirely clear to me and seems to change quite often. The 1.x cards had 8 cores per SM, so perhaps a quarter of a warp was mapped to the cores at any time; then with 2.0 the SM had 32 cores, but it appears that, with 2 warp schedulers, half a warp is assigned to the cores at a time, which also makes some sense for 2.1, where the SMs have 48 cores and can handle 3 half-warps. This picture is still not clear to me, and I am not sure how it extends to the Kepler SMX, where it looks like an entire warp can now be mapped to cores simultaneously.
I would also like to know how many clock cycles each type of instruction takes, how integer and floating point operations are done simultaneously (because that appears to be the case), and how things are pipelined (if they are…).
I also have questions about the way the cache works: what happens between global memory and the L2 cache, and between the L2 and the L1.
These small things, although not strictly necessary for writing CUDA programs, are I think quite important for making design decisions, if nothing else just to feel comfortable with some level of knowledge about the hardware the code is going to run on; but I really do think it helps with the code's compatibility and flexibility, if not also its efficiency.
So please, if you know of any reliable source that could provide explanations for any of the above points, let me know.
OK. I see you’re curious about GPU architecture like me. (I double majored in computer science & computer architecture).
I believe they're the same chip. They are both capable of the 1/2 double precision rate (GeForce double precision is capped) and, I think, both have the ECC capability; I can't confirm that, but my reasoning is that they both sucked large amounts of power. The Tesla version has 2 DMA copy engines; I actually used a C2050 board, and this was useful when you wanted to read back large amounts of data to the CPU instead of just a single number.
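For example, something like this (a minimal sketch, with a made-up kernel and buffer names) lets the readback of the previous batch's results overlap the upload and processing of the next batch, as long as the host buffers are pinned:

    #include <cuda_runtime.h>

    // Placeholder kernel, just so the sketch is complete.
    __global__ void process(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];
    }

    // With two copy engines, the device-to-host copy in stream s1 can run
    // at the same time as the host-to-device copy and kernel in stream s0.
    // h_in and h_prev must be pinned (allocated with cudaMallocHost).
    void overlapped_step(const float *h_in, float *h_prev,
                         float *d_in, float *d_out, const float *d_prev,
                         int n, cudaStream_t s0, cudaStream_t s1)
    {
        size_t bytes = n * sizeof(float);
        cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, s0);
        process<<<(n + 255) / 256, 256, 0, s0>>>(d_in, d_out, n);
        cudaMemcpyAsync(h_prev, d_prev, bytes, cudaMemcpyDeviceToHost, s1);
        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);
    }

On a card with a single copy engine the two copies get serialized, so this trick only pays off on the Tesla parts.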
The GF104 is only used in GeForce cards, so NVIDIA is free to completely gut the double precision rate, which they did (look at the instruction throughput chart in the CUDA C Programming Guide: 1/12). They basically increased the number of cores per SM to amortize the SM's fixed costs better. They also changed the schedulers to allow 2 instructions to be issued from the same warp in 1 cycle (dual issue) to compensate for the higher cores/scheduler ratio. I don't see how knowing how the warps are mapped to the cores is useful; you only really care about the throughput, and NVIDIA likely does a good job of ensuring such architectural nuances have minimal effect on software performance.
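To put numbers on it (from memory of that table, so double-check the guide): compute capability 2.0 does 32 FP32 and 16 FP64 operations per SM per clock, a 1/2 ratio, while 2.1 does 48 FP32 and only 4 FP64, i.e. 4/48 = 1/12, which is where that 1/12 figure comes from.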
If you're talking about the throughput, it's listed in the CUDA C Programming Guide. The latencies are approximately 24 cycles (I believe Kepler reduced the latency significantly to reduce the number of threads needed, and hence improve energy efficiency). Instruction execution is definitely pipelined, because waiting 24 cycles per instruction would kill the throughput. On each cycle, a warp scheduler chooses a warp that's ready to execute; this could be the same warp it issued from before, or another warp. One detail I would like to know is whether operand forwarding is implemented. That is, if an instruction depends on the previous instruction, does it have to wait the full 24 cycles for the previous result to be written to the register file, or can it start as soon as possible? I'm guessing that, due to the latency-insensitive nature of GPUs and to keep things simple, operand forwarding isn't implemented.
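If you want to poke at the latency yourself, a rough microbenchmark like the sketch below works (names and constants are mine, nothing official): one warp runs a long chain of dependent FMAs, so the elapsed clocks divided by the chain length approximate the per-instruction latency, forwarding included if it exists:

    #include <cstdio>
    #include <cuda_runtime.h>

    #define N_ITER 1024   // length of the dependent chain

    // Each FMA uses the result of the previous one, so the chain exposes the
    // full read-after-write latency of the arithmetic pipeline.
    __global__ void dependent_chain(float *out, long long *cycles, float seed)
    {
        float x = seed;
        long long start = clock64();
        #pragma unroll 16
        for (int i = 0; i < N_ITER; ++i)
            x = __fmaf_rn(x, 1.000001f, 0.5f);
        long long stop = clock64();
        out[threadIdx.x] = x;   // keep the result live so the chain isn't optimized away
        if (threadIdx.x == 0)
            *cycles = stop - start;
    }

    int main()
    {
        float *d_out;
        long long *d_cycles, h_cycles;
        cudaMalloc(&d_out, 32 * sizeof(float));
        cudaMalloc(&d_cycles, sizeof(long long));
        dependent_chain<<<1, 32>>>(d_out, d_cycles, 1.0f);   // a single warp
        cudaMemcpy(&h_cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
        printf("~%.1f cycles per dependent FMA (includes loop overhead)\n",
               (double)h_cycles / N_ITER);
        cudaFree(d_out);
        cudaFree(d_cycles);
        return 0;
    }

Comparing this against a variant with several independent chains in the same warp would hint at how much of that latency the scheduler can hide within a single warp.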
I think you're talking about Fermi's multi-precision floating point units? I'm interested in those too. There was a publication describing such an approach; look up "A New Architecture For Multiple-Precision Floating-Point Multiply-Add Fused Unit Design." You basically decompose the extended-precision multiply into the sum of 2 partial products. They claim you can build multi-precision units at only a small premium (e.g. 15% extra area and delay), which seems like a huge area-efficiency win. GK104 doesn't have multi-precision units (its double math rate is 1/24 of single precision), but you can probably expect to see them on the GPGPU-oriented GK110.
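As a purely software aside (this is not how the hardware units are built, just an illustration of the same split-into-partial-products idea): with an FMA you can recover the exact rounding error of a product, which is the trick behind the "double-single" arithmetic people used to emulate extra precision on single-precision-only GPUs:

    // Software sketch of the partial-product idea (not the hardware design):
    // represent the exact product a*b as hi + lo, where hi is the rounded
    // product and lo is the error term recovered with a fused multiply-add.
    __device__ void two_prod(float a, float b, float &hi, float &lo)
    {
        hi = a * b;
        lo = __fmaf_rn(a, b, -hi);   // exact on hardware with FMA
    }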
edit: ah sorry, I just noticed you've already read these, but IMO this is all the info you need about CUDA cores.
GF104/114 is a cheaper design, boosted in texture fillrate, with at most 8 compute units (CUs). GK104 Kepler uses a similar approach, with the same 8 CUs, but they're beefed up so that in theory it equals the 16 CUs in GF110… In reality it's a bit of a different story: most of the time it's crippled in DirectCompute compared to the GTX 480/580 (15/16 CUs), except where single precision floating point kicks in.
GF100/GF110 are the high-end chips with 15/16 CUs; GF110 can do 4-texel FP16 filtering per clock instead of 2 like GF100, has a few other speedup tricks, and also a new power limiter.
Tesla/Quadro use the same GF110 chip with up to 16 CUs, at least in the current models.
I didn't see any Kepler editions yet, probably because there is no high-end GK110 Kepler yet.
Hey! This is all quite helpful, thank you 'i SPY' and 'Uncle Joe'. I will look into this information and try to reconcile the differing models of the GPU in my head.
As I said before, even if there is no (direct) benefit to knowing a couple of things about the GPU, it still feels kind of empowering to know what's going on under the hood. As for how the threads are mapped to the cores, I simply find it surprising that they could change the number of cores so easily and manage them in a way that completely hides the effects from the user; it's very interesting to me. For example, one of my CUDA programs ran faster in single precision on a GTX 460 than on a Tesla C2050. I can no longer reproduce the phenomenon, since the Tesla machine was updated to a newer CUDA version, but it made me want to know the reason for it, as it was not expected at all… Also, in my mental model the ld/st units seem too few to feed the cores, which could explain (although I am not convinced) why it would take 24 cycles to resolve a floating point operation even though the operation itself is supposed to be done very quickly (2 cycles or something). So even when the operands depend on a previous instruction, I can't see where the cycles come from. Is load/store too slow? What is happening?
I want to be able to explain many of the strange behaviors I get with my code, and often what I know is not enough, so I am looking to better understand the GPU. This small stuff just makes me uncomfortable about writing CUDA code confidently.
I hope I am getting my point across properly, and I’d be surprised if there aren’t other people feeling the same way.
Sounds like you're unfamiliar with computer architecture. In that case, Hennessy and Patterson's book can show you the way.
Because there are probably ~12 stages in the instruction pipeline (remember that pre-Kepler chips ran the ALUs at 2x the base clock). Having that many stages is simply what it takes to reach GHz clock speeds. Floating point operations are complicated, hence the many steps:
1. Fetch the instruction and operands. Maybe ~3 cycles (base clock).
2. Perform arithmetic (or memory access)
a. extract the significands & exponents of the floating point numbers
b. align the significands (only for add/subtract) or add the exponents (multiply only)
c. do the multiply or add
d. pack the result into floating point form
3. Write results back to register file
For most programs, only 1 out of 5 instructions is a load/store.
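Back-of-the-envelope, to connect this to the latency numbers above (rough figures, not official): with ~24 cycles of arithmetic latency and two schedulers each able to issue one instruction per cycle, you need on the order of 2 x 24 = 48 independent warp-instructions in flight per SM to keep the pipelines full. They can come from roughly that many resident warps, or from fewer warps that each have several independent instructions in a row (instruction-level parallelism).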
I have taken a course in computer architecture based on that book :), which is why it sounded weird to me when I read on several occasions that the cores could push 1 floating point operation every 1 or 2 cycles. I had missed the part of the guide that talks about the latency of the operations and the fact that they are pipelined, and I suppose when people say an operation is done every 1-2 cycles they are implicitly referring to that pipeline. One thing I am not sure about now is the difference between double and single precision throughput. Of course it's related to the number of cores capable of doing the operation, but per core (or unit) I have read that the ratio of single precision throughput to double precision throughput is 2:1 on Fermi, and now 1:1 on Kepler. This is confusing, because both units would have to be pipelined, and one of the points of a pipeline is to be able to deliver a result every cycle; perhaps I misread or misunderstood something, so if you could shed some light on this it would be great… By the way, 'Uncle Joe', you mentioned this earlier:
it would indeed be interesting to see how things are actually done.
I am not exactly familiar with the actual hardware implementation of these computation units, and until you mentioned that article I thought we had reached an absolute best in integer and floating point hardware, with optimal transistor usage and power efficiency, but that doesn't seem to be the case! I thought the only difference was the choice of floating point representation, which Nvidia seems to have definitively moved to the newest IEEE standard. In any case, it's good to hear that there is still some headroom for improvement there.
About the remark that only 1 out of 5 instructions is a load/store:
I think you're referring to having the "optimal" ratio of loads to computation, which would be 1 load for every 4-5 operations. Are the load/store units scalar? Are they only for registers? I always thought of them as just scalar units, so this is how I understand it now: instead of having many load/store units that can each perform a load or store in one shot and then sit idle while most of the program doesn't use them (a big waste of transistors and space), you have fewer of them that take some time to deliver the data, but once it arrives the data can be operated on by many cores. Their choice is just a point in the balance they deem optimal in general, but it's fixed in hardware, which practically forces the same balance on all software that runs on it… I don't know if that's the best thing to do; I find the bandwidth limitation to be a real problem with my programs, and the fixes I make aren't getting me as near the peak as I would like…
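A back-of-the-envelope check using the published C2050 numbers (so treat them as approximate): peak single precision is about 448 cores x 2 flops x 1.15 GHz, roughly 1030 GFLOP/s, and peak memory bandwidth is about 144 GB/s, so the balance the hardware "wants" is roughly 1030 / 144, about 7 flops per byte, i.e. nearly 28 flops for every float read from global memory; anything below that and a kernel is memory bound, which fits with the bandwidth limitation I'm seeing.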
When they say a float operation takes 1 cycle, that just means you can issue a new operation to the pipeline every cycle, not that it takes 1 cycle to complete.
No, a 1:1 ratio would be ridiculous. Just think about the bandwidth requirements. Double operations take twice as much bandwidth, which is at a premium in the power limited realm we live in. Why waste 1/2 the bandwidth just to make double performance the same as for single?
A pipeline can stall to do an operation over multiple cycles to deal with hardware resource limitations. I did this when implementing an int multiply instruction. The execution stage of the pipeline just stalls the entire pipeline until the multiply is complete. I’m sure the main reason double operations are done at half the throughput is because they would take twice as long to read the operands from the register file.
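To make that concrete (assuming I have the details right): the register file is built from 32-bit registers and a double occupies a register pair, so a double precision FMA has to read 3 x 64 = 192 bits of operands per thread versus 3 x 32 = 96 bits for the single precision version, and write back a 64-bit result instead of a 32-bit one, which is twice the register-file traffic per instruction.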
What is your application and what performance are you currently getting?
I am not sure what to make of it. What does "that block [FP64 capable] of CUDA cores could execute FP64 instructions at a rate of ¼ FP32 performance" mean? What is it actually comparing: sustained throughput with the pipeline in use, or, as you say, the rate at which new operations can be issued? This is why I want to see official, well-explained documentation of these things. It is also why it would help to know what goes on between the load/store units, the registers, and the cores when an instruction is issued, and how things are mapped to them. I can't see why issuing a double precision instruction would take 4 times as many cycles as issuing a single precision instruction on the previous generation.
This is something that has bewildered me for a while; it may be off topic, but I can't place it anywhere better, so this could be the right place. I use a typedef to switch my program between double and single precision. The Nsight analysis tool shows that on the Tesla C2050, even when I ask for single precision, a good number of double precision operations happen (though far fewer than the single precision ops), and when double is specified, a good number of single precision ops are reported (though fewer than the double ops). I have looked into my code and I am not sure how this is happening; compiling for an old architecture (1.2) doesn't warn me about double not being supported, which is what prompted me to look into this kind of detail.
For my application, an analysis tool once told me the average bandwidth of my kernels on the Tesla C2050 was around 45 GB/s. I am not sure how much this can be trusted, but I added extra flops done in registers, and a severalfold increase in flops only increased the run time by a fraction of that, so we really are memory bound. Without ECC this card can easily do more than 100 GB/s. I suppose computation and memory reads aren't overlapping as much as they ideally could, and the thing is I can't mess with the order. How can one measure the achieved bandwidth without tools? Is it as simple as counting the global reads and writes performed by each thread and dividing by the execution time? Some accesses will be cached, and that won't be visible at the level of the CUDA code, so how would one measure the bandwidth actually achieved?
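The naive thing I can think of is something like the sketch below (made-up names, and it only counts the bytes the kernel is supposed to move once), but I'm not sure how meaningful that is once caching and re-reads come into play:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Simple copy kernel: reads n floats and writes n floats exactly once.
    __global__ void copy_kernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Effective bandwidth = (bytes read + bytes written) / elapsed time.
    float effective_bandwidth_GBs(const float *d_in, float *d_out, int n)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        copy_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);

        double bytes = 2.0 * n * sizeof(float);   // n reads + n writes
        return (float)(bytes / (ms * 1.0e6));     // bytes per ms / 1e6 = GB/s
    }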
Do you have floating point constants in your code? Double precision constants in expressions will silently promote the evaluation to double precision, even if the variables involved are single precision.
Also, the behavior of the compiler is such that if you compile for an architecture that does not support double precision (like compute capability 1.2), then all of the double precision variables and constants are demoted to single precision. There should be a compiler message saying that this has happened, but it is not treated as an error by the compiler.
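A minimal example of the promotion:

    // With a plain 0.5, in[i] is promoted and the multiply is done in double
    // precision (on hardware that supports it), then rounded back to float.
    // With 0.5f, everything stays in single precision.
    __global__ void scale(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            out[i] = in[i] * 0.5;     // double precision multiply
            // out[i] = in[i] * 0.5f; // single precision multiply
        }
    }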
I do have a couple of constants that are used in single-precision-only operations, and the results are always stored in single precision variables. Do I need to explicitly cast these to single precision floats? These constants are very simple, just "0.5"; there's no need to go to double to preserve precision. Unfortunately I do not have access to the Tesla machine right now, so I will try explicit casts tomorrow.
However, I am not sure how these constants explain the single precision operations being done when I force the use of double precision. All in all the system behaves as expected, exactly doubling the time of execution when doing double precision as opposed to single.
The compiler considers 0.5 to be a double precision constant, and 0.5f to be a single precision constant (as pasoleatis says). You do not need to add a cast for single precision constants; you just have to write the constant with an "f" suffix.
I explicitly cast and added an f suffix to all the constants that appear in floating point expressions, even in comparison operations, so a 0.5 became (real)0.5f. Now the analysis tool tells me no double precision operations were registered when running with real = float, but there is still a small number of single precision operations happening when real = double. I suppose I am missing some part of the code.
In any case, having only single precision really boosted performance: from 2.3 ms to 1.8 ms. But now double precision takes more than twice the time, almost three times; I suppose that's natural, as data transfer is limiting the compute performance. The analysis tool says the warp issue stall reasons are Instruction Fetch, Execution Dependency, and Data Request, in increasing order of importance.
Does this happen with your programs? What can I do to decrease execution dependency?
I am sorry this topic is going in a different direction from what it was meant to be, but the reason I wanted to know more about the architectures is really to be able to explain and fix the performance problems I am getting. Thank you all.
So, any chance of getting a description of the differences between Tesla cards and GeForce cards, and of the changes across the Fermi iterations (GF100, 110, 104, 114, …)?
This is the difference between Tesla and GeForce (GF100/110, specifically) for the Fermi generation of cards:
The main differences between GF100/110 and GF104/114 can be found in the description of compute capability 2.0 vs. 2.1. I also tend to sift through Anandtech reviews of new GPUs for some high-level architecture details. I don't know where they get their information from, though. (Perhaps special briefings from NVIDIA when new products come out.)