Need help on registers and the current CUDA hardware architecture. Got any references for the current CUDA architecture?

I checked the whitepaper for the new CUDA architecture. It mentions that 16 load/store units are used in each shader core to handle the loading and storing of data to/from the cache/DRAM. The problem I'm having is: where are the registers? I also read in a few other slides/tutorials/conference presentations that in the current architecture the registers are off-chip and there is no caching for them. If they are off-chip, what makes them fast? Do operations on registers go through the load/store units as well?

Another problem I'm having: Fermi extends the number of shader cores in each SM to 32, instead of 8 in the current architecture. On Fermi, it makes perfect sense that an SM can do 32 FLOPs in one clock cycle. But how does an SM in the current architecture do 32 FLOPs (one instruction for one warp of threads) with only 8 shader cores? I'd really appreciate it if someone could help me find a reference for the current hardware architecture.

By the way, PTX is a virtual machine assembly language, right? Is PTX translated to the actual assembly code run on the GPU in a one-to-one manner, or is the PTX optimized first and then translated? I'm just looking for some way to really understand the hardware architecture of CUDA, and I think understanding its assembly code would be a good start if there's no good reference for the current architecture (for Fermi, the whitepaper is not too bad a reference, but Fermi seems to be quite different from the current architecture).

The register file(s) are on-chip.

In the current architecture, each instruction effectively gets issued 4 times across the 8 cores of a multiprocessor to service a warp of 32 threads.

Yes, PTX is a virtual machine language. There is JIT compilation of PTX to native machine code in the driver, and that includes optimization. There has been experimental evidence posted here of instruction reordering and other optimizations in the final machine code compared to PTX.
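To make that JIT step concrete, here is a minimal sketch using the CUDA driver API; the PTX string and the kernel name "my_kernel" are placeholders, and most error checking is omitted:

```
#include <cuda.h>    // CUDA driver API (link with -lcuda)
#include <stdio.h>

int main(void)
{
    // Placeholder: in practice this would be PTX emitted by "nvcc -ptx",
    // or the PTX that nvcc embedded in your fat binary.
    const char *ptx_source = "...";   /* NUL-terminated PTX text */

    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction func;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // The driver JIT-compiles (and optimizes) the PTX into native machine
    // code for whatever GPU the context was created on.
    if (cuModuleLoadData(&mod, ptx_source) != CUDA_SUCCESS) {
        printf("JIT compilation of the PTX failed\n");
        return 1;
    }
    cuModuleGetFunction(&func, mod, "my_kernel");  // placeholder kernel name

    // ... set up arguments and launch the kernel ...

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```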

If you are interested in the GT200, this article gives a very thorough and accurate overview.

I’m pretty sure it is impossible (not to mention a performance disaster) to put registers off-chip. The compiler does have the option of spilling some register contents temporarily to “local memory” (which is really just global memory), and that is off-chip. Many kernels do not need local memory, though, since there are so many registers available. Only very complex kernels, or explicitly telling the compiler to limit register use, can force some temporary values into local memory rather than registers. (The GT200 architecture has 16,384 registers per multiprocessor on-chip, and the Fermi whitepaper says there will be 32,768 registers per multiprocessor, which I’m sure also have to be on-chip.)
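As a small, hypothetical illustration of the spilling point (the kernel spill_demo and its constants are made up for the example), you can cap register use with __launch_bounds__ or the -maxrregcount flag and let ptxas report any local-memory spills:

```
// Hypothetical kernel to illustrate register pressure. Compile with
//   nvcc -arch=sm_13 --ptxas-options=-v spill_demo.cu
// and ptxas will print the registers used and any "lmem" (local memory)
// bytes per thread. Forcing a low cap, e.g. nvcc -maxrregcount=16,
// can push some of these temporaries into off-chip local memory.
__global__ void __launch_bounds__(256) spill_demo(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // A handful of live temporaries: normally these all sit in on-chip registers.
    float a = in[i];
    float b = a * a + 1.0f;
    float c = b * a - 2.0f;
    float d = c * b + a;
    out[i] = a + b + c + d;
}
```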

You are correct that there is no caching for the DRAM, at least in the case of thread local memory or normal global memory. If the DRAM is accessed through the texture units, there is a small amount of cache on current chips. With the addition of L1 and L2 cache in Fermi, local memory will be much faster, so register spills will not be nearly as costly as they are now.
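For what it's worth, here is a minimal sketch, with invented names, of reading ordinary global memory through the texture unit so the small on-chip texture cache gets used on current (pre-Fermi) chips:

```
// 1D texture reference bound to a plain device allocation.
texture<float, 1, cudaReadModeElementType> tex_in;

__global__ void read_via_texture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex_in, i);   // fetch goes through the texture cache
}

// Host side (sketch): bind the device pointer to the texture reference, then launch.
//   cudaBindTexture(NULL, tex_in, d_in, n * sizeof(float));
//   read_via_texture<<<(n + 255) / 256, 256>>>(d_out, n);
```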

I think you misunderstood the FLOP counting on the current chips. There are 8 stream processors per SM, with a throughput of one warp every 4 clocks for simple instructions. One of those instructions is a fused multiply-add (a * b + c), which gives you two floating point operations in one instruction. Thus, the performance of an SM currently can be counted as 16 floating point operations per clock cycle. (There is the possibility of dual-issuing a floating point multiply with the fused multiply-add, allowing up to 24 floating point operations per clock cycle.) With Fermi, the warp throughput will be 2 warps every 2 clock cycles, and with the fused multiply-add operation, that gives you 64 floating point operations per clock cycle.
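To make the counting explicit, here is the arithmetic as a tiny C program; the GTX 280-style figures (30 SMs, ~1.296 GHz shader clock) are quoted from memory and only meant as an illustration:

```
#include <stdio.h>

int main(void)
{
    /* Current architecture (GT200): 8 SPs per SM, one MAD (2 flops) per SP per clock. */
    int gt200_flops_per_sm_clock      = 8 * 2;   /* = 16 */
    /* With the dual-issued MUL alongside the MAD: 3 flops per SP per clock. */
    int gt200_flops_per_sm_clock_dual = 8 * 3;   /* = 24 */

    /* Fermi: 32 cores per SM, one FMA (2 flops) per core per clock. */
    int fermi_flops_per_sm_clock      = 32 * 2;  /* = 64 */

    /* Illustrative chip-level peak for a GTX 280-style part:
       30 SMs * 16 flops/clock * ~1.296 GHz shader clock (figures from memory). */
    double gt200_peak_gflops = 30 * gt200_flops_per_sm_clock * 1.296;  /* ~622 GFLOP/s */

    printf("GT200 SM: %d flops/clock (%d with dual-issue MUL)\n",
           gt200_flops_per_sm_clock, gt200_flops_per_sm_clock_dual);
    printf("Fermi SM: %d flops/clock\n", fermi_flops_per_sm_clock);
    printf("Illustrative GT200 chip peak: ~%.0f GFLOP/s\n", gt200_peak_gflops);
    return 0;
}
```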

Yes, some code transformation is applied at the PTX level. If you look at the PTX output from NVCC now, you’ll notice that the code uses static single assignment form, which makes it look like the number of registers required per thread is huge. ptxas performs the final register assignment, and possibly other optimizations, before generating a “cubin”, which is in the native instruction set of whatever NVIDIA GPU generation you are targeting. The driver can also perform this PTX → cubin transformation, which allows nvcc to just embed the PTX in your binary and improves portability as well. (This is the suggested approach, as mentioned in the Fermi compatibility guide.)
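If you want to see the stages yourself, any trivial kernel will do (the file and kernel names here are just examples); the commands in the comments dump the SSA-style PTX and the register count ptxas finally assigns:

```
// saxpy_demo.cu -- a trivial kernel for inspecting the compilation stages.
//
//   nvcc -ptx saxpy_demo.cu                -> saxpy_demo.ptx: virtual ISA in SSA
//                                             form, with many %f / %r "registers"
//   nvcc -cubin -arch=sm_13 saxpy_demo.cu  -> native GT200 binary
//   nvcc --ptxas-options=-v saxpy_demo.cu  -> ptxas prints the real per-thread
//                                             register count after its own
//                                             allocation pass.
__global__ void saxpy_demo(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```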

So who is right??? Or is avidday saying that each instruction effectively gets issued 4 times, over 4 clock cycles?