I am currently optimizing a CUDA kernel and I realized that I’m not sure how many registers are needed for certain statements (function parameters, variables, while, for, if, nested arithmetic expressions…). Is there a way to be certain or at least some rule of thumb?
Write a few example statements and compile with '--ptxas-options=-v'. It should give you a pretty good idea of how the allocator is working.
Thank you, this is really helpful. I have one other basic question regarding this. If my device supports a total of 32768 registers per block, does it mean there are always 32768 registers available for each block or does this depend on the number of blocks or something else?
my kernel requires 31 registers
my device provides 32768 reg/block
each block contains 768 threads
i.e. each thread gets 32768/768 ≈ 42 registers
So I might assume here that the number of registers is not a bottleneck of my application?
Register allocation is always determined during compilation, and it (partly) determines how many blocks will be active on a given multiprocessor at a time. So the per-multiprocessor calculation looks more like:
Your kernel requires 31 registers per thread
You request 768 threads per block
Therefore one block requires 768 * 31 = 23808 registers
You have a per MP limit of 32768 registers, therefore only 1 block can be active at once on a given MP
I say "more like" because in reality things are a bit more complex than that. Registers are assigned to blocks in pages, so there is some rounding up to the page size, and other resources like shared memory can also constrain how many blocks will be active. NVIDIA offers an occupancy calculator spreadsheet which lets you play with execution parameters and kernel resource requirements and see how they interact.
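To make the arithmetic above concrete, here is a minimal host-side sketch in plain C++ (no CUDA needed). It looks only at the register limit and ignores shared memory and every other constraint. The function names and the alloc_granularity parameter are made up for illustration; the real page size depends on the architecture, so treat that parameter as a placeholder rather than an actual hardware value.

```cpp
#include <cassert>
#include <cstdint>

// Round `value` up to the next multiple of `granularity` (granularity must be > 0).
std::uint32_t round_up(std::uint32_t value, std::uint32_t granularity) {
    assert(granularity > 0);
    return (value + granularity - 1) / granularity * granularity;
}

// Upper bound on the number of blocks that can be active on one multiprocessor,
// considering registers only.
// alloc_granularity is a hypothetical stand-in for the register "page" size;
// the default of 1 means no rounding is applied.
std::uint32_t blocks_per_mp_by_registers(std::uint32_t regs_per_thread,
                                         std::uint32_t threads_per_block,
                                         std::uint32_t regs_per_mp,
                                         std::uint32_t alloc_granularity = 1) {
    // Registers one block needs, rounded up to the allocation granularity.
    std::uint32_t regs_per_block =
        round_up(regs_per_thread * threads_per_block, alloc_granularity);
    assert(regs_per_block > 0);
    // Integer division: only whole blocks can be resident.
    return regs_per_mp / regs_per_block;
}
```

With the numbers from this thread, blocks_per_mp_by_registers(31, 768, 32768) gives 1. Dropping to 256 threads per block gives 4, which is the kind of trade-off the occupancy calculator lets you explore.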
Thank you, this will definitely help me to adjust the kernel. I have one more question though. When compiling with -arch=sm_20 I get the following output:
What is the cmem[n] referring to? Local, constant and texture memory?
The cmem is constant memory. On a compute 2.x device, that will include the arguments to your kernel, which are stored in constant memory (on older architectures shared memory was used). The numbers in brackets are probably bank numbers, although that is just a guess.