how many registers are needed for my kernel Is there a short explanation how to count the number of

xrismf · January 21, 2011, 6:02pm

Hello,
I am currently optimizing a CUDA kernel and I realized that I’m not sure how many registers are needed for certain statements (function parameters, variables, while, for, if, nested arithmetic expressions…). Is there a way to be certain or at least some rule of thumb?
Kind Regards

Gregory_Diamos · January 21, 2011, 6:11pm

Write a few example statements and compile with '--ptxas-options=-v'. The per-kernel register counts it prints should give you a pretty good idea of how the allocator is working.

xrismf · January 22, 2011, 2:36pm

Thank you, this is really helpful. I have one other basic question regarding this. If my device supports a total of 32768 registers per block, does it mean there are always 32768 registers available for each block or does this depend on the number of blocks or something else?

For example:

my kernel requires 31 registers
my device provides 32768 reg/block
each block contains 768 threads
i.e. each thread could get up to 32768/768 ≈ 42 registers

So I might assume here that the number of registers is not a bottleneck of my application?

avidday · January 22, 2011, 2:48pm

Register allocation is always determined during compilation, and it (partly) determines how many blocks will be active on a given multiprocessor at a time. So the per-multiprocessor calculation looks more like:

1. Your kernel requires 31 registers per thread
2. You request 768 threads per block
3. Therefore one block requires 768 * 31 = 23808 registers
4. You have a per-MP limit of 32768 registers, therefore only 1 block can be active at once on a given MP

I say "more like" because in reality things are a bit more complex than that. Registers are assigned to blocks in pages, so there is some rounding up to the page size, and other resources such as shared memory can also constrain how many blocks will be active. NVIDIA offers an occupancy calculator spreadsheet which lets you play with execution parameters and kernel resource requirements and see how they interact.
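The arithmetic above can be sketched in a few lines of host-side C++. This is only a rough model of the register limit, not what the hardware or the occupancy calculator actually does: the function names are made up for illustration, and `alloc_unit` is a hypothetical stand-in for the page size mentioned above (the real granularity depends on the architecture). Shared memory and other limits are ignored.

```cpp
#include <cstdint>

// Registers one block needs: registers per thread times threads per block,
// rounded up to a multiple of alloc_unit.
// alloc_unit is a HYPOTHETICAL allocation granularity used only to show the
// effect of rounding; the default of 1 means "no rounding".
// Assumes all arguments are non-zero.
constexpr std::uint32_t regs_per_block(std::uint32_t regs_per_thread,
                                       std::uint32_t threads_per_block,
                                       std::uint32_t alloc_unit = 1) {
    return (regs_per_thread * threads_per_block + alloc_unit - 1)
           / alloc_unit * alloc_unit;
}

// How many blocks fit on one multiprocessor if registers were the only limit.
constexpr std::uint32_t blocks_per_mp_by_registers(std::uint32_t regs_per_thread,
                                                   std::uint32_t threads_per_block,
                                                   std::uint32_t regs_per_mp,
                                                   std::uint32_t alloc_unit = 1) {
    return regs_per_mp /
           regs_per_block(regs_per_thread, threads_per_block, alloc_unit);
}

// The numbers from this thread: 31 registers/thread, 768 threads/block,
// 32768 registers per multiprocessor.
static_assert(regs_per_block(31, 768) == 23808, "one block's register demand");
static_assert(blocks_per_mp_by_registers(31, 768, 32768) == 1,
              "only one such block fits per multiprocessor");
```

Under the same simplified model, dropping to 256 threads per block gives `blocks_per_mp_by_registers(31, 256, 32768) == 4`, which is the kind of trade-off the occupancy spreadsheet lets you explore with the real rounding rules.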

xrismf · January 22, 2011, 9:29pm

Thank you, this will definitely help me to adjust the kernel. I have one more question though. When compiling with -arch=sm_20 I get the following output:

What is the cmem[n] referring to? Local, constant and texture memory?

avidday · January 22, 2011, 9:45pm

The cmem is constant memory. On a compute 2.x device, that will include the arguments to your kernel, which are stored in constant memory (older architectures used shared memory for this). The numbers in brackets are probably bank numbers, although that is just a guess.

xrismf · January 22, 2011, 9:58pm

External Image