CUDA Occupancy and Register Usage

I’m trying to figure out why the CUDA occupancy calculator and the profiler log show an occupancy of 0.5 while executing my CUDA kernel. The cubin file shows:

lmem = 28
smem = 88
reg  = 26
bar  = 0

My grid is (500,1) and my block is (500,1,1), so I have 500 threads per block and a total of 500 blocks. Firstly, I’m not sure why it shows local memory being used. If the number of registers is 26 and an active warp contains 32 threads, shouldn’t the total register count be 26 * 32 = 832, which is much less than the 16384 registers on a Tesla board? Per the documentation, the number of active warps per multiprocessor is 32, so the register count would be 32 * 32 * 26 = 26624, which spills into local memory. But does the MP really store the state of 1024 active threads? I don’t have more than 500 threads per block in the first place.

I’m trying to use constant memory to reduce the number of registers, since the kernel takes many arguments, but it does not seem to change the register usage by much.
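For reference, the usual pattern for this is to bundle the scalar arguments into a struct in `__constant__` memory and copy it once from the host. A hedged sketch — `KernelParams`, `params`, and `myKernel` are illustrative names, not from the original code:

```cuda
// Hypothetical sketch: scalar kernel arguments moved into __constant__
// memory, so they are read through the constant cache instead of being
// passed as kernel parameters.
struct KernelParams {
    int   n;
    float alpha, beta;
};

__constant__ KernelParams params;   // lives in constant memory

__global__ void myKernel(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < params.n)
        out[i] = params.alpha * in[i] + params.beta;
}

// Host side, before launching:
//   KernelParams h = { n, 2.0f, 1.0f };
//   cudaMemcpyToSymbol(params, &h, sizeof(h));
```

Note that on compute 1.x hardware, kernel parameters are already passed through shared memory, which may explain why this barely changes the register count.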

Any clarification would be highly appreciated.



First, I suspect 500 threads is not that good a number; you should try to use a multiple of 32. If you search the forums I think you’ll find a reference to this (and whether or not it’s that important) from MisterAnderson42.

As for the lmem usage, this doesn’t happen only because registers spill to lmem. Certain functions use lmem by default, such as sin, cos and the like, if I remember correctly.

The best approach I’ve come up with is to start from an empty kernel, uncomment a few lines of code at a time, and see which lines consume registers/lmem, then try to optimize them. You can also try to use shared memory instead of registers, if that’s applicable.

Keep in mind the dead-code optimizer in this process :) — if a line’s result is never written anywhere, the compiler eliminates it and the register counts you measure will be misleading.

hope this helps,


You’ve used 26 x 500 = 13000 registers per block. Two such blocks would need 26000 registers, more than the 16384 available per multiprocessor, so only one block can be resident: 16 active warps out of a maximum of 32, i.e. 0.5 occupancy.

Spilling to lmem also takes place when you use an array in the kernel and the indexing is not a compile-time constant. In your case the register count is high and smem usage is low.

Reducing register usage and using smem instead might help.
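To illustrate what I mean by non-constant indexing — a hedged sketch, kernel names made up:

```cuda
// A per-thread array indexed with a runtime value cannot be kept in
// registers on compute 1.x hardware, so the compiler places it in lmem.
__global__ void spills(float *out, int k)
{
    float buf[8];                       // dynamically indexed -> lmem
    for (int i = 0; i < 8; ++i)
        buf[i] = i * 2.0f;
    out[threadIdx.x] = buf[k % 8];      // runtime index forces the spill
}

// The shared-memory variant keeps the data on chip instead.
__global__ void stays_on_chip(float *out, int k)
{
    __shared__ float buf[8];
    if (threadIdx.x < 8)
        buf[threadIdx.x] = threadIdx.x * 2.0f;
    __syncthreads();
    out[threadIdx.x] = buf[k % 8];      // runtime index is fine in smem
}
```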

I’m not using any array in the kernel. It’s only a bunch of variables, and they access the global memory arrays that are passed to the kernel. I’ve tried to reduce the register count by reducing the number of variables and using constant memory, but it does not help: the register count only goes down to 24, which still makes the total number of registers for 1024 threads greater than the maximum.

Now, I have two questions:

  1. Is there a general way to reduce the register count? I have tried to hand-optimize the code as well, but in vain :( Would using a shared memory array in place of local variables decrease the register count? Also, I am passing 9 global memory addresses to the kernel. Can I pass those addresses using a constant memory array?

  2. Occupancy should only matter if the MP is stalling while waiting for memory transactions to finish. I have profiled the number of memory transactions and the number of instructions executed. Is there a way to figure out from this information whether the MP is idle? Or should I profile other counters to get an idea of whether the kernel is memory-bound or compute-bound?
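On question 1: yes, the nine pointers can live in a `__constant__` array instead of being kernel parameters. A hedged sketch, assuming the buffers were allocated with `cudaMalloc` — the names `g_bufs` and `kernelUsingConstPtrs` are hypothetical:

```cuda
// Device pointers stored in constant memory rather than passed as
// kernel arguments; copied from the host once with cudaMemcpyToSymbol.
__constant__ float *g_bufs[9];

__global__ void kernelUsingConstPtrs(int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        g_bufs[1][i] = g_bufs[0][i];    // e.g. copy buffer 0 into buffer 1
}

// Host side, with d_ptr[0..8] holding the cudaMalloc'd addresses:
//   cudaMemcpyToSymbol(g_bufs, d_ptr, 9 * sizeof(float *));
```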
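On question 2: one back-of-the-envelope test from those two counters is to compare the time the memory system needs against the time the ALUs need. A sketch with made-up counter values and assumed peak numbers for a GT200-class board (~100 GB/s total bandwidth across ~30 SMs, roughly one warp instruction per cycle per SM at ~1.3 GHz) — every number here is an assumption to be replaced with your own:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical profiler counters for one SM. */
    double mem_transactions = 4.0e6;    /* 64-byte memory transactions */
    double instructions     = 2.0e7;    /* warp instructions executed  */

    double bytes_moved = mem_transactions * 64.0;

    /* Assumed peaks: 100 GB/s shared by 30 SMs; 1 warp instr/cycle at 1.3 GHz. */
    double mem_time = bytes_moved / (100e9 / 30.0);
    double alu_time = instructions / 1.3e9;

    printf("mem time %.1f ms, alu time %.1f ms -> %s-bound\n",
           mem_time * 1e3, alu_time * 1e3,
           mem_time > alu_time ? "memory" : "compute");
    return 0;
}
```

If the memory time dominates by a wide margin, the kernel is memory-bound and occupancy (latency hiding) matters; if the ALU time dominates, raising occupancy won’t help much.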