Hi,
I’m trying to figure out why the CUDA occupancy calculator and the profiler log show an occupancy of 0.5 while executing my CUDA kernel. The cubin file shows:
lmem = 28
smem = 88
reg = 26
bar = 0
My grid is (500,1) and my block is (500,1,1), so I have 500 threads per block and a total of 500 blocks. Firstly, I’m not sure why local memory is being used at all. If the number of registers is 26 and an active warp contains 32 threads, shouldn’t the total register count be 26 * 32 = 832, which is much less than the 16384 registers on the Tesla board? As per the documentation, the number of active warps per multiprocessor is 32, in which case the register count would be 32 * 32 * 26 = 26624 and it would spill into local memory. But does the MP really store the state of 1024 active threads? Besides, I don’t have more than 500 threads per block in the first place.
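For what it’s worth, here is a minimal hand calculation of how 0.5 can fall out of those numbers if registers are the limiting resource, assuming a GT200-class SM with 16384 registers and at most 32 resident warps (those limits are my assumption, so check them against your board), and ignoring register allocation granularity, which does not change the conclusion here:

/* Hedged sketch: occupancy arithmetic for a 500-thread block
 * at 26 registers per thread. SM limits are assumed values. */
#include <stdio.h>

int main(void)
{
    const int threads_per_block = 500;
    const int regs_per_thread   = 26;
    const int regs_per_sm       = 16384; /* assumed */
    const int max_warps_per_sm  = 32;    /* assumed */

    int warps_per_block = (threads_per_block + 31) / 32;       /* 16 */
    int regs_per_block  = threads_per_block * regs_per_thread; /* 13000 */
    int blocks_per_sm   = regs_per_sm / regs_per_block;        /* 1 */
    int active_warps    = blocks_per_sm * warps_per_block;     /* 16 */

    printf("occupancy = %d/%d = %.2f\n", active_warps, max_warps_per_sm,
           (float)active_warps / max_warps_per_sm);            /* 0.50 */
    return 0;
}

A second 500-thread block would need another 13000 registers, which does not fit, so only 16 of the 32 possible warps are resident: 16/32 = 0.5.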
I’m trying to use constant memory to reduce the number of registers, since there are many arguments to the kernel, but it does not seem to change the register usage by much.
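For concreteness, the constant-memory pattern I am trying looks roughly like this; the struct and the names (Args, g_args, myKernel) are made up for illustration:

/* Hedged sketch: packing kernel arguments into a __constant__ struct. */
struct Args {
    float *in0, *in1, *out; /* device pointers */
    int   n;
    float scale;
};

__constant__ Args g_args; /* read-only from the device, cached on chip */

__global__ void myKernel(void)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < g_args.n)
        g_args.out[i] = g_args.scale * (g_args.in0[i] + g_args.in1[i]);
}

/* Host side, once before the launch:
 *   Args h = { d_in0, d_in1, d_out, n, 2.0f };
 *   cudaMemcpyToSymbol(g_args, &h, sizeof(h));
 *   myKernel<<<500, 500>>>();
 */

The loaded values still end up in registers while they are being used, which would explain why the register count barely moves.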
Spilling to lmem also takes place when you use an array in the kernel and the indexing cannot be resolved at compile time. In your case the register count is high and smem usage is low. Reducing register usage and using smem instead might help; see the sketch below.
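A minimal sketch of that suggestion, assuming a per-thread scratch array of 8 floats and a made-up block size (both are illustrative; size the shared array to your real block):

#define BLOCK 64 /* illustrative; 500 threads * 8 floats * 4 bytes would
                    already be close to the 16 KB smem limit */

/* Dynamically indexed per-thread array: the compiler typically places
 * it in lmem, because the register file is not indexable. */
__global__ void withLocalArray(const int *sel, float *out)
{
    float scratch[8];
    for (int k = 0; k < 8; ++k)
        scratch[k] = 2.0f * k;
    out[threadIdx.x] = scratch[sel[threadIdx.x] & 7];
}

/* The same scratch space carved out of shared memory instead. */
__global__ void withSharedArray(const int *sel, float *out)
{
    __shared__ float scratch[BLOCK * 8];
    float *mine = &scratch[threadIdx.x * 8]; /* this thread's slice */
    for (int k = 0; k < 8; ++k)
        mine[k] = 2.0f * k;
    out[threadIdx.x] = mine[sel[threadIdx.x] & 7];
}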
I’m not using any array in the kernel. It’s only a bunch of variables, and they access the global memory arrays that are passed to the kernel. I’ve tried to reduce the register count by cutting down the number of variables and using constant memory, but it does not help: the register count only goes down to 24, which still puts the total for 1024 threads above the register limit.
Now, I have two questions:
Is there a general way to reduce the register count? I have tried to hand-optimize the code as well, but in vain :( Would using a shared memory array in place of local variables decrease the register count? Also, I am passing 9 global memory addresses to the kernel. Can I pass those addresses using a constant memory array?
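To make that last question concrete, the arrangement being asked about might look like this; g_ptrs and kernel are hypothetical names:

/* Hedged sketch: stashing the 9 device pointers in constant memory
 * instead of passing them as kernel arguments. */
__constant__ float *g_ptrs[9];

__global__ void kernel(int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) /* purely illustrative use of the pointers */
        g_ptrs[8][i] = g_ptrs[0][i] + g_ptrs[1][i];
}

/* Host side, whenever the pointers change:
 *   float *h_ptrs[9] = { d_a0, d_a1, d_a2, d_a3, d_a4,
 *                        d_a5, d_a6, d_a7, d_out };
 *   cudaMemcpyToSymbol(g_ptrs, h_ptrs, sizeof(h_ptrs));
 */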
Occupancy would only matter if the MP is stalling while waiting for memory transactions to finish. I have profiled the number of memory transactions and the number of instructions executed. Is there a way to figure out from this information whether the MP is idle or not? Or should I profile something else to get an idea of whether the kernel is memory-bound or compute-bound?
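One rough way to read those two counters is sketched below; every number in it is a placeholder, and the peak figures are assumed GT200-class values, so substitute your board’s specs and your real counter readings:

/* Hedged back-of-envelope: compare how long the profiled work would take
 * at peak instruction throughput vs. at peak memory bandwidth. */
#include <stdio.h>

int main(void)
{
    double instructions = 1.0e8; /* profiler instruction count (placeholder) */
    double mem_bytes    = 4.0e8; /* transactions * bytes each (placeholder) */

    double peak_instr_per_s = 3.0e11; /* assumed device peak */
    double peak_bytes_per_s = 1.0e11; /* assumed device bandwidth */

    double t_compute = instructions / peak_instr_per_s;
    double t_memory  = mem_bytes    / peak_bytes_per_s;

    printf("%s\n", t_memory > t_compute ? "likely memory-bound"
                                        : "likely compute-bound");
    return 0;
}

If the memory side dominates, more resident warps give the scheduler something to hide that latency with, and occupancy matters; if the compute side dominates, chasing occupancy will not buy much.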