I’m trying to figure out why both the CUDA Occupancy Calculator and the profiler log report an occupancy of 0.5 for my CUDA kernel. The cubin file shows:
lmem = 28 smem = 88 reg = 26 bar = 0
My grid is (500,1) and my block is (500,1,1), so I have 500 threads per block and 500 blocks in total. First, I’m not sure why it shows local memory being used. If the kernel uses 26 registers per thread and a warp contains 32 threads, shouldn’t the register count for one warp be 26 * 32 = 832, which is far less than the 16384 registers on the Tesla board? According to the documentation, the number of active warps per multiprocessor is 32, so the register count would be 32 * 32 * 26 = 26624, which exceeds 16384. Is that why it spills into local memory? And does the MP really store the state of 1024 active threads? I don’t have more than 500 threads per block in the first place.
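To make the register arithmetic concrete, here is a rough Python sketch that uses only the figures quoted above (26 registers per thread, 500 threads per block, 16384 registers and 32 active warps per multiprocessor). The function names are made up for illustration. It assumes registers are reserved for whole warps with no further allocation granularity, and it ignores shared memory and any other per-multiprocessor limit, so it is an estimate and not what the Occupancy Calculator actually computes.

```python
import math

# Figures quoted in the post; nothing here is measured.
REGS_PER_THREAD = 26       # "reg = 26" from the cubin
THREADS_PER_BLOCK = 500    # block is (500,1,1)
WARP_SIZE = 32
REGS_PER_MP = 16384        # register file size quoted for the Tesla board
MAX_WARPS_PER_MP = 32      # active-warp limit quoted from the documentation


def warps_per_block(threads_per_block, warp_size=WARP_SIZE):
    # A partially filled warp still occupies a whole warp slot.
    return math.ceil(threads_per_block / warp_size)


def regs_per_block(threads_per_block, regs_per_thread):
    # Assumption: registers are reserved for every lane of every warp,
    # with no additional allocation granularity.
    return warps_per_block(threads_per_block) * WARP_SIZE * regs_per_thread


def register_limited_occupancy(threads_per_block, regs_per_thread):
    # Resident blocks are capped by the register file and by the warp limit;
    # shared memory and other limits are deliberately left out.
    warps = warps_per_block(threads_per_block)
    by_regs = REGS_PER_MP // regs_per_block(threads_per_block, regs_per_thread)
    by_warps = MAX_WARPS_PER_MP // warps
    blocks = min(by_regs, by_warps)
    return blocks * warps / MAX_WARPS_PER_MP


if __name__ == "__main__":
    print(warps_per_block(THREADS_PER_BLOCK))
    print(regs_per_block(THREADS_PER_BLOCK, REGS_PER_THREAD))
    print(register_limited_occupancy(THREADS_PER_BLOCK, REGS_PER_THREAD))
```

Under these assumptions a 500-thread block rounds up to 16 warps and needs 13312 registers, so only one block fits in the register file at a time. That gives 16 of 32 warps, which would be consistent with the reported 0.5. The sketch does not address the lmem figure.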
I’m trying to use constant memory to reduce the number of registers, since the kernel takes many arguments, but it does not seem to change the register usage much.
Any clarification would be highly appreciated.