CUDA Occupancy and Register Usage

I’m trying to figure out why the CUDA occupancy calculator and the profiler log show an occupancy of 0.5 while executing my CUDA kernel. The cubin file shows:

lmem = 28
smem = 88
reg  = 26
bar  = 0

My grid is (500,1) and my block is (500,1,1), so I have 500 threads per block and a total of 500 blocks. Firstly, I’m not sure why it shows local memory being used. If the number of registers is 26 and an active warp contains 32 threads, shouldn’t the total register count be 26 * 32 = 832, which is much less than the 16384 registers on a Tesla board? Per the documentation, the number of active warps per multiprocessor is 32, so the register count would be 32 * 32 * 26 = 26624, which spills into local memory. But does the MP really store the state of 1024 active threads? I don’t have more than 500 threads per block in the first place.

I’m trying to use constant memory to reduce the number of registers, since the kernel takes many arguments, but it does not seem to change the register usage by much.
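For reference, the usual pattern for this is to bundle the scalar arguments into a struct in `__constant__` memory and copy it once from the host. A hedged sketch — `KernelParams`, `params`, and `myKernel` are illustrative names, not from the original code:

```cuda
// Hypothetical sketch: scalar kernel arguments moved into __constant__
// memory, so they are read through the constant cache instead of being
// passed as kernel parameters.
struct KernelParams {
    int   n;
    float alpha, beta;
};

__constant__ KernelParams params;   // lives in constant memory

__global__ void myKernel(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < params.n)
        out[i] = params.alpha * in[i] + params.beta;
}

// Host side, before launching:
//   KernelParams h = { n, 2.0f, 1.0f };
//   cudaMemcpyToSymbol(params, &h, sizeof(h));
```

Note that on compute 1.x hardware, kernel parameters are already passed through shared memory, which may explain why this barely changes the register count.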

Any clarification would be highly appreciated.



First, I suspect 500 threads is not that good a number; you should try to use a multiple of 32. If you search the forums I think you’ll find a reference to this (and whether or not it’s that important) from MisterAnderson42.

As for the lmem usage, this doesn’t happen only because registers spill to lmem. Certain functions use lmem by default, such as sin, cos and the like, if I remember correctly.

The best approach I’ve come up with is to start from an empty kernel, uncomment a few lines of code at a time, and see which lines consume registers/lmem, then try to optimize them. You can also try to use shared memory instead of registers, if that’s applicable.

Keep in mind the dead-code optimizer in this process :) — if a line’s result is never written anywhere, the compiler eliminates it and the register counts you measure will be misleading.

hope this helps,


You’ve used 26 x 500 = 13000 registers per block. Two such blocks would need 26000 registers, more than the 16384 available per multiprocessor, so only one block can be resident: 16 active warps out of a maximum of 32, i.e. 0.5 occupancy.

Spilling to lmem also takes place when you use an array in the kernel and the indexing is not a compile-time constant. In your case the register count is high and smem usage is low.

Reducing register usage and using smem instead might help.
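To illustrate what I mean by non-constant indexing — a hedged sketch, kernel names made up:

```cuda
// A per-thread array indexed with a runtime value cannot be kept in
// registers on compute 1.x hardware, so the compiler places it in lmem.
__global__ void spills(float *out, int k)
{
    float buf[8];                       // dynamically indexed -> lmem
    for (int i = 0; i < 8; ++i)
        buf[i] = i * 2.0f;
    out[threadIdx.x] = buf[k % 8];      // runtime index forces the spill
}

// The shared-memory variant keeps the data on chip instead.
__global__ void stays_on_chip(float *out, int k)
{
    __shared__ float buf[8];
    if (threadIdx.x < 8)
        buf[threadIdx.x] = threadIdx.x * 2.0f;
    __syncthreads();
    out[threadIdx.x] = buf[k % 8];      // runtime index is fine in smem
}
```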

I’m not using any array in the kernel. It’s only a bunch of variables, and they access the global memory arrays that are passed to the kernel. I’ve tried to reduce the register count by reducing the number of variables and using constant memory, but it does not help: the register count only goes down to 24, which still makes the total number of registers for 1024 threads greater than the maximum.

Now, I have two questions:

  1. Is there a general way to reduce the register count? I have tried to hand-optimize the code as well, but in vain :( Would using a shared memory array in place of local variables decrease the register count? Also, I am passing 9 global memory addresses to the kernel. Can I pass those addresses using a constant memory array?

  2. Occupancy should only matter if the MP is stalling while waiting for memory transactions to finish. I have profiled the number of memory transactions and the number of instructions executed. Is there a way to figure out from this information whether the MP is idle? Or should I profile other counters to get an idea of whether the kernel is memory-bound or compute-bound?
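On question 1: yes, the nine pointers can live in a `__constant__` array instead of being kernel parameters. A hedged sketch, assuming the buffers were allocated with `cudaMalloc` — the names `g_bufs` and `kernelUsingConstPtrs` are hypothetical:

```cuda
// Device pointers stored in constant memory rather than passed as
// kernel arguments; copied from the host once with cudaMemcpyToSymbol.
__constant__ float *g_bufs[9];

__global__ void kernelUsingConstPtrs(int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        g_bufs[1][i] = g_bufs[0][i];    // e.g. copy buffer 0 into buffer 1
}

// Host side, with d_ptr[0..8] holding the cudaMalloc'd addresses:
//   cudaMemcpyToSymbol(g_bufs, d_ptr, 9 * sizeof(float *));
```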
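On question 2: one back-of-the-envelope test from those two counters is to compare the time the memory system needs against the time the ALUs need. A sketch with made-up counter values and assumed peak numbers for a GT200-class board (~100 GB/s total bandwidth across ~30 SMs, roughly one warp instruction per cycle per SM at ~1.3 GHz) — every number here is an assumption to be replaced with your own:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical profiler counters for one SM. */
    double mem_transactions = 4.0e6;    /* 64-byte memory transactions */
    double instructions     = 2.0e7;    /* warp instructions executed  */

    double bytes_moved = mem_transactions * 64.0;

    /* Assumed peaks: 100 GB/s shared by 30 SMs; 1 warp instr/cycle at 1.3 GHz. */
    double mem_time = bytes_moved / (100e9 / 30.0);
    double alu_time = instructions / 1.3e9;

    printf("mem time %.1f ms, alu time %.1f ms -> %s-bound\n",
           mem_time * 1e3, alu_time * 1e3,
           mem_time > alu_time ? "memory" : "compute");
    return 0;
}
```

If the memory time dominates by a wide margin, the kernel is memory-bound and occupancy (latency hiding) matters; if the ALU time dominates, raising occupancy won’t help much.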