help me understand 'odd' performance

OK, so I'm a newbie, so most probably the performance isn't actually odd; it's just that I'm not thinking about it the way I should.

The program I'm running calls a kernel that uses 60 registers/thread. I'm trying to understand how reducing the register count, and thereby increasing occupancy, affects overall performance.

The GPU I'm running on is a Tesla C1060, which has 30 SMs and therefore (at 1024 threads/SM) can have up to 30720 active threads. The kernel operates on 1-D arrays of 40000 elements (not even twice the total number of potentially active threads).

The odd thing I notice is that for a given number of registers/thread (say 16), higher occupancy doesn't always lead to better performance. With 16 regs/thread, I get 78% occupancy at a block size of 160 and 100% occupancy at a block size of 128, yet the 78% configuration performs better. Note that I use 300 bytes of shared memory per block, which doesn't affect the occupancy figures at all.
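For what it's worth, the 78% and 100% figures can be reproduced with a back-of-the-envelope calculation. This is a sketch assuming the compute capability 1.2/1.3 resource limits and allocation-granularity rules from NVIDIA's Occupancy Calculator spreadsheet (thread count rounded up to a multiple of 64 for register allocation, register file allocated per block in 512-register chunks):

```python
# Rough occupancy calculation for a Tesla C1060 (compute capability 1.3).
# Limits and rounding rules are assumptions taken from the CUDA
# Occupancy Calculator for cc 1.2/1.3.

def occupancy(regs_per_thread, block_size):
    MAX_THREADS_PER_SM = 1024
    MAX_BLOCKS_PER_SM = 8
    REGISTERS_PER_SM = 16384

    def round_up(x, granularity):
        return -(-x // granularity) * granularity  # ceiling to multiple

    # Registers are allocated per block: thread count rounded up to a
    # multiple of 64, total rounded up to a multiple of 512 registers.
    regs_per_block = round_up(round_up(block_size, 64) * regs_per_thread, 512)

    blocks = min(
        MAX_BLOCKS_PER_SM,
        MAX_THREADS_PER_SM // block_size,
        REGISTERS_PER_SM // regs_per_block,
    )
    return blocks * block_size / MAX_THREADS_PER_SM

print(occupancy(16, 128))  # 1.0     -> 100% occupancy
print(occupancy(16, 160))  # 0.78125 -> ~78%, the observed figure
```

At a block size of 160, the 64-thread rounding makes each block cost 192×16 = 3072 registers, so only 5 blocks (800 threads) fit in the 16384-register file; at 128 threads each block costs 2048 registers and all 8 blocks fit.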

Any explanations, or even ideas, are most welcome.
Thank you in advance for any answers

Have you tried using the CUDA Visual Profiler (cudaprof.exe, comes with the CUDA toolkit) to see what the GPU is doing? You should be able to find out things like how many coalesced and uncoalesced memory accesses are happening, for example.

The performance gain beyond ~50% occupancy can be marginal to non-existent, depending on the memory access pattern. So if the compiler has to do something bad (like spilling automatic variables to memory) in order to keep the register count low, you can easily lose performance by increasing occupancy.

You can check whether register spilling occurs by running nvcc with the --ptxas-options=-v option. If a number for lmem appears, local variables have been allocated in local memory (which is really just part of device memory).
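A quick way to flag spilling is to grep the ptxas output for lmem. The sample line below mimics the "ptxas info" format printed by CUDA toolkits of that era; treat the exact wording as an assumption and adapt the pattern to what your toolkit actually prints:

```python
import re

# Scan `nvcc --ptxas-options=-v` output for local-memory (spill) usage.
# The sample string is an assumed example of the ptxas info format,
# not captured output.
sample = "ptxas info    : Used 60 registers, 24+0 bytes lmem, 40+16 bytes smem"

# lmem is sometimes reported as two numbers (e.g. "24+0 bytes lmem").
match = re.search(r"(\d+)(?:\+(\d+))? bytes lmem", sample)
if match:
    lmem = sum(int(g) for g in match.groups() if g)
    print(f"local memory used: {lmem} bytes -> spilling occurred")
else:
    print("no lmem reported -> no spilling")
```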

Thank you very much for your responses!

Actually, I'm running on Linux, so if you have any idea what the equivalent is in my case, please do share :-)

/usr/local/cuda/computeprof/bin/computeprof

N.

Thank you very much! :-)