help me understand 'odd' performance

OK, so I'm a newbie, so most probably the performance isn't actually odd; it's just that I'm not thinking about it the way I should.

The program I'm running calls a kernel that uses 60 registers/thread. I'm trying to understand how reducing the register count, and thereby increasing occupancy, affects overall performance.

The GPU I'm running on is a Tesla C1060, which has 30 SMs and therefore (at 1024 threads/SM) can have up to 30720 active threads. The kernel operates on 1-D arrays of 40000 elements (not even twice the total number of potentially active threads).

The odd thing I notice is that for a given number of registers/thread (say 16), higher occupancy doesn't always lead to better performance. With 16 regs/thread, I get 78% occupancy at a block size of 160 and 100% occupancy at a block size of 128, yet the 78% configuration performs better. Note that I use 300 bytes of shared memory per block, which doesn't affect the occupancy figures at all.
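For what it's worth, the 78% and 100% figures can be reproduced with a back-of-the-envelope calculation. This is a sketch assuming the compute capability 1.2/1.3 resource limits and allocation-granularity rules from NVIDIA's Occupancy Calculator spreadsheet (thread count rounded up to a multiple of 64 for register allocation, register file allocated per block in 512-register chunks):

```python
# Rough occupancy calculation for a Tesla C1060 (compute capability 1.3).
# Limits and rounding rules are assumptions taken from the CUDA
# Occupancy Calculator for cc 1.2/1.3.

def occupancy(regs_per_thread, block_size):
    MAX_THREADS_PER_SM = 1024
    MAX_BLOCKS_PER_SM = 8
    REGISTERS_PER_SM = 16384

    def round_up(x, granularity):
        return -(-x // granularity) * granularity  # ceiling to multiple

    # Registers are allocated per block: thread count rounded up to a
    # multiple of 64, total rounded up to a multiple of 512 registers.
    regs_per_block = round_up(round_up(block_size, 64) * regs_per_thread, 512)

    blocks = min(
        MAX_BLOCKS_PER_SM,
        MAX_THREADS_PER_SM // block_size,
        REGISTERS_PER_SM // regs_per_block,
    )
    return blocks * block_size / MAX_THREADS_PER_SM

print(occupancy(16, 128))  # 1.0     -> 100% occupancy
print(occupancy(16, 160))  # 0.78125 -> ~78%, the observed figure
```

At a block size of 160, the 64-thread rounding makes each block cost 192×16 = 3072 registers, so only 5 blocks (800 threads) fit in the 16384-register file; at 128 threads each block costs 2048 registers and all 8 blocks fit.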

Any explanations, or even ideas, are most welcome.
Thank you in advance for any answers

Have you tried using the CUDA Visual Profiler (cudaprof.exe, comes with the CUDA toolkit) to see what the GPU is doing? You should be able to find out things like how many coalesced and uncoalesced memory accesses are happening, for example.

The performance gain beyond ~50% occupancy can be marginal to non-existent, depending on the memory access pattern. So if the compiler has to do something bad (like spilling automatic variables to memory) in order to keep the register count low, you can easily lose performance by increasing occupancy.

You can check whether register spilling occurs by running nvcc with the --ptxas-options=-v option. If a number for lmem appears, local variables have been allocated in local memory (which is really just part of device memory).
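A quick way to flag spilling is to grep the ptxas output for lmem. The sample line below mimics the "ptxas info" format printed by CUDA toolkits of that era; treat the exact wording as an assumption and adapt the pattern to what your toolkit actually prints:

```python
import re

# Scan `nvcc --ptxas-options=-v` output for local-memory (spill) usage.
# The sample string is an assumed example of the ptxas info format,
# not captured output.
sample = "ptxas info    : Used 60 registers, 24+0 bytes lmem, 40+16 bytes smem"

# lmem is sometimes reported as two numbers (e.g. "24+0 bytes lmem").
match = re.search(r"(\d+)(?:\+(\d+))? bytes lmem", sample)
if match:
    lmem = sum(int(g) for g in match.groups() if g)
    print(f"local memory used: {lmem} bytes -> spilling occurred")
else:
    print("no lmem reported -> no spilling")
```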

Thank you very much for your responses!

Actually, I'm running on Linux, so if you have any idea what the equivalent is in my case, please do share :-)

/usr/local/cuda/computeprof/bin/computeprof

N.

Thank you very much! :-)