OK, so I'm a newbie, so most probably the performance isn't really that odd and I'm just not thinking about it the way I should.
The program I'm running calls a kernel that uses 60 registers/thread. I'm trying to understand how reducing the register count, and thereby increasing occupancy, affects the overall performance.
The GPU I am running the program on is a Tesla C1060, which has 30 SMs and therefore (at 1024 threads/SM) can have up to 30720 threads active. The kernel operates on 1-D arrays of 40000 elements (not even 2 times the total number of potentially active threads).
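To spell out that arithmetic, here is a tiny Python sketch (not part of my actual program). It only uses the numbers quoted above; the variable names are just mine.

```python
# Numbers quoted in the post for the Tesla C1060.
NUM_SMS = 30
MAX_THREADS_PER_SM = 1024
N_ELEMENTS = 40000

# Upper bound on threads resident on the whole GPU at once.
max_active_threads = NUM_SMS * MAX_THREADS_PER_SM

# How many times the array "fills" the GPU, counting one element per
# potentially active thread.
fill_ratio = N_ELEMENTS / max_active_threads

print(max_active_threads)   # 30720
print(f"{fill_ratio:.2f}")  # 1.30
```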
The odd thing I notice is that, for a given register limit per thread (say 16), higher occupancy doesn't always lead to better performance. For the example of 16 regs/thread, I get 78% occupancy when the block size is 160 and 100% occupancy when the block size is 128, yet the performance I get at 78% occupancy is better. Note that I use 300 bytes of shared memory per block, which doesn't affect the occupancy percentages at all.
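For reference, this is roughly how I understand those occupancy figures to come about. It is a rough Python sketch, not the official calculator: the per-SM limits and allocation granularities below are my assumptions for compute capability 1.3 (as I recall them from NVIDIA's occupancy calculator spreadsheet), and the function name `occupancy` is made up for illustration.

```python
import math

# ASSUMED per-SM limits for compute capability 1.3 (Tesla C1060).
WARP_SIZE = 32
REGS_PER_SM = 16384
SMEM_PER_SM = 16384           # bytes
MAX_WARPS_PER_SM = 32         # 1024 threads / 32
MAX_BLOCKS_PER_SM = 8
# ASSUMED allocation granularities (registers are allocated per block).
REG_ALLOC_UNIT = 512
WARP_ALLOC_GRANULARITY = 2
SMEM_ALLOC_UNIT = 512         # bytes


def round_up(x, unit):
    """Round x up to the next multiple of unit."""
    return -(-x // unit) * unit


def occupancy(block_size, regs_per_thread, smem_per_block):
    """Return (resident blocks per SM, occupancy as a fraction)."""
    warps = math.ceil(block_size / WARP_SIZE)
    # Registers reserved per block, after both rounding steps.
    regs_per_block = round_up(
        round_up(warps, WARP_ALLOC_GRANULARITY) * WARP_SIZE * regs_per_thread,
        REG_ALLOC_UNIT,
    )
    smem = round_up(smem_per_block, SMEM_ALLOC_UNIT)
    blocks = min(
        MAX_BLOCKS_PER_SM,
        MAX_WARPS_PER_SM // warps,
        REGS_PER_SM // regs_per_block,
        SMEM_PER_SM // smem if smem else MAX_BLOCKS_PER_SM,
    )
    return blocks, blocks * warps / MAX_WARPS_PER_SM


print(occupancy(160, 16, 300))  # (5, 0.78125)
print(occupancy(128, 16, 300))  # (8, 1.0)
```

Under these assumptions the sketch reproduces the 78% and 100% I see, and shows that the 300 bytes of shared memory is never the limiting factor.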
Any explanations, or even ideas, are most welcome.
Thank you in advance for any answers.