After extensive profiling of a large workhorse kernel, I see that it uses 109 registers per CUDA thread. This is the main thing nvvp complains about; the memory reads, at least, appear to be coalesced.
This is on a Tesla K20c, which I believe allows a maximum of 255 registers per thread.
The nvvp ‘kernel memory’ analysis stage reports ‘no issues’, but the ‘multiprocessor’ stage states ‘Occupancy may be limited by register usage’. My CUDA kernels usually use somewhere in the range of 25-50 registers, so this is a new situation for me.
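To put a rough number on that warning, here is the back-of-the-envelope arithmetic as I understand it. It is only a sketch: the constants (65536 registers and 64 resident warps per SM) are what I believe applies to compute capability 3.5, the function name is made up for illustration, and it ignores register allocation granularity and block size, so the real occupancy can only be lower than this bound.

```cuda
// Upper bound on resident warps per SM when registers are the only limit.
// Host-only code; the constants are my assumptions for compute capability 3.5
// (GK110) and are not queried from the device.
#include <cstdio>

const int kRegsPerSM     = 65536; // 32-bit registers per SM (assumed, cc 3.5)
const int kMaxWarpsPerSM = 64;    // resident warp limit per SM (assumed, cc 3.5)
const int kWarpSize      = 32;    // threads per warp

// Hypothetical helper: ignores allocation granularity and block size,
// so the real figure can be lower than what this returns.
int maxWarpsByRegisters(int regsPerThread)
{
    int warps = kRegsPerSM / (regsPerThread * kWarpSize);
    return warps < kMaxWarpsPerSM ? warps : kMaxWarpsPerSM;
}

int main()
{
    const int counts[] = {32, 50, 109};
    for (int i = 0; i < 3; ++i) {
        int warps = maxWarpsByRegisters(counts[i]);
        std::printf("%3d regs/thread -> at most %2d of %d warps per SM\n",
                    counts[i], warps, kMaxWarpsPerSM);
    }
    return 0;
}
```

By this estimate, 109 registers per thread allows at most 18 of 64 warps per SM (about 28% occupancy), whereas 32 registers per thread would not limit occupancy at all.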
Each thread in the kernel does quite a bit of work, so the high register count does not really surprise me, but I am wondering whether this is a case where capping the register count (nvcc's -maxrregcount option) or using __launch_bounds__ might be helpful.
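To make the question concrete, this is roughly what I have in mind for the __launch_bounds__ route. It is a minimal sketch, not my actual code: the kernel name, its body, and the two bound values are placeholders chosen for illustration. The cudaFuncGetAttributes call is there so I can check what the compiler actually produced.

```cuda
// Sketch only: workhorseKernel and its body are hypothetical stand-ins.
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256   // assumed launch configuration
#define MIN_BLOCKS_PER_SM 4     // assumed occupancy target

// Tells the compiler that blocks have at most THREADS_PER_BLOCK threads and
// asks for at least MIN_BLOCKS_PER_SM resident blocks per SM. The compiler may
// reduce registers per thread to meet this, possibly by spilling to local memory.
__global__ void
__launch_bounds__(THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM)
workhorseKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];  // placeholder for the real per-thread work
}

int main()
{
    // Query the compiled kernel's resource usage.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, workhorseKernel);
    if (err != cudaSuccess) {
        std::printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("registers per thread:    %d\n", attr.numRegs);
    // Non-zero local memory may indicate register spilling.
    std::printf("local memory per thread: %lu bytes\n",
                (unsigned long)attr.localSizeBytes);
    return 0;
}
```

With these placeholder values, 4 blocks of 256 threads would have to share the SM's registers, which by my arithmetic leaves about 64 registers per thread, so I would expect some spilling. The alternative is compiling with nvcc -maxrregcount=N, which as far as I know applies to every kernel in the file rather than just this one. Either way, adding -Xptxas -v to the compile line should report register and spill counts.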
In general the kernel runs in good time, so I am not really interested in digging through the disassembly, but if I could get at least 20% better performance it would be worth the refactor.
Should I go down that road?