I am still new to CUDA and have a kernel that takes most of my computing time on the GPU. The NVIDIA profiler reports 33% occupancy with a register count of 50.
I can see why the register count is so high, but I have no idea how to reduce the number of registers used. Is there anything I can do?
Welcome to the CUDA forums, Puffski!
In case you are not yet familiar with our new forum clown: don’t bother following his advice. The compiler of course does its own register allocation, so such changes would not have any effect on the generated code.
And the old one’s duly reporting here too!!
Well, the first thing you could try is of course the -maxrregcount argument to nvcc. Try -maxrregcount=32 to limit the number of registers used per thread to 32.
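To see whether the cap actually took effect, it helps to look at the register count and local memory size the compiler ended up with. The following is a minimal sketch, not the kernel from the question: `scale` and `report_kernel_resources` are hypothetical names chosen for illustration. It uses `cudaFuncGetAttributes` from the CUDA runtime; the build line in the comment assumes the file is called example.cu.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the real kernel from the question is not shown.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Build with the cap and let ptxas print per-kernel usage, for example:
//   nvcc -maxrregcount=32 -Xptxas -v -c example.cu
//
// Queries what the compiler allocated for `scale`.
// Returns true on success and fills in the two output parameters.
bool report_kernel_resources(int *num_regs, size_t *local_bytes)
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, scale) != cudaSuccess)
        return false;
    *num_regs = attr.numRegs;           // registers per thread
    *local_bytes = attr.localSizeBytes; // local memory per thread (includes spills)
    return true;
}
```

A local memory size that grows after adding the cap is a sign that registers are being spilled.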
This will probably introduce some register spilling. Spilled registers go to local memory by default, which is served through the L2 cache. To bypass that, you can store a variable in shared memory if it’s not used for a long stretch of code, and then load it back into a register when you need the value again.
I see that the 4 aa_bb variables would fit nicely into 16 bytes (the size of 2 doubles) of shared memory.
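The idea of parking a value in shared memory could look roughly like this. It is only a sketch under assumptions: the kernel, the names `park_in_shared`, `slot`, `parked` and `run_park_example`, and the placeholder loop are all made up, since the original kernel is not shown. It also assumes the parked value differs per thread, so it reserves one slot per thread rather than a fixed 16 bytes per block.

```cuda
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 128

// Hypothetical kernel: `parked` stands in for a value (such as one of the
// aa_bb variables) that is computed early and needed again only much later.
__global__ void park_in_shared(const float *in, float *out, int n)
{
    __shared__ float slot[THREADS_PER_BLOCK];  // one private slot per thread

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    float parked = in[i] * 2.0f;
    slot[threadIdx.x] = parked;   // store it so the register can be reused

    // Placeholder for the long stretch of work that does not need `parked`.
    float acc = 0.0f;
    for (int k = 0; k < 16; ++k)
        acc += in[i] * k;

    // Each thread reads back only its own slot, so no __syncthreads() is needed.
    out[i] = acc + slot[threadIdx.x];
}

// Host-side driver for the sketch; error checking is omitted for brevity.
float run_park_example()
{
    const int n = 2 * THREADS_PER_BLOCK;
    float host_in[n], host_out[n];
    for (int i = 0; i < n; ++i)
        host_in[i] = 1.0f;

    float *dev_in = nullptr, *dev_out = nullptr;
    cudaMalloc(&dev_in, n * sizeof(float));
    cudaMalloc(&dev_out, n * sizeof(float));
    cudaMemcpy(dev_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);

    park_in_shared<<<n / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_in, dev_out, n);

    cudaMemcpy(host_out, dev_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);
    return host_out[n - 1];
}
```

The compiler is free to keep the value in a register anyway, so it is worth comparing the -Xptxas -v output before and after such a change.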
I guess with some reordering of your code things could be optimized further, but I’m having a hard time reading your code.
EDIT: Try inlining the fabs functions. This produces longer code, but also gives the compiler greater freedom to optimize.
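If the fabs calls sit inside small device helper functions, one way to request inlining is the `__forceinline__` qualifier. The sketch below assumes such a helper exists; `abs_diff`, `abs_diff_kernel` and `run_abs_diff_example` are hypothetical names, not taken from the original code.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper. __forceinline__ asks nvcc to always inline it at the
// call site instead of emitting a separate function call.
__device__ __forceinline__ double abs_diff(double a, double b)
{
    return fabs(a - b);
}

__global__ void abs_diff_kernel(const double *a, const double *b, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = abs_diff(a[i], b[i]);
}

// Host-side driver for the sketch; error checking is omitted for brevity.
double run_abs_diff_example()
{
    const int n = 64;
    double host_a[n], host_b[n], host_out[n];
    for (int i = 0; i < n; ++i) {
        host_a[i] = 1.5;
        host_b[i] = 4.0;
    }

    double *dev_a = nullptr, *dev_b = nullptr, *dev_out = nullptr;
    cudaMalloc(&dev_a, n * sizeof(double));
    cudaMalloc(&dev_b, n * sizeof(double));
    cudaMalloc(&dev_out, n * sizeof(double));
    cudaMemcpy(dev_a, host_a, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, host_b, n * sizeof(double), cudaMemcpyHostToDevice);

    abs_diff_kernel<<<1, n>>>(dev_a, dev_b, dev_out, n);

    cudaMemcpy(host_out, dev_out, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_out);
    return host_out[0];
}
```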
You’re using the index variables i and j a lot. Try declaring them as volatile int instead of just int. It forces the compiler to assign them to a register immediately. This works a treat on Compute Capability 1.x, but I can’t guarantee that it will also help in your case.
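As a rough illustration of the volatile trick, here is a sketch with a made-up 2D kernel; `add_2d` and `run_add_2d_example` are hypothetical and only show where the qualifier goes. On other compiler versions or architectures the effect may differ or even make things worse, so the register count should be compared with and without the qualifier.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel showing index variables declared as volatile int.
__global__ void add_2d(const float *a, const float *b, float *c, int width, int height)
{
    volatile int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    volatile int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index

    if (i < height && j < width) {
        int idx = i * width + j;
        c[idx] = a[idx] + b[idx];
    }
}

// Host-side driver for the sketch; error checking is omitted for brevity.
float run_add_2d_example()
{
    const int width = 8, height = 4, n = width * height;
    float host_a[n], host_b[n], host_c[n];
    for (int k = 0; k < n; ++k) {
        host_a[k] = (float)k;
        host_b[k] = 2.0f * k;
    }

    float *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
    cudaMalloc(&dev_a, n * sizeof(float));
    cudaMalloc(&dev_b, n * sizeof(float));
    cudaMalloc(&dev_c, n * sizeof(float));
    cudaMemcpy(dev_a, host_a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, host_b, n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(width, height);  // one block covers the whole 8x4 grid of elements
    add_2d<<<1, block>>>(dev_a, dev_b, dev_c, width, height);

    cudaMemcpy(host_c, dev_c, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return host_c[n - 1];
}
```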