I’ve got some very interesting results while experimenting with and compiling different kernels.
My kernel begins with a series of repetitive blocks of statements (i.e. a manually unrolled loop) like this:
t = q + g_nArray1[11];
q = __umulhi( t, g_nConst );
r = t - q*g_nLength;
aBlock[5] = g_wArray2[ r ] << 16;
Let’s say there are 10 such blocks. After that, the aBlock array is processed and the results are written to memory.
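For context, here is a stripped-down sketch of the overall structure. The declarations, array sizes, kernel signature and the processing at the end are placeholders, not the real code; only the shape of the repeated blocks matters:

__constant__ unsigned int   g_nConst, g_nLength;
__constant__ unsigned int   g_nArray1[32];      // placeholder size
__constant__ unsigned short g_wArray2[256];     // placeholder size

__global__ void MyKernel( unsigned int *pOut, unsigned int nStart )
{
    unsigned int t, r, q = nStart;
    unsigned int aBlock[10];

    // block 0 (the real kernel has 10 of these, each with a different index)
    t = q + g_nArray1[0];
    q = __umulhi( t, g_nConst );        // upper 32 bits of the 64-bit product
    r = t - q*g_nLength;
    aBlock[0] = g_wArray2[ r ] << 16;

    // block 1
    t = q + g_nArray1[1];
    q = __umulhi( t, g_nConst );
    r = t - q*g_nLength;
    aBlock[1] = g_wArray2[ r ] << 16;

    // ... 8 more blocks like the above ...

    // afterwards aBlock is processed and the result is written out
    // (placeholder processing; the real kernel does more work here)
    pOut[ blockIdx.x*blockDim.x + threadIdx.x ] = aBlock[0] ^ aBlock[1];
}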
To maximize performance I need to keep register usage below 16, and that should be more than enough. But the compiled kernel requires more than 20 registers! And adding one more block of code increases total register usage by 2 or 3 (although I see no reason why it should).
Now, the funny part.
If I change the code above to this:
if( g_nTrue )
{
    t = q + g_nArray1[11];
    q = __umulhi( t, g_nConst );
    r = t - q*g_nLength;
    aBlock[5] = g_wArray2[ r ] << 16;
}
then register usage becomes much lower (only 12 registers with 10 such blocks). g_nTrue is a constant which is set to 1.
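In case it matters: the way I have set it up, g_nTrue lives in constant memory and is written from the host before the launch, roughly along these lines (sketch only):

__constant__ int g_nTrue;

// host side, before the kernel launch:
int nOne = 1;
cudaMemcpyToSymbol( g_nTrue, &nOne, sizeof(int) );

Since the compiler cannot know at compile time that g_nTrue is 1, the branch stays in the code, which is presumably what changes the register allocation.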
Am I missing something, or is there something wrong with the register allocation algorithm in ptxas?
Unfortunately I can’t post the full code here, but if someone from NVIDIA is interested I can email it; just let me know.
P.S. And one more quick question for NVIDIA employees: how can I get information about our Register Developer Application status? We filed two applications almost a month ago and haven’t received any reply or feedback. The ‘Site Feedback’ form in the Developer Zone also seems to forward messages to /dev/null…
Very interesting: so in this case flow control actually lowers the number of registers used. I’ve also seen that repeating the same piece of code sometimes increases the register count, which is kind of weird.
Although it might make sense if there is instruction re-ordering going on. For example, on Intel architectures it is good to group instructions that access memory together, especially if they are independent. The conditional statement makes such regrouping impossible. Are you seeing any actual speedups?
In my experience, most tricks to lower the number of registers used only slow things down, even though the occupancy can increase.
Yes, I see a significant speedup.
If I set the block size to 512, the speedup is about 15-20%.
If I keep a block size of 256, the speedup is more than 25%. That is one more point which is not very easy to understand…
The speedup with 256 threads per block is probably better because you’re achieving higher occupancy: 512 threads/block allows only 512 active threads per multiprocessor, while 256 allows 768 (768 is the max). Higher occupancy helps with latency hiding.
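To spell out the arithmetic (768 is the per-multiprocessor thread limit on G80-class hardware; this ignores register and shared memory limits, which can reduce the numbers further):

int nMaxThreadsPerSM = 768;                  // max resident threads per multiprocessor
int nBlocksAt512 = nMaxThreadsPerSM / 512;   // = 1 block  -> 512 resident threads -> ~67% occupancy
int nBlocksAt256 = nMaxThreadsPerSM / 256;   // = 3 blocks -> 768 resident threads -> 100% occupancy

Of course register usage can lower these numbers as well, which is exactly why getting the register count down matters.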
Hmm, interesting… Regarding your Register Developer Application status: did you use Firefox? I believe there is a bug there. Try filling in the application form online from IE. I know it’s sad, but that should work this time…