Strange register allocation a new way to keep register usage low =)

AndreiB · November 9, 2007, 11:42am

I’ve got very interesting results when playing with and compiling different kernels.

My kernel begins with series of repetitive blocks of statements (i.e. unrolled loop) like this:

t = q + g_nArray1[11];

q = __umulhi( t, g_nConst );

r = t - q*g_nLength;

aBlock[5] =  g_wArray2[ r ] << 16;

Let’s say there are 10 such blocks. After that aBlock array is processed and results are written to memory.

To maximize performance I need to keep register usage below 16 and that number seems to be more than enough. But compiled kernel requires above 20 registers! And adding one more block of code increases total register usage by 2 or 3 (although I see no reason for this).

Now, the funny part.

If I change code above to this one:

if( g_nTrue )

{

    t = q + g_nArray1[11];

    q = __umulhi( t, g_nConst );

    r = t - q*g_nLength;

    aBlock[5] =  g_wArray2[ r ] << 16;

}

then register usage becomes much lower (only 12 registers for 10 such pieces of code). g_nTrue is constant which is set to 1.

Am I missing something or there’s something wrong with register allocation algorithm in ptxas?

Unfortunately I can’t post full code here but if someone from NVIDIA is interested I can email it, just let me know.

P.S> And one more quick question for NVIDIA employees: how can I get information about our Register Developer Application status? We’ve filed two applications almost a month ago and haven’t got any reply or feedback. ‘Site Feedback’ form in Developer’s Zone also seems to forward messages to /dev/null…

wumpus · November 9, 2007, 12:10pm

Very interesting, so in this case flow control actually lowers the amount of registers. I’ve also seen that repeating the same piece of code sometimes increases the number of registers, which is kind of weird.

Altough it might make sense if there is instruction re-ordering going on. For example on intel architectures it is good to group instructions accessing memory together, especially if they are independent. This conditional statement makes such regrouping impossible. Are you seeing any actual speedups?

In my experience, most tricks to lower the amount of used registers only slow down things, even though the occupancy can increase.

AndreiB · November 9, 2007, 12:23pm

Yes, I see significant speedup.
If I set block size to 512 then speedup is about 15-20%.
If I keep block size of 256 then speedup is more than 25%. This is one more point which is not very easy to understand…

BTW, wumpus, thanks a lot for decuda. Great tool.

asadafag · November 9, 2007, 12:53pm

I had some experiences with

if(threadIdx.x<0)__syncthreads();

This may serve as an instruction reordering fence. It once managed to squeeze a 36 regs kernel down to 32.

paulius · November 9, 2007, 10:48pm

The speedup with 256 threads per block is better probably because you’re achieving higher occupancy - 512 treads/block allow only 512 active threads per multiprocessor, while 256 allow 768 (768 is the max). Higher occupancy helps with latency hiding.

Paulius

Morph208 · November 12, 2007, 1:43am

Hum interesting… Regarding your Register Developer Application status, did you use Firefox? Because I believe there is a bug. Try to fill in the application form online from IE. I know it’s sad, but that should work this time…

AndreiB · November 12, 2007, 5:33am

Nope, I’ve used IE7.
Strange bug, anyway )

Topic		Replies	Views
Register Allocation Woes Strange register allocation CUDA Programming and Performance	4	2044	June 6, 2008
Use of register An odd problem CUDA Programming and Performance	12	2291	August 12, 2010
Increasing register usage without decreasing occupancy drops speed dramatically CUDA Programming and Performance	3	964	May 24, 2011
Register allocator overload CUDA Programming and Performance	2	3210	February 10, 2009
Forcing register reuse in a loop CUDA Programming and Performance	9	3213	March 6, 2010
Why should the number of registers be kept low CUDA Programming and Performance	5	1018	December 14, 2011
Register demand CUDA Programming and Performance	2	2717	September 9, 2009
too large kernel solutions CUDA Programming and Performance	11	4281	September 2, 2008
Reducing the number of registers To improve occupancy CUDA Programming and Performance	5	4667	April 5, 2007
how to reduce the number of registers CUDA Programming and Performance	5	8902	July 8, 2010

Strange register allocation a new way to keep register usage low =)

Related topics