How to reduce register usage

Haha! “schedsucks” is a great name for this test case.

To me it is clear where the need for 29 registers comes from. It has to keep all
the values of A[1] to A[29] in registers - you're lucky it did not move the entire array
to local memory. Until now I thought large arrays always went to local memory.
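
For reference, here is a minimal sketch of the kind of fully unrolled pattern we are talking about; the actual kernel is not quoted in this thread, so the name and exact shape are assumptions:

// Hypothetical reconstruction of the "schedsucks" pattern: a small local
// array is filled from global memory and only consumed afterwards. With the
// loops fully unrolled, ptxas keeps A[1]..A[29] live in registers at once.
__global__ void schedsucks(int *result)
{
    int A[30];
    #pragma unroll
    for (int i = 1; i < 30; ++i)
        A[i] = result[i];        // all the loads end up grouped together
    #pragma unroll
    for (int i = 1; i < 30; ++i)
        result[i] ^= A[i];       // the uses come much later

    // The "merged" second version folds the two loops into one, after which
    // the compiler can eliminate A entirely (result[i] ^= result[i]).
}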

I have no clue why it would use twice the number of registers that I think it should.
So 58 is definitely not cool.

Christian

I have no idea either… I just ran it through Ocelot (the first version), and I get 31 registers. I had thought that ptxas was doing something like linear scan allocation (same as Ocelot), but maybe it is something even simpler? And Ocelot's version doesn't even do back-to-back spill/fill elimination; it is really as basic as you can get.

Edit: I get 2 registers for the second merged version using Ocelot…

Well if the name is meant to imply that the instruction scheduler sucks, you aren’t really giving it a chance with that example. It requires you to hoist loads before previous stores, which in general violates sequential consistency from the perspective of a single thread.

That is one of the hardest optimizations to do in a compiler: you have to prove that result[i] never aliases result[j] for any j in [i+1, 30), and result is just a pointer. It is probably possible in this example (both the load and store reference the same element for a given iteration), but the kind of analysis required to do that proof automatically is very involved.
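
A tiny illustration of why the hoist is hard in general and easy once the indices are compile-time constants (hypothetical kernel, just for the aliasing point):

// With an arbitrary pointer, the store below might write the very word the
// following load reads, so the load may not be hoisted above the store
// unless the compiler can prove the indices differ. After full unrolling
// both indices are constants, so the proof is trivial.
__global__ void hoist_example(int *result)
{
    result[3] = 0;          // store to result[3]
    int x = result[7];      // load from result[7]: movable above the store
                            // only because 3 != 7 is known at compile time
    result[7] = x ^ x;
}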

Actually I think I know why.

ptxas tries to group memory fetches to cover latency by pipelining. It’s actually quite smart, or at least it tries to be.

See my paper in this [post=“1059596”]post[/post]. In all compiled examples memory fetches are grouped.

It does massively break occupancy at times, but the idea is good. Too bad there's no control over it.

Edit: quote added

Edit: Sorry I misread your post, you are right, of course.

Edit2: Actually I didn’t misread it, you edited it :P. Well, I’m less sorry in that case, but you’re still right.

Yeah, I had to go back and look at the example to realize that result was the input, not A. Sorry for the confusion; the edit feature makes it less likely that I will proofread before posting :)

Thanks for the link. That was a very interesting read. I'm not sure the behavior that you observed can be explained by covering pipeline latency, though. It seems like the latency of a load should dwarf that of an integer instruction. In your optimized example there are 6 integer instructions to cover the latency of the last load, compared to 3 in the broken-pipelining example. It almost seems like the loads can issue back-to-back, as in an out-of-order processor, unless a later instruction depends on the loaded register.

Could you run the following experiments?

// A optimized with stalls
add.b32   $r0, s[0x0010], 0x00000004
add.b32   $r1, s[0x0010], 0x0000000c
mov.b32   $r2, s[0x0010]
add.b32   $r3, s[0x0010], 0x00000008
mov.u32   $r0, g[$r0]
mov.u32   $r10, $r0
mov.u32   $r1, g[$r1]
mov.u32   $r11, $r1
mov.u32   $r2, g[$r2]
mov.u32   $r12, $r2
mov.u32   $r3, g[$r3]
mov.u32   $r12, $r3
shl.u32   $r0, $r0, 0x00000006
shl.u32   $r1, $r1, 0x00000006
add.b32   $r2, $r0, $r2
add.b32   $r0, s[0x0010], 0x00000010
add.b32   $r3, $r1, $r3
add.u32   $r3, $r2, $r3
mov.u32   g[$r0], $r3

// B reduced pipelining with stalls
add.b32   $r0, s[0x0010], 0x00000004
mov.u32   $r2, g[$r0]
mov.u32   $r12, $r2
shl.u32   $r2, $r2, 0x00000006
mov.u32   $r1, g[$r0]
mov.u32   $r11, $r1
add.b32   $r0, $r2, $r1
add.b32   $r1, s[0x0010], 0x0000000c
mov.u32   $r2, g[$r1]
mov.u32   $r12, $r2
shl.u32   $r2, $r2, 0x00000006
add.b32   $r1, s[0x0010], 0x00000008
mov.u32   $r3, g[$r1]
mov.u32   $r13, $r3
add.b32   $r1, $r2, $r3
add.b32   $r3, $r0, $r1
add.b32   $r2, s[0x0010], 0x00000010
mov.u32   g[$r2], $r3

I am curious whether the difference is due to pipelining (in which case A should outperform B) or due to issuing loads back to back (in which case they should perform similarly).

I never wrote it can be explained by “covering pipeline latency” ;)

I wrote that it can be explained by covering memory latency by pipelining accesses to it. That is: by pipelining memory fetches, so they are all in flight at the same time.

Your two examples do break pipelining (and they perform equally, 3200+ cycles) - the next instruction after each fetch depends on it, so one memory fetch must complete before the next one starts. In the examples optimized for maximum pipelining, all 4 memory fetches are in flight at the same time (the kernel does not block on them).

Edit: actually I think I need to rewrite/reword my article, as it’s the second time it caused that exact same confusion ;)
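
A source-level analogue of the same idea (names are made up, and the serializing dependence here comes from pointer chasing rather than from the dependent moves in the listings above):

// Independent loads: all four addresses are known before any fetch issues,
// so all four can be in flight at the same time and the thread pays roughly
// one memory latency.
__device__ int gather4(const int *p, int a, int b, int c, int d)
{
    return p[a] + p[b] + p[c] + p[d];
}

// Dependent loads: each address comes from the previous load, so the fetches
// cannot overlap no matter how the compiler schedules them - roughly four
// memory latencies back to back.
__device__ int chase4(const int *p)
{
    int i = p[0];
    i = p[i];
    i = p[i];
    i = p[i];
    return i;
}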

Sorry, you’re right, I misinterpreted what you wrote.

Thanks for running the code. That's a very interesting (undocumented?) capability of the hardware that I would not have expected (why do you need to support concurrent accesses from the same thread when different warps can issue concurrent accesses as well?). It means that Tesla has to track per-register dependencies in order to determine that the second load is not waiting for an address produced by the previous load, and that it has to have something like MSHRs on a per-thread/warp basis to handle multiple outstanding memory requests from the same thread… That is walking a fine line between in-order and out-of-order execution.

So from a register allocation perspective, you want to schedule code such that independent loads are scheduled back-to-back until you run out of MSHRs or completely cover the memory latency, and then start scheduling based on def-use distance to reduce the register pressure. It seems like ptxas is doing the former but not the latter…

It also occurred to me that I might have screwed up the benchmarks, because the fetches are from consecutive addresses, so they could be coalesced (hence a single memory transaction).

But if I change the addresses so that they definitely can't coalesce, the execution time is still low (~1300 cycles) in the "pipelined" version and high in the "broken" one.

So that’s a nice piece of hardware and a bit less nice piece of a compiler. ;)

Edit: just saw your edit.

That's an easy question. Because you can't run enough warps to fully cover memory latency. Memory latency is 400+ cycles and instruction latency is 4 cycles, so you need 400/4 = 100+ warps per SM to fully cover gmem latency. GT200 can only have 8 active blocks per SM, and a block can be at most 512 threads (and even that is unlikely). So the ceiling is 8 * 512 / 32 = 128 warps (active blocks * block size / warp size). That's barely the 100+ needed.

It's not that strange: in the second case the compiler realizes that you don't need the A variable at all, since you can simply write result[i] ^= result[i]…

Your array does not go into local memory because the array indices are known at compile time (thanks to the loop unrolling); without the loop unrolling the indices are not known at compile time and the array goes to local memory (i.e. global memory, which is really slow).
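
A small sketch of that contrast, with made-up kernel names (this is the behavior you would typically expect from nvcc/ptxas, not something guaranteed by this exact code):

// Fully unrolled: every index into A is a compile-time constant, so the
// array can be promoted to registers.
__global__ void unrolled(int *result)
{
    int A[8];
    #pragma unroll
    for (int i = 0; i < 8; ++i)
        A[i] = result[i];
    int x = 0;
    #pragma unroll
    for (int i = 0; i < 8; ++i)
        x ^= A[i];
    result[0] = x;
}

// Run-time index: A[j] cannot be mapped to a fixed register, so the array
// is kept in local memory (off-chip, and slow on these parts).
__global__ void dynamic_index(int *result, int j)
{
    int A[8];
    for (int i = 0; i < 8; ++i)
        A[i] = result[i];
    result[0] = A[j & 7];
}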

But covering the entire memory latency on a single core is overly ambitious. If you do that, then you can completely saturate global memory bandwidth with ~3 out of 30 multiprocessors on a GT200 (one 128-byte transaction per slow clock, ~1.5 GHz fast clock, 4 fast clocks per slow clock, 32 bytes per fast clock, i.e. 48 GB/s per core). Are we saying that a GT200 could scale up to ~1.5 TB/s through global memory if they actually made DRAM/off-chip interconnects that fast?

Besides, if you do that, you need 100+ MSHRs to keep track of the outstanding loads, and MSHRs usually have only 4 or 8 entries (not 100+) even on aggressive out-of-order CPUs (the Pentium 4 had 8), because they require a fully associative search every time a load is issued and every time a load completes. They could probably get away with separate MSHR banks per warp (from your example they need at least 4 entries per warp, since you issued 4 loads in a row), which would bring down the cost of the fully associative search, but that is still a LOT of hardware (~1 KB of registers, at least 4 comparators, etc.).

Why not just hide the additional latency by letting other warps do some address computation, floating point, etc. if you are going to be constrained by bandwidth anyway? That would let the instruction scheduler interleave those loads; in your example you could probably get at least 2-3 instructions between each load. In that case you would only need 100 instructions / (3-4) ~= 25-33 warps (or 800-1024 threads) to cover the latency. Increase the compute-to-memory ratio further and you need even fewer warps/threads. As a side effect it would reduce the register pressure as well.