How to reduce register usage

Haha! “schedsucks” is a great name for this test case.

To me it is clear where the need for 29 registers comes from. It has to keep all
the values of A[1] to A[29] in registers - you're lucky it did not move the entire array
to local memory. Until now I thought large arrays always went to local memory.
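
For reference, here is a minimal sketch of the kind of fully unrolled pattern we are talking about; the actual kernel is not quoted in this thread, so the name and exact shape are assumptions:

// Hypothetical reconstruction of the "schedsucks" pattern: a small local
// array is filled from global memory and only consumed afterwards. With the
// loops fully unrolled, ptxas keeps A[1]..A[29] live in registers at once.
__global__ void schedsucks(int *result)
{
    int A[30];
    #pragma unroll
    for (int i = 1; i < 30; ++i)
        A[i] = result[i];        // all the loads end up grouped together
    #pragma unroll
    for (int i = 1; i < 30; ++i)
        result[i] ^= A[i];       // the uses come much later

    // The "merged" second version folds the two loops into one, after which
    // the compiler can eliminate A entirely (result[i] ^= result[i]).
}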

I have no clue why it would use twice the number of registers that I think it should.
So 58 is definitely not cool.

Christian

I have no idea either… I just ran it through Ocelot (the first version), and I get 31 registers. I had thought that ptxas was doing something like linear scan allocation (same as Ocelot), but maybe it is something even simpler? And Ocelot's version doesn't even do back-to-back spill/fill elimination; it is really as basic as you can get.

Edit: I get 2 registers for the second merged version using Ocelot…

Well if the name is meant to imply that the instruction scheduler sucks, you aren’t really giving it a chance with that example. It requires you to hoist loads before previous stores, which in general violates sequential consistency from the perspective of a single thread.

That is one of the hardest optimizations to do in a compiler: you have to prove that result[i] never aliases result[j] for any j in [i+1, 30), and result is just a pointer. It is probably possible in this example (both the load and store reference the same element for a given iteration), but the kind of analysis required to do that proof automatically is very involved.
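
A tiny illustration of why the hoist is hard in general and easy once the indices are compile-time constants (hypothetical kernel, just for the aliasing point):

// With an arbitrary pointer, the store below might write the very word the
// following load reads, so the load may not be hoisted above the store
// unless the compiler can prove the indices differ. After full unrolling
// both indices are constants, so the proof is trivial.
__global__ void hoist_example(int *result)
{
    result[3] = 0;          // store to result[3]
    int x = result[7];      // load from result[7]: movable above the store
                            // only because 3 != 7 is known at compile time
    result[7] = x ^ x;
}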

Actually I think I know why.

ptxas tries to group memory fetches to cover latency by pipelining. It’s actually quite smart, or at least it tries to be.

See my paper in this [post=“1059596”]post[/post]. In all compiled examples memory fetches are grouped.

It does massively break occupancy at times, but the idea is good. Too bad there's no control over it.

Edit: quote added

Edit: Sorry I misread your post, you are right, of course.

Edit2: Actually I didn’t misread it, you edited it :P. Well, I’m less sorry in that case, but you’re still right.

Yeah, I had to go back and look at the example to realize that result was the input, not A. Sorry for the confusion; the edit feature makes it less likely that I will proofread before posting :)

Thanks for the link. That was a very interesting read. I'm not sure the behavior that you observed can be explained by covering pipeline latency, though. It seems like the latency of a load should dwarf that of an integer instruction. In your optimized example there are 6 integer instructions to cover the latency of the last load, compared to 3 in the broken-pipelining example. It almost seems like the loads can issue back-to-back, as in an out-of-order processor, unless a later instruction depends on the loaded register.

Could you run the following experiments?

// A optimized with stalls
add.b32   $r0, s[0x0010], 0x00000004
add.b32   $r1, s[0x0010], 0x0000000c
mov.b32   $r2, s[0x0010]
add.b32   $r3, s[0x0010], 0x00000008
mov.u32   $r0, g[$r0]
mov.u32   $r10, $r0
mov.u32   $r1, g[$r1]
mov.u32   $r11, $r1
mov.u32   $r2, g[$r2]
mov.u32   $r12, $r2
mov.u32   $r3, g[$r3]
mov.u32   $r12, $r3
shl.u32   $r0, $r0, 0x00000006
shl.u32   $r1, $r1, 0x00000006
add.b32   $r2, $r0, $r2
add.b32   $r0, s[0x0010], 0x00000010
add.b32   $r3, $r1, $r3
add.u32   $r3, $r2, $r3
mov.u32   g[$r0], $r3

// B reduced pipelining with stalls
add.b32   $r0, s[0x0010], 0x00000004
mov.u32   $r2, g[$r0]
mov.u32   $r12, $r2
shl.u32   $r2, $r2, 0x00000006
mov.u32   $r1, g[$r0]
mov.u32   $r11, $r1
add.b32   $r0, $r2, $r1
add.b32   $r1, s[0x0010], 0x0000000c
mov.u32   $r2, g[$r1]
mov.u32   $r12, $r2
shl.u32   $r2, $r2, 0x00000006
add.b32   $r1, s[0x0010], 0x00000008
mov.u32   $r3, g[$r1]
mov.u32   $r13, $r3
add.b32   $r1, $r2, $r3
add.b32   $r3, $r0, $r1
add.b32   $r2, s[0x0010], 0x00000010
mov.u32   g[$r2], $r3

I am curious whether the difference is due to pipelining (in which case A should outperform B) or due to issuing loads back to back (in which case they should perform similarly).

I never wrote it can be explained by “covering pipeline latency” ;)

I wrote that it can be explained by covering memory latency by pipelining accesses to it. That is: by pipelining memory fetches, so they are all in flight at the same time.

Your two examples do break pipelining (and they perform equally, 3200+ cycles) - the next instruction after each fetch depends on it, so one memory fetch must complete before the next one starts. In the examples optimized for maximum pipelining, all 4 memory fetches are in flight at the same time (the kernel does not block on them).

Edit: actually I think I need to rewrite/reword my article, as it’s the second time it caused that exact same confusion ;)
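
A source-level analogue of the same idea (names are made up, and the serializing dependence here comes from pointer chasing rather than from the dependent moves in the listings above):

// Independent loads: all four addresses are known before any fetch issues,
// so all four can be in flight at the same time and the thread pays roughly
// one memory latency.
__device__ int gather4(const int *p, int a, int b, int c, int d)
{
    return p[a] + p[b] + p[c] + p[d];
}

// Dependent loads: each address comes from the previous load, so the fetches
// cannot overlap no matter how the compiler schedules them - roughly four
// memory latencies back to back.
__device__ int chase4(const int *p)
{
    int i = p[0];
    i = p[i];
    i = p[i];
    i = p[i];
    return i;
}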

Sorry, you’re right, I misinterpreted what you wrote.

Thanks for running the code. That's a very interesting (undocumented?) capability of the hardware that I would not have expected (why do you need to support concurrent accesses from the same thread when different warps can issue concurrent accesses as well?). It means that Tesla has to track per-register dependencies in order to determine that the second load is not waiting for an address produced by the previous load, and that it has to have something like MSHRs on a per-thread/warp basis to handle multiple outstanding memory requests from the same thread… That is walking a fine line between in-order and out-of-order execution.

So from a register allocation perspective, you want to schedule code such that independent loads are scheduled back-to-back until you run out of MSHRs or completely cover the memory latency, and then start scheduling based on def-use distance to reduce the register pressure. It seems like ptxas is doing the former but not the latter…

It also occurred to me that I might have screwed up the benchmarks, because the fetches are from consecutive addresses, so they could be coalesced (hence a single memory transaction).

But if I change the addresses so that they definitely can't coalesce, the execution time is still low (~1300 cycles) in the "pipelined" version and high in the "broken" one.

So that’s a nice piece of hardware and a bit less nice piece of a compiler. ;)

Edit: just saw your edit.

That's an easy question. Because you can't run enough warps to fully cover memory latency. Memory latency is 400+ cycles and instruction latency is 4 cycles, so you need 400/4 = 100+ warps per SM to fully cover gmem latency. GT200 can only have 8 active blocks per SM, and a block can be at most 512 threads (and even that is unlikely). So the ceiling is 8 * 512 / 32 = 128 warps (active blocks * block size / warp size). That's barely the 100+ needed.

It's not that strange: in the second case the compiler realizes that you don't need the A variable at all, since you can simply write result[i] ^= result[i]…

Your array does not go into local memory because the array indices are known at compile time (thanks to the loop unrolling); without the loop unrolling the indices are not known at compile time and the array goes to local memory (i.e. global memory, which is really slow).
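
A small sketch of that contrast, with made-up kernel names (this is the behavior you would typically expect from nvcc/ptxas, not something guaranteed by this exact code):

// Fully unrolled: every index into A is a compile-time constant, so the
// array can be promoted to registers.
__global__ void unrolled(int *result)
{
    int A[8];
    #pragma unroll
    for (int i = 0; i < 8; ++i)
        A[i] = result[i];
    int x = 0;
    #pragma unroll
    for (int i = 0; i < 8; ++i)
        x ^= A[i];
    result[0] = x;
}

// Run-time index: A[j] cannot be mapped to a fixed register, so the array
// is kept in local memory (off-chip, and slow on these parts).
__global__ void dynamic_index(int *result, int j)
{
    int A[8];
    for (int i = 0; i < 8; ++i)
        A[i] = result[i];
    result[0] = A[j & 7];
}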

But covering the entire memory latency on a single core is overly ambitious. If you do that, then you can completely saturate global memory bandwidth with ~3 out of 30 multiprocessors on a GT200 (one 128-byte transaction per slow clock, ~1.5 GHz fast clock, 4 fast clocks per slow clock, 32 bytes per fast clock, i.e. 48 GB/s per core). Are we saying that a GT200 could scale up to ~1.5 TB/s through global memory if they actually made DRAM/off-chip interconnects that fast?

Besides, if you do that, you need 100+ MSHRs to keep track of the outstanding loads, and MSHRs usually have only 4 or 8 entries (not 100+) even on aggressive out-of-order CPUs (the Pentium 4 had 8), because they require a fully associative search every time a load is issued and every time a load completes. They could probably get away with separate MSHR banks per warp (from your example they need at least 4 entries per warp, since you issued 4 loads in a row), which would bring down the cost of the fully associative search, but that is still a LOT of hardware (~1 KB of registers, at least 4 comparators, etc.).

Why not just hide the additional latency by letting other warps do some address computation, floating point, etc. if you are going to be constrained by bandwidth anyway? That would let the instruction scheduler interleave those loads; in your example you could probably get at least 2-3 instructions between each load. In that case you would only need 100 instructions / (3-4) ~= 25-33 warps (or 800-1024 threads) to cover the latency. Increase the compute-to-memory ratio further and you need even fewer warps/threads. As a side effect it would reduce the register pressure as well.