Pipelined Loads

IIRC, the undocumented bit was that registers were assigned to blocks in “pages” of power-of-two size (at least for G80/90/GT200; I seem to recall it was 512).

In the old sheet, the formula was such that the number of registers used was always computed for an even number of warps.
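Just to pin down what that would mean, here is a tiny host-side sketch of the allocation rule as I understand it. Both the 512-register “page” and the round-to-even-warps step are recollections, not documented facts, so treat the constants as guesses.

/* Sketch of the suspected allocation rule: round the block up to an even
 * number of warps, then round the register footprint up to a whole "page".
 * WARP_SIZE is real; REG_PAGE = 512 is only the recollection from above. */
#include <stdio.h>

#define WARP_SIZE 32
#define REG_PAGE  512   /* assumed page size for G80/GT200-era parts */

static unsigned regs_per_block(unsigned regs_per_thread, unsigned threads)
{
    unsigned warps = (threads + WARP_SIZE - 1) / WARP_SIZE; /* warps in block  */
    warps += warps & 1;                                     /* round to even   */
    unsigned raw = regs_per_thread * warps * WARP_SIZE;     /* raw footprint   */
    return ((raw + REG_PAGE - 1) / REG_PAGE) * REG_PAGE;    /* round to a page */
}

int main(void)
{
    /* e.g. 21 regs/thread, 96 threads: 3 warps -> 4 warps -> 2688 -> 3072 */
    printf("%u registers allocated\n", regs_per_block(21, 96));
    return 0;
}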

I read somewhere recently that on 1.x devices the maximum number of registers per thread was something like 128. On 2.x it is 63, so it seems as if Fermi might benefit less from these kinds of optimizations.

Yeah, from my testing it was 128 on 1.x; on 2.x I’ve heard the limit is 64. Normally you don’t have to unroll so much to get these performance gains, and I would argue that Fermi also benefits from this type of optimization.

These are the kinds of things I remember “hearing” too… and it’s hard to remember where that info came from.

I found an excellent post from Sylvain which gives a well-supported upper limit of 124 registers from PTX itself:

This doesn’t mean the real register-per-thread limit isn’t lower, but it does mean it can’t be higher. If it does vary by compute capability (which is possible), then it should be added to the Programming Guide appendix. Sylvain’s 124 registers-per-thread limit should be in there too!

My raytracing kernel uses 256 threads per block with 64 registers per thread, and I found that asking for 84 registers and 192 threads was always much slower on Fermi. This would make sense if Fermi has a 64-register limit…
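In case it helps anyone repeating the comparison, here is roughly how those register budgets get requested; the kernel name and body below are placeholders, not my actual raytracer.

// Sketch only: requesting a register budget like the ones compared above.
// Whole-file cap, e.g. the 64-register case:
//     nvcc -maxrregcount=64 raytrace.cu
// Per-kernel alternative: tell the compiler the block size you intend to
// launch so it can budget registers for that configuration.
__global__ void __launch_bounds__(256) trace_rays(const float4 *rays, float4 *hits)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    hits[i] = rays[i];    // real raytracing work elided
}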

I hope to find some time in the near future to test my raytracing kernel (which does some extra work in addition to the raytracing) on a GTX480 I have waiting. Without a register limit it uses >100 registers, so I am quite curious…

Ok, interesting. I remember reading about these limits in some documentation a long time ago… vague, man :) I thought it was documented in the Best Practices guide or even in the Programming Guide.

So, Vasily, here’s yet another question!

Can ILP still happen in the presence of branching conditionals (masked divergence style) and/or predicated conditionals?

The most common predicate example might be something like:

if (a>b) c+=1.0f;

if (d>e) f+=1.0f;

After thinking about it, my first thought was that you cannot use either type of conditional without halting ILP… the predicate register bit must be resolved before the conditional code proceeds, so instruction execution stalls while waiting… no ILP.

But perhaps the compiler is really sneaky and actually splits the code. In the example above, it could reorder things to do the 2 tests first (each writing a different predicate register) and then put the predicated code after both tests. G80 has 4 predicate registers per thread (I don’t know about GF100), but I don’t know whether those get scoreboard scheduling too, or whether the compiler splits conditional statements into two halves (predicate set and predicated code) and stuffs other code in between where possible.
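To make the question concrete, here is a self-contained version of that pattern (the kernel and buffer names are invented for the example). The two tests are completely independent, so if the compiler really does split “set predicate” from “predicated add”, nothing stops it from issuing both tests back to back and consuming the predicates later.

// Hypothetical example: two independent predicated updates.
__global__ void pred_pair(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = in[4 * i + 0], b = in[4 * i + 1];
    float d = in[4 * i + 2], e = in[4 * i + 3];
    float c = out[2 * i + 0], f = out[2 * i + 1];

    if (a > b) c += 1.0f;   // roughly: setp p0, a, b;  @p0 add c, 1.0f
    if (d > e) f += 1.0f;   // roughly: setp p1, d, e;  @p1 add f, 1.0f (independent)

    out[2 * i + 0] = c;
    out[2 * i + 1] = f;
}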


Question 2, and related to the “does the compiler split predicate tests?” question above.

Does the compiler use a peephole optimizer to try to interleave independent instructions to increase ILP? It would also be similar to what GF104 needs for superscalar execution. But GF104 reviews said “The GF104 compiler in NVIDIA’s drivers will try to organize code to better match GF104’s superscalar abilities, but it’s not critical to the ability”, which implies that such reordering hasn’t been done before.
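Here is the kind of toy I have in mind (names made up): two dependency chains that can be interleaved, whether the compiler schedules them that way or the hardware’s scoreboard just alternates between them at run time.

// Hypothetical toy kernel with two independent dependency chains.
__global__ void two_chains(int *out, int n)
{
    int v0 = threadIdx.x;
    int v1 = threadIdx.x + 1;
    for (int i = 0; i < n; i++) {
        v0 = 123 * v0 + 456;   // chain 0
        v1 = 123 * v1 + 456;   // chain 1, independent of chain 0
    }
    out[threadIdx.x] = v0 + v1;  // keep both chains live past the optimizer
}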

Answering part of my own question: I think the compiler does try to rearrange instructions to improve ILP. This is done in the post-PTX compiler (ptxas), though, so it’s not immediately obvious from the PTX.

Some example CUDA code does a dumb loop:

for (int i=0; i<0x10000000; i++) {
    v=123*v+456;
}

which gets spit out as PTX that follows the C pretty closely:

$Lt_0_5634:
	mul.lo.s32 	%r14, %r12, 123;
	add.s32 	%r12, %r14, 456;
	add.s32 	%r13, %r13, 1;
	mov.u32 	%r15, 268435456;
	setp.ne.s32 	%p2, %r13, %r15;
	@%p2 bra 	$Lt_0_5634;

Notice this PTX is not very ILP friendly: the second instruction depends on the first, the fifth depends on the third and the fourth, and the final branch depends on the fifth.

But this isn’t what the device runs. I used decuda to find the true code (which is less readable).

000038: 103b800d 00000007 label0: mov.b32 $r3, 0x0000007b
000040: 40070011 00000780 mul24.lo.u32.u16.u16 $r4, $r0.lo, $r3.hi
000048: 60060211 00010780 mad24.lo.u32.u16.u16.u32 $r4, $r0.hi, $r3.lo, $r4
000050: 30100811 c4100780 shl.u32 $r4, $r4, 0x00000010
000058: 20018409 00000003 add.b32 $r2, $r2, 0x00000001
000060: 60060001 00010780 mad24.lo.u32.u16.u16.u32 $r0, $r0.lo, $r3.lo, $r4
000068: 308005fd 6c4147c8 set.ne.s32 $p0|$o127, $r2, c1[0x0000]
000070: 20088001 0000001f add.b32 $r0, $r0, 0x000001c8
000078: 10007003 00000280 @$p0.ne bra.label label0

Notice that here the compiler does split up some of the dependent pairs, or at least tries to separate them.

In particular, the creation of the $p0 predicate WAS separated from its use, so predicates are likely scoreboarded just like real registers and therefore eligible for ILP. (Makes sense.) The +456 addition was also pushed well away from the *123 result it depends on, so that move boosts ILP too.
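For anyone who wants to repeat this, a kernel along these lines is enough; the name and the init/store around the loop are my reconstruction, and only the loop body comes from the code above. nvcc -ptx dumps the intermediate code, and the machine code comes from running decuda on the compiled cubin.

// Reconstruction of the test kernel (name and boilerplate are guesses).
__global__ void dumb_loop(int *out)
{
    int v = threadIdx.x;                     // some per-thread starting value
    for (int i = 0; i < 0x10000000; i++) {
        v = 123 * v + 456;
    }
    out[threadIdx.x] = v;                    // store so the loop isn't optimized away
}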
