Right now I have a Tesla C2050 running on OpenSUSE 11.1 with NVIDIA driver 256.35 and CUDA toolkit 3.1, and I am having trouble getting predicated execution working on this card.
According to the PTX ISA 2.1 reference document, page 56, this C code should compile into this PTX code:
C:
if (i < n) j = j + 1;
pseudo-PTX:
setp.lt.s32 p, i, n;
@p add.s32 j, j, 1;
But when I try to use that example it never compiles that way! Here’s a reproducible example:
__global__ void testgpu (unsigned int *pixels, unsigned int x)
{
    unsigned int p = pixels[threadIdx.x];
    #pragma unroll 1
    for (int i = 0; i < 8; i++)
    {
        if (i < x)
            p = p + 1;
    }
    pixels[threadIdx.x] = p;
}
I have only included the body of the loop in the PTX output (compiled with nvcc -arch sm_20 --ptx test.cu):
$Lt_0_2562:
    .pragma "nounroll";
    //<loop> Loop body line 24, nesting depth: 1, iterations: 8
    .loc 28 8 0
    add.u32 %r5, %r2, 1;
    setp.lt.u32 %p1, %r4, %r3;
    selp.u32 %r2, %r5, %r2, %p1;
    add.s32 %r4, %r4, 1;
    mov.u32 %r6, 8;
    setp.ne.s32 %p2, %r4, %r6;
    @%p2 bra $Lt_0_2562;
Is there something I am missing? One of the projects I am working on would see a 5-10% speedup if I could get this working, and for that project every little bit counts.
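For reference, here is a minimal sketch of the form I am after, written with inline PTX so that the guarded add from the ISA document appears in the generated PTX. It assumes the toolkit in use accepts inline asm() statements in device code (I have not confirmed this for 3.1). The kernel name testgpu_predicated and the host helper run_testgpu_predicated are hypothetical names made up for this illustration. Note that this only controls the PTX; ptxas still decides what the final machine code looks like, so it may or may not end up predicated on the hardware.

```cuda
#include <cuda_runtime.h>

// Same loop as testgpu, but the conditional add is spelled out as inline PTX
// so the emitted PTX contains a guarded (@q) add rather than a selp.
__global__ void testgpu_predicated(unsigned int *pixels, unsigned int x)
{
    unsigned int p = pixels[threadIdx.x];
    #pragma unroll 1
    for (unsigned int i = 0; i < 8; i++)
    {
        asm("{\n\t"
            ".reg .pred q;\n\t"          // predicate register local to this asm block
            "setp.lt.u32 q, %1, %2;\n\t" // q = (i < x), unsigned compare
            "@q add.u32 %0, %0, 1;\n\t"  // p = p + 1 only where q is true
            "}"
            : "+r"(p)
            : "r"(i), "r"(x));
    }
    pixels[threadIdx.x] = p;
}

// Hypothetical host helper: runs one block of n threads on a copy of
// host_pixels and copies the results back into host_pixels.
cudaError_t run_testgpu_predicated(unsigned int *host_pixels, int n, unsigned int x)
{
    unsigned int *dev = 0;
    size_t bytes = n * sizeof(unsigned int);
    cudaError_t err = cudaMalloc((void **)&dev, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMemcpy(dev, host_pixels, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        testgpu_predicated<<<1, n>>>(dev, x);
        // This copy waits for the kernel to finish and reports its errors.
        err = cudaMemcpy(host_pixels, dev, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    return err;
}
```

Compiling this with the same nvcc -arch sm_20 --ptx command should show the setp/@q add pair inside the loop body, which makes it possible to compare against the selp version above.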