Predicated Execution Cannot replicate example in PTX ISA 2.1 reference document...

Right now I have a Tesla C2050 running on OpenSUSE 11.1, nV driver 256.35, and toolkit 3.1…and I am having trouble getting predicated execution working on these cards.

According to the PTX ISA 2.1 reference document, page 56, this C code should compile into this PTX code:

C:

if (i < n) j = j + 1;

pseudo-PTX:

	setp.lt.s32 p, i, n;

@p  add.s32 j, j, 1;

But when I try to use that example it never compiles that way! Here’s a reproducible example:

__global__ void testgpu (unsigned int *pixels, unsigned int x)

{

		unsigned int p = pixels[threadIdx.x];

		#pragma unroll 1

		for (int i = 0; i < 8; i++)

		{

				if (i < x)

						p = p + 1;

		}

		pixels[threadIdx.x] = p;

}

I have only included the body of the loop in the PTX output (compiled with nvcc -arch sm_20 --ptx test.cu):

$Lt_0_2562:

		.pragma "nounroll";

 //<loop> Loop body line 24, nesting depth: 1, iterations: 8

		.loc	28	  8	   0

		add.u32		 %r5, %r2, 1;

		setp.lt.u32	 %p1, %r4, %r3;

		selp.u32		%r2, %r5, %r2, %p1;

		add.s32		 %r4, %r4, 1;

		mov.u32		 %r6, 8;

		setp.ne.s32	 %p2, %r4, %r6;

		@%p2 bra		$Lt_0_2562;

Is there something that I am missing here? One of the projects I am working on here would experience a 5-10% speedup if I could get this working, and for this every little bit counts.

It isn’t doing predicated execution, but it is getting rid of the branch in the inner loop using if-conversion. See the selp instruction in your generated code. It should be just as fast as a predicated add.

Yeah, I see where it does that. Add one to a register, set predicate, and then select either the register prior to the add or after the add to store. Problem is that makes up three PTX instructions.

Predicated execution on a single add instruction is two PTX instructions (predicate evaluation, execution). That means it takes 50% longer to run the code if each PTX instruction maps to a single device instruction (it should for this).

I have always thought that nvcc was not very aggressive about generating predicated code and that it would be simple to convert all non-looping control flow into predicate instructions, but like so many other things, I never have time to do it.

However, very rarely is the performance of code significantly affected by a few extra non-memory instructions. You are very likely to have a pipe-line stall due to a register dependency, instruction cache misalignment problem, etc when you get down to this level.

Would it be possible to edit the PTX manually, and see if you actually get a speedup using predication. My gut feeling is that it won’t matter at all, especially if your code is not completely compute bound. If you can actually demonstrate that it is faster using predication, you might get some other people interested in writing if-conversion optimization passes for PTX.

Well for what we are writing, it does matter and it does appear that the run time will be impacted by a significant amount.

I’m not sure if I can hack-up the PTX; I will give it a try. Even if it works though, it is only a temporary fix and not something we can deploy…not to mention I cannot get the hack to work yet anyway.

If I knew why nVidia’s own example does not compile as the document claims it should, then fixing my code wouldn’t take more than a few minutes.