Can an if statement be optimized away when it is in a loop that is unrolled?

If I have time, I will write up some test code later and read the generated assembly to check this out myself. But I was wondering whether someone around here already knows the answer to my question and could save me the trouble.

Let’s say you have a for loop whose iteration count can easily be determined at compile time, and that there is an if statement in that loop which only executes on the last iteration. When you throw a “#pragma unroll” in front of the loop, will the compiler optimize the if statement away? It would look something like this:

#pragma unroll
for (int a = 0; a < 15; a++)
{
	//Do some real but uninteresting work here.

	if (a == 14)
	{
		//Do some more work here.
	}
}

Or, more interestingly, will the if statement get optimized away if the loop is part of a template?

template <int count> __device__ void someFunction(dataType someParameter)
{
	#pragma unroll
	for (int a = 0; a < count; a++)
	{
		//Do some real but uninteresting work here.

		if (a == count - 1)
		{
			//Do some more work here.
		}
	}
}

I have such a for loop in a program I am working on. The loop gets executed many millions of times, so it is important that I do whatever I can to optimize it. It would make sense to me that the if statement gets optimized away, but I am stacking three levels of compiler optimization on top of each other (template instantiation, loop unrolling, and eliminating the branch) and I don’t know whether the CUDA compiler is that smart. Does anyone have experience with this sort of thing and know what the compiler will do? I would appreciate any tips you may have on exploiting compiler optimizations. Thanks!
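In other words, what I am hoping is that someFunction<3>, for example, effectively compiles down to straight-line code with the branch resolved at compile time. Written out by hand, it would be something like this (just a hypothetical sketch; someFunctionExpanded3 is a made-up name and the work is still only placeholder comments):

//Hypothetical hand-expansion of someFunction<3>: what I am hoping the
//compiler effectively produces after unrolling the loop and folding the branch.
__device__ void someFunctionExpanded3(dataType someParameter)
{
	//Do some real but uninteresting work here.   (a == 0: the if condition is false, branch removed)
	//Do some real but uninteresting work here.   (a == 1: the if condition is false, branch removed)
	//Do some real but uninteresting work here.   (a == 2: the last iteration)
	//Do some more work here.                     (a == 2: the condition is known true, so this runs unconditionally)
}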

Sorry if my answer is naive. If I need a loop to run n times and only do some extra work in the last iteration, I would run the loop for n-1 iterations and then copy the loop body, together with the extra work, once after it, so the code doesn’t have to test on every iteration whether it is in the last one. Sure, the code is a little bigger, but you are unrolling the loop anyway, so the effect is the same, just a little faster.
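To sketch it with the 15-iteration example from the question (the work is again left as placeholder comments):

#pragma unroll
for (int a = 0; a < 14; a++)   //only the first 14 iterations: no test needed inside the loop
{
	//Do some real but uninteresting work here.
}
//Do some real but uninteresting work here.   (the copied-out last iteration, a == 14)
//Do some more work here.                     (the extra last-iteration work)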

I hope this helps.

Yes, the if gets optimized away. The only potential problem in that area is that the current compiler can only unroll the innermost loop. I don’t know where this limitation comes from, but I sometimes find it quite annoying.

BTW, you can check this yourself by looking at the .ptx intermediate output (which is still quite readable, with loop bodies referencing back to the corresponding line numbers in the source) or by disassembling the binary with cuobjdump.

LOL. Yes, that is true; the effect is the same. I should have chosen a better example. I wanted an if statement that depended on “count”, but I still chose something trivial. Perhaps I should have used something a bit more complex, like:

if (a % 3 == 2)
{
	//do stuff that is cooler than what was being done before.
}

The .ptx file is the assembly I was referring to. It does reference line numbers, but I still find it a pain to navigate.

You are right, it all gets optimized away. I wrote a test program and looked at what the compiler generated. Besides noticing that it seemed to use far more registers than needed to get the job done (perhaps because I used volatile?), there were no branch instructions in there at all, just a lot of “mov.s32” instructions, which is what I was hoping for. For those who are curious, here are the source program and the PTX assembly:

#include <stdio.h> //printf
#include <cuda.h>  //CUDA commands

template<int count> __global__ void kernel(void)
{
	volatile int variable;

	#pragma unroll
	for (int a = 0; a < count; a++)
	{
		variable = a;

		if (a % 3 == 2)
		{
			variable = 100*a;
		}
	}
}

__host__ int main(void)
{
	kernel<8><<<dim3(1), dim3(1)>>>();
	cudaThreadSynchronize();

	//Exit.
	printf("\nPress <ENTER> to exit.\n");
	getchar();
	return 0;
}

.entry _Z6kernelILi8EEvv
{
	.reg .u32 %r<22>;
	.loc	16	4	0
$LDWbegin__Z6kernelILi8EEvv:
	.loc	16	11	0
	mov.s32 	%r1, 0;
	mov.s32 	%r2, %r1;
	mov.s32 	%r3, 1;
	mov.s32 	%r4, %r3;
	mov.s32 	%r5, 2;
	mov.s32 	%r6, %r5;
	.loc	16	14	0
	mov.s32 	%r7, 200;
	mov.s32 	%r8, %r7;
	.loc	16	11	0
	mov.s32 	%r9, 3;
	mov.s32 	%r10, %r9;
	mov.s32 	%r11, 4;
	mov.s32 	%r12, %r11;
	mov.s32 	%r13, 5;
	mov.s32 	%r14, %r13;
	.loc	16	14	0
	mov.s32 	%r15, 500;
	mov.s32 	%r16, %r15;
	.loc	16	11	0
	mov.s32 	%r17, 6;
	mov.s32 	%r18, %r17;
	mov.s32 	%r19, 7;
	mov.s32 	%r20, %r19;
	.loc	16	17	0
	exit;
$LDWend__Z6kernelILi8EEvv:
} // _Z6kernelILi8EEvv

Keep in mind that nvcc generates PTX in static single assignment form. The final register allocation is done by ptxas (the PTX assembler), so you shouldn’t conclude anything from the number of registers you see in a PTX file. To find out how many registers are actually used in the final cubin, pass --ptxas-options=-v to nvcc.

Yeah, I figured as much, since I was able to push the register count to impossible levels just by changing the numbers around. Thanks for the tip, though!