#pragma unroll not working?

Consider this very sophisticated and advanced sorting algorithm for 32 integer numbers :)

#define NUM 32

#define NUMTESTS 65536

__global__ static void bubbleSort(int * input, int *output) {

	__shared__ int shared[NUM+1];

	shared[NUM]=999999;

	const unsigned int tid = threadIdx.x;

	for (int test=0; test<NUMTESTS; ++test) {

		__syncthreads();

		shared[tid] = input[tid];

		__threadfence_block();

#pragma unroll 32

		for (int i=0; i<32; ++i) {

			if ((threadIdx.x&1) == (i&1)) {

				if (shared[tid+1]<shared[tid])

					swap(shared[tid+1],shared[tid]);

			}

			__threadfence_block();

		}

		output[tid] = shared[tid];

	}

}

I get the following running time:

Done! GPU execution time: 585.410156 ms

Now consider another code - I simply unroll the for loop manually, 32 times. I am not going to paste the code here for obvious reasons :)

Done! GPU execution time: 388.730560 ms

So my question is: is #pragma unroll really working?

Some answers you might want to hear:

    [*]Times are repetitive. It is not one of those strange nasty execution-time spikes.

    [*]I am using nvcc -O2 -arch sm_13 (although programming guide does not state if and when unrolling may not work)

    [*]Tested on GTX260, Win XP 32bit, Cuda 2.2

    [*]I call it in <<<1,32>>> configuration.

    [*]I launch the timer just before kernel call and I do use cudaThreadSynchronize() before stopping it.

    [*]Algorithm gives correct results in both cases

    [*]Bitonic sort in SDK version is slower for arrays of size 32 (600ms), but if you manually unroll all its for-s and squeeze if-else into if (it can be done), you can actually be faster. (In my case I have it in 306ms)

The easiest way to check is to call nvcc with the --ptx flag. Then examine the generated PTX file to see if unrolling has occurred.

You can also check the cubin file with decuda, then you really know what is going on.

I have seen cases where a loop didn’t get unrolled because the iterator variable ‘i’ was replaced by the compiler to a variable which was used inside the loop. No matter how hard I try to use the #pragma, the loop didn’t get unrolled. However, on the other hand, I did see loop unrolling on loops where I didn’t use the #pragma statement at all, so loop unrolling seems to be out of the programmers control.

It’s not entirely out of your control. The programming guide says that the default behavior of the compiler is to unroll short loops when possible. You can disable unrolling for a particular loop with #pragma unroll 1. There are some loops which the compiler cannot unroll for some reason, which is a compiler limitation, and also some loops which are not unrolled even when given an explicit trip count, which I believe is a bug.