Consider this very sophisticated and advanced sorting algorithm for 32 integers :)
#define NUM 32
#define NUMTESTS 65536

// Trivial compare-exchange swap helper used by the kernel below.
__device__ inline void swap(int &a, int &b) { int t = a; a = b; b = t; }

__global__ static void bubbleSort(int *input, int *output) {
    __shared__ int shared[NUM + 1];
    shared[NUM] = 999999;                 // sentinel (assumes inputs < 999999) so thread 31 never swaps past the end
    const unsigned int tid = threadIdx.x;

    for (int test = 0; test < NUMTESTS; ++test) {
        __syncthreads();
        shared[tid] = input[tid];
        __threadfence_block();

        // Odd-even transposition sort: 32 alternating compare-exchange phases.
        #pragma unroll 32
        for (int i = 0; i < 32; ++i) {
            if ((threadIdx.x & 1) == (i & 1)) {
                if (shared[tid + 1] < shared[tid])
                    swap(shared[tid + 1], shared[tid]);
            }
            __threadfence_block();
        }
        output[tid] = shared[tid];
    }
}
I get the following running time:
Done! GPU execution time: 585.410156 ms
Now consider a second version - I simply unrolled the inner for loop manually, 32 times. I am not going to paste the full code here, for obvious reasons :)
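Roughly, each manually unrolled step is just the compare-exchange with the loop counter's parity substituted in; the first two of the 32 steps look like this (a sketch, not my exact code):

// step 0 (even phase)
if ((threadIdx.x & 1) == 0)
    if (shared[tid + 1] < shared[tid]) swap(shared[tid + 1], shared[tid]);
__threadfence_block();
// step 1 (odd phase)
if ((threadIdx.x & 1) == 1)
    if (shared[tid + 1] < shared[tid]) swap(shared[tid + 1], shared[tid]);
__threadfence_block();
// ... and so on for all 32 steps

With that version I get: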
Done! GPU execution time: 388.730560 ms
So my question is: is #pragma unroll really working?
Some details you might want to know up front:
- The times are reproducible; this is not one of those strange, nasty execution-time spikes.
- I am compiling with nvcc -O2 -arch sm_13 (the Programming Guide does not say if or when unrolling may fail to happen).
- Tested on a GTX 260, Windows XP 32-bit, CUDA 2.2.
- I launch the kernel with a <<<1,32>>> configuration.
- I start the timer just before the kernel call and I do call cudaThreadSynchronize() before stopping it (a rough sketch of the timing pattern is below).
- The algorithm gives correct results in both cases.
- The SDK version of bitonic sort is slower for arrays of size 32 (600 ms), but if you manually unroll all of its for loops and squeeze the if-else into a single if (it can be done), it actually becomes faster - 306 ms in my case (a sketch of that collapse is below).
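About the timing: the pattern is essentially the following (a sketch using cudaEvent_t as one possible timer, with placeholder names d_input/d_output for the device buffers; not my exact harness):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                   // start just before the kernel call
bubbleSort<<<1, 32>>>(d_input, d_output);    // d_input/d_output: device buffers (placeholder names)
cudaThreadSynchronize();                     // make sure the kernel has really finished
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Done! GPU execution time: %f ms\n", ms);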
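And about squeezing the bitonic if-else into a single if: in the SDK kernel the two direction branches of the compare-exchange can be folded into one predicate, roughly like this (a sketch of the idea, not necessarily the exact form I ended up with; tid, j, k, ixj and shared are the names from the SDK sample):

unsigned int ixj = tid ^ j;
if (ixj > tid) {
    // Ascending for (tid & k) == 0, descending otherwise; both branches collapse
    // into a single comparison (equal elements get swapped needlessly, which is harmless):
    if ((shared[tid] > shared[ixj]) == ((tid & k) == 0))
        swap(shared[tid], shared[ixj]);
}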