Consider this very sophisticated and advanced sorting algorithm for 32 integers :)
#define NUM 32
#define NUMTESTS 65536

// Trivial compare-exchange swap helper used by the kernel below.
__device__ inline void swap(int &a, int &b) { int t = a; a = b; b = t; }

__global__ static void bubbleSort(int *input, int *output) {
    __shared__ int shared[NUM + 1];
    shared[NUM] = 999999;                 // sentinel (assumes inputs < 999999) so thread 31 never swaps past the end
    const unsigned int tid = threadIdx.x;

    for (int test = 0; test < NUMTESTS; ++test) {
        __syncthreads();
        shared[tid] = input[tid];
        __threadfence_block();

        // Odd-even transposition sort: 32 alternating compare-exchange phases.
        #pragma unroll 32
        for (int i = 0; i < 32; ++i) {
            if ((threadIdx.x & 1) == (i & 1)) {
                if (shared[tid + 1] < shared[tid])
                    swap(shared[tid + 1], shared[tid]);
            }
            __threadfence_block();
        }
        output[tid] = shared[tid];
    }
}
I get the following running time:
Done! GPU execution time: 585.410156 ms
Now consider a second version - I simply unrolled the inner for loop manually, 32 times. I am not going to paste the full code here, for obvious reasons :)
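Roughly, each manually unrolled step is just the compare-exchange with the loop counter's parity substituted in; the first two of the 32 steps look like this (a sketch, not my exact code):

// step 0 (even phase)
if ((threadIdx.x & 1) == 0)
    if (shared[tid + 1] < shared[tid]) swap(shared[tid + 1], shared[tid]);
__threadfence_block();
// step 1 (odd phase)
if ((threadIdx.x & 1) == 1)
    if (shared[tid + 1] < shared[tid]) swap(shared[tid + 1], shared[tid]);
__threadfence_block();
// ... and so on for all 32 steps

With that version I get: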
Done! GPU execution time: 388.730560 ms
So my question is: is #pragma unroll really working?
Some details you might want to know up front:
- The times are reproducible; this is not one of those strange, nasty execution-time spikes.
- I am compiling with nvcc -O2 -arch sm_13 (the Programming Guide does not say if or when unrolling may fail to happen).
- Tested on a GTX 260, Windows XP 32-bit, CUDA 2.2.
- I launch the kernel with a <<<1,32>>> configuration.
- I start the timer just before the kernel call and I do call cudaThreadSynchronize() before stopping it (a rough sketch of the timing pattern is below).
- The algorithm gives correct results in both cases.
- The SDK version of bitonic sort is slower for arrays of size 32 (600 ms), but if you manually unroll all of its for loops and squeeze the if-else into a single if (it can be done), it actually becomes faster - 306 ms in my case (a sketch of that collapse is below).
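About the timing: the pattern is essentially the following (a sketch using cudaEvent_t as one possible timer, with placeholder names d_input/d_output for the device buffers; not my exact harness):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                   // start just before the kernel call
bubbleSort<<<1, 32>>>(d_input, d_output);    // d_input/d_output: device buffers (placeholder names)
cudaThreadSynchronize();                     // make sure the kernel has really finished
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Done! GPU execution time: %f ms\n", ms);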
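And about squeezing the bitonic if-else into a single if: in the SDK kernel the two direction branches of the compare-exchange can be folded into one predicate, roughly like this (a sketch of the idea, not necessarily the exact form I ended up with; tid, j, k, ixj and shared are the names from the SDK sample):

unsigned int ixj = tid ^ j;
if (ixj > tid) {
    // Ascending for (tid & k) == 0, descending otherwise; both branches collapse
    // into a single comparison (equal elements get swapped needlessly, which is harmless):
    if ((shared[tid] > shared[ixj]) == ((tid & k) == 0))
        swap(shared[tid], shared[ixj]);
}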