Let’s say you write a program that must compute on a fixed but arbitrarily sized array. You want the program to run well on any GPU, so you can’t know how many blocks or how many threads per block will be available until the code runs. Assuming it is possible (but not necessary) that the number of threads available is less than the number of elements in the array, how do you deal with this efficiently? Well, the code below is my solution. Now, I know there are a lot of arbitrary numbers in there, some of which are quite ridiculous (e.g. all current NVIDIA cards have a per-block thread limit of 512 or 1024, so listing the others is unnecessary), but this code is just an example.

This code does almost all of the thinking for you. You just need to specify the calculation you want done and the size of the array, which can be anything. I have compiled this code (compute capability 1.0 compatible) and read the resulting PTX. It does indeed remove all branching statements (except for the “x < DATA_SIZE” check during the last pass of the loop). It even unrolls the two nested for loops. So, it cannot get any more efficient for *all* GPUs. Enjoy!

```
/**
 * UnrollingExample.cu
 * Demonstrates how to unroll loops while using a template.
 *
 * DBJ
 * 8 August 2011
 **/
#include <cuda_runtime.h>

#define DATA_SIZE 4000
#define FIRST_LIMIT 3
#define SECOND_LIMIT 2

template<int blockSize, int blockIterations> __global__ void kernel()
{
    __shared__ float anArray[DATA_SIZE]; //Or some other sort of array.
    int x;
    #pragma unroll
    for (int a = 0; a < blockIterations; a++)
    {
        x = a*blockSize + threadIdx.x;
        if (a == 0 || a < blockIterations - 1 || x < DATA_SIZE)
        {
            #pragma unroll
            for (int b = 0; b < FIRST_LIMIT; b++)
            {
                #pragma unroll
                for (int c = 0; c < SECOND_LIMIT; c++)
                    anArray[x] = x*3.14f; //Or do some other calculation here.
            }
        }
    }
}

int main()
{
    cudaDeviceProp device;
    //Get the first device.
    cudaGetDeviceProperties(&device, 0);
    //int maxBlocks = device.multiProcessorCount; //Just in case you want to use it.
    int maxThreads = device.maxThreadsPerBlock;
    if (maxThreads >= DATA_SIZE)
        kernel<0, 1><<<dim3(1), dim3(DATA_SIZE)>>>();
    else
    {
        switch (maxThreads)
        {
        case 2048: kernel<2048, (DATA_SIZE + 2047) / 2048><<<dim3(1), dim3(maxThreads)>>>(); break;
        case 1024: kernel<1024, (DATA_SIZE + 1023) / 1024><<<dim3(1), dim3(maxThreads)>>>(); break;
        case 512:  kernel<512,  (DATA_SIZE + 511)  / 512> <<<dim3(1), dim3(maxThreads)>>>(); break;
        case 256:  kernel<256,  (DATA_SIZE + 255)  / 256> <<<dim3(1), dim3(maxThreads)>>>(); break;
        case 128:  kernel<128,  (DATA_SIZE + 127)  / 128> <<<dim3(1), dim3(maxThreads)>>>(); break;
        case 64:   kernel<64,   (DATA_SIZE + 63)   / 64>  <<<dim3(1), dim3(maxThreads)>>>(); break;
        case 32:   kernel<32,   (DATA_SIZE + 31)   / 32>  <<<dim3(1), dim3(maxThreads)>>>(); break;
        }
    }
    cudaDeviceSynchronize();
    return 0;
}
```

Now to explain some of the nuances. The statement “if (a == 0 || a < blockIterations - 1 || x < DATA_SIZE)” gets almost entirely removed during unrolling. We need to be sure that we don’t accidentally access values outside the array (the “x < DATA_SIZE” part). If there are at least as many threads as array elements, we don’t need to know the exact number of threads (“kernel<0, 1>”) and we just let every thread compute (so the “a == 0” part makes the condition true at compile time and the branch is optimized away). If there are several times more elements than threads, every pass except the last is guaranteed to stay in bounds (the “a < blockIterations - 1” part), so only the final pass keeps the “x < DATA_SIZE” check. Also, “x = a*blockSize + threadIdx.x;” looks like it is doing a multiplication. However, “blockSize” is a constant because of the template, and since the loop gets unrolled, “a*blockSize” is known at compile time, which reduces the statement to a simple addition.

You will notice that I have this strange formula when calling the kernels, “(DATA_SIZE + 1023) / 1024”. This is my integer ceiling function; it doesn’t rely on a function call and can therefore be evaluated at compile time. The mathematics is as follows: ceil(x / y) = floor((x + y - 1) / y), assuming that x and y are both positive integers. (The formula does not work for floating-point x and y.) So, I think that is everything. I hope you like it!

P.S. In case you were wondering about those two innermost loops that seem quite unnecessary, I should explain. This code was originally part of a project where I was training a bunch of small neural networks, so I had a ton of those small nested loops. It was good to see those loops unroll along with everything else and to watch all of those conditionals just disappear.