Long compile time, plus a question about the best way to allocate local variables

So I am working on a project and I wrote a CUDA kernel and a bunch of small device functions that it calls. The kernel is launched with ~960 blocks of 16x16 threads. I found that when I compile the .cu file, nvcc first spits out a bunch of copies of the warning message below, but then sits "idle" for a while (ranging from 5 to 20 minutes, depending on the computer) before it finishes compiling. What is causing the compiler to take so long?

My other question is how to properly allocate the local variables my functions need. I am adapting someone else's code, and some of the functions are pretty hideous: they take around 20 arguments and declare around 40 local variables. I think I read that the integers (most of the local variables) would be put in registers, and with the functions as they were, I got an error from nvcc saying it ran out of registers. What is the best way to handle these kinds of functions? I ended up creating a struct with all the needed fields and just allocating that, since I think it goes into shared memory, if I read that right. I could also break up the functions more, but then I would need to pass even more arguments around, and the functions that already take ~20 parameters would just get worse. Any ideas?

Putting the automatic variables inside a struct will cause them to be placed in local memory, not shared memory => you will take a performance hit from it (assuming the latency cannot be completely hidden), but if you run out of registers, you run out of registers.
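
To make that concrete, here is a rough sketch of the struct-of-locals approach you described (the names are made up): the struct is still a per-thread automatic variable, so when it does not fit in registers it spills to local memory (off-chip), not shared memory.

struct Locals
{
    int i, j, k;       // stand-ins for the many integer locals
    float accum;
};

__device__ int bigFunction(int bar, int bor)
{
    Locals loc;        // per-thread automatic variable; spills to local memory, not shared
    loc.i = bar * bor;
    loc.accum = (float)loc.i;
    return loc.i;
}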

One idea is to put as much as possible into shared memory manually, but this is time-consuming, because shared variables are shared between the threads of a block: you have to give each thread its own slot in an array of shared variables.

For example:

__device__ int foo(int bar, int bor)
{
    int x;
    x = bar * bor;
    return x;
}

becomes (with a block of 256 threads)

__device__ int foo(int bar, int bor)
{
    __shared__ int xs[256];
    // One entry for each thread in xs, simulating a stack variable
    xs[threadIdx.x] = bar * bor;
    return xs[threadIdx.x];
}

or something similar, but this of course is really annoying to do, and you run out of shared memory quite quickly (remember that you also need to think about the number of active blocks per multiprocessor).
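
Since your blocks are 16x16, the index into the shared array would also have to be linearized over both thread dimensions; a rough sketch of the same trick for a 2D block:

__device__ int foo(int bar, int bor)
{
    __shared__ int xs[16 * 16];                       // one slot per thread in a 16x16 block
    int tid = threadIdx.y * blockDim.x + threadIdx.x; // linearized 2D thread index
    xs[tid] = bar * bor;
    return xs[tid];
}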

=> Feature wish: a sharedlocal keyword (or similar) that does the above magic trick automatically.

Also, a good way to help the compiler is to use a concept called minimal scoping: minimize the lifetime of all of your automatic variables.

Example

Instead of

int foo(int* data, int w, int h, int stride)
{
    int result = 0;
    int i, j;

    for (i = 0; i < w; i++)
    {
        for (j = 0; j < h; j++)
            result += data[i * stride + j];
    }

    return result;
}

write

int foo(int* data, int w, int h, int stride)
{
    int result = 0;
    int i;

    for (i = 0; i < w; i++)
    {
        int j;

        for (j = 0; j < h; j++)
            result += data[i * stride + j];
    }

    return result;
}

This way the compiler has a little less work to do at the function-level scope.

Also, since you have 256 threads per block and run out of registers, you should limit register usage with the --maxrregcount compiler flag (if you use common.mk from the SDK, set maxregisters := 32 in your Makefile).
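
For example, if you invoke nvcc directly instead of going through common.mk, it might look something like this (file names are just placeholders):

# limit each thread to 32 registers so more blocks can stay resident
nvcc --maxrregcount=32 -c mykernel.cu -o mykernel.o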

Then, if you are really desperate about compilation time, you can pass the following flag to nvcc to disable some optimizations,

but I'm unsure how this affects performance (they also said this would not be needed in 2.2):

CUDACCFLAGS += --opencc-options -WOPT:expr_reass=off