Annoying problems with the CUDA compiler

After implementing several applications in CUDA, I have run into some annoying problems.

In simple code (without deep loops and conditions) I have not seen any problems. But in complex code, with loops nested several levels deep and a large number of parameters and functions, I have seen some failures.

They only happen in complex code, so the failures are very hard to isolate: if you change some part of the code, the failure disappears. That also makes it difficult to share the code in this forum.

Here is a simple example of a function that is part of a larger, complex program:

__device__ void function(float *result, float *array1, float *array2, int iters1, int iters2){
	int count = 0;
	printf("ITERS2: %i\n", iters2);
	for (int m=0; m<iters1; m++){
		// inner index renamed so it no longer shadows the outer m
		for (int n=0; n<iters2; n++){
			printf("ITERS2: %i ....... count: %i\n",iters2,count);
			// ... (rest of the loop body not shown)
		}
	}
}

On its own this simple code does not fail, but inside the complex program the second (inner) loop runs only 1 iteration even though iters2=8. The function is called from a single thread only.

ITERS2: 8 ....... count: 1

This is very annoying. I have also seen problems with __syncthreads() in some complex code.

You will have to come up with some complete repro cases (simplified as much as possible) if you expect nVidia to fix it.
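As a starting point, a repro skeleton could look roughly like the sketch below. It is only an illustration under assumptions: the float array parameters are dropped, the loop body is reduced to a counter, and the names repro_kernel and run_repro are made up here. The count is written to device memory and checked on the host instead of relying on device printf output, and every CUDA call is checked for errors.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the function from the post: it counts the
// inner-loop iterations and stores the total so the host can check it.
__device__ void function(int *result, int iters1, int iters2) {
    int count = 0;
    for (int m = 0; m < iters1; m++) {
        for (int n = 0; n < iters2; n++) {
            count++;
        }
    }
    *result = count;
}

__global__ void repro_kernel(int *result, int iters1, int iters2) {
    // Only one thread calls the function, as described in the post.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        function(result, iters1, iters2);
    }
}

// Launches the kernel and returns the count, or -1 on any CUDA error.
int run_repro(int iters1, int iters2) {
    int *d_result = nullptr;
    int h_result = -1;
    if (cudaMalloc(&d_result, sizeof(int)) != cudaSuccess) return -1;

    repro_kernel<<<1, 64>>>(d_result, iters1, iters2);
    cudaError_t err = cudaGetLastError();                   // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    if (err == cudaSuccess)
        err = cudaMemcpy(&h_result, d_result, sizeof(int),
                         cudaMemcpyDeviceToHost);
    cudaFree(d_result);

    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return h_result;
}

int main() {
    int count = run_repro(2, 8);
    printf("count = %d (expected %d)\n", count, 2 * 8);
    return count == 2 * 8 ? 0 : 1;
}
```

From there, pieces of the real program can be added back one at a time until the count stops matching.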

It is difficult because, as I said, the error disappears when I simplify the code.

However, I have confirmed that the problems happen when I use dynamic shared memory. So I think the compiler problems are related to this, and to cases where the workload is not balanced between the warps of the same block.
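To make the pattern concrete, here is a minimal sketch of what I mean: dynamic shared memory, a different amount of work per warp, and a barrier. It is not my real code. The names unbalanced_kernel and run_unbalanced and the way the work is split are assumptions for illustration only. The barrier is placed after the uneven loop so that every thread of the block reaches it, because __syncthreads() inside a branch that not all threads of the block take is undefined behavior.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each warp does a different amount of work, then the
// block combines per-thread results through dynamically sized shared memory.
__global__ void unbalanced_kernel(int *out, int base_iters) {
    extern __shared__ int partial[];      // size is set at launch, in bytes

    int warp  = threadIdx.x / warpSize;
    int iters = base_iters * (warp + 1);  // deliberately uneven per warp

    int count = 0;
    for (int i = 0; i < iters; i++) {
        count++;
    }
    partial[threadIdx.x] = count;

    // Every thread of the block reaches this barrier, whatever its loop length.
    __syncthreads();

    if (threadIdx.x == 0) {
        int total = 0;
        for (int t = 0; t < (int)blockDim.x; t++) total += partial[t];
        *out = total;
    }
}

// Launches one block and returns the block total, or -1 on any CUDA error.
int run_unbalanced(int threads, int base_iters) {
    int *d_out = nullptr;
    int h_out = -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) return -1;

    // The third launch parameter is the dynamic shared memory size in bytes.
    size_t shmem_bytes = threads * sizeof(int);
    unbalanced_kernel<<<1, threads, shmem_bytes>>>(d_out, base_iters);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return h_out;
}

int main() {
    // Two warps of 32 threads: 32*8 + 32*16 = 768 iterations in total.
    int total = run_unbalanced(64, 8);
    printf("total = %d (expected 768)\n", total);
    return total == 768 ? 0 : 1;
}
```

If the shared memory size passed at launch is smaller than what the kernel indexes, the kernel reads and writes out of bounds, so that size is worth checking as well.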

Thanks!