CUDA 4.1 broke my kernel (upgraded from 4.0 to 4.1)

Hi,

I just updated to CUDA version 4.1.28, updating the following products / drivers:

  • CUDA toolkit 4.1.28
  • Dev driver for Windows 7 64-bit, 286.16
  • GPU SDK 4.1.28
  • Parallel Nsight 2.1

Since updating, one of my kernels keeps failing with rubbish output; nothing is displayed as an error and the CUDA runtime does not report any errors either. The rest of my kernels run fine; it is just this one kernel causing an issue, and it had run flawlessly under CUDA 4.0.17.

I did some testing and was able to get the kernel to produce the correct result by enabling the debug flag -G0.

Anyone got any ideas on the cause of this?

Regards,

Dave

You can make a repro and file a bug report :)

Of course you may also show us the code of that kernel.

The code is currently of a sensitive nature. I will try to narrow down the problem and post some code if I can. In the meantime, has anyone got a suggestion?

Compiling your kernel for SM 1.2 or 1.3 will still use the older Open64-based compiler, whereas SM 2.0 and 2.1 will use the new LLVM-based compiler (starting with CUDA 4.1).
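For example, the architecture flag decides which compiler path you get (the command lines below are just a sketch; kernel.cu is a placeholder file name):

nvcc -arch=sm_13 -c kernel.cu    (SM 1.3: Open64-based compiler)
nvcc -arch=sm_20 -c kernel.cu    (SM 2.0: LLVM-based compiler, new in CUDA 4.1)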

I wonder if NVIDIA provides any option to also use Open64 for SM 2.x?

Christian

It does. Run ‘nvcc … --nvvm’ to force it to use LLVM, and ‘nvcc … --open64’ to force it to use Open64.
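For example, with the same placeholder kernel.cu as above:

nvcc -arch=sm_20 --nvvm -c kernel.cu      (force the LLVM-based compiler)
nvcc -arch=sm_20 --open64 -c kernel.cu    (force the Open64-based compiler)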

Using the compiler option --open64 to force it to use Open64 has fixed the problem.

In a nutshell, what is the difference between the LLVM and Open64 compilers?

They are entirely separate front ends. I would suggest filing a bug against the compiler, attaching a small, self-contained repro case.

Thank you for your help.

I have now actually fixed the problem, and it works with both compiler options.

Previously I was doing the following, which worked with Open64 but not LLVM:

....
if(threadIdx.x == 0)
    result = someFunction(data, size, buffer); // result and data are stored in global memory
else
    someFunction(data, size, buffer);
....

__device__ float someFunction(float *data, int size, float *buffer)
{
    // big calculation using the global data input
    for(int i = threadIdx.x; i < size; i += blockDim.x)
        buffer[i] = data[i] * ...

    __syncthreads();

    // finalise the result in one thread
    if(threadIdx.x == 0)
    {
        float result = ...
        return result;
    }
    else
    {
        return 0;
    }
}

After reviewing the code I realised it was ugly and not the correct way to do it: because someFunction() was called from inside a divergent if/else, the __syncthreads() inside it was being reached from divergent control flow, which is not allowed. I changed it to the following, which now works with both the LLVM and Open64 options:

....
result = someFunction(data, size, buffer); // result and data are stored in global memory
....

__device__ float someFunction(float *data, int size, float *buffer)
{
    // big calculation using the global data input
    for(int i = threadIdx.x; i < size; i += blockDim.x)
        buffer[i] = data[i] * ...

    __syncthreads();

    __shared__ float result;

    // finalise the result in one thread
    if(threadIdx.x == 0)
    {
        result = ...
    }

    __syncthreads();

    return result;
}
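For anyone who runs into the same thing, here is a minimal, self-contained sketch of the corrected pattern. The kernel name, launch configuration and the placeholder calculation are illustrative only, not my original code (error checking omitted for brevity):

#include <cstdio>
#include <cuda_runtime.h>

__device__ float someFunction(float *data, int size, float *buffer)
{
    // All threads of the block work on the global input.
    for (int i = threadIdx.x; i < size; i += blockDim.x)
        buffer[i] = data[i] * 2.0f;            // placeholder calculation

    __syncthreads();

    // Finalise the result in one thread, then broadcast it via shared memory.
    __shared__ float result;
    if (threadIdx.x == 0)
    {
        float sum = 0.0f;
        for (int i = 0; i < size; ++i)         // placeholder finalisation
            sum += buffer[i];
        result = sum;
    }
    __syncthreads();

    return result;                             // every thread gets the same value
}

__global__ void someKernel(float *data, int size, float *buffer, float *out)
{
    // Every thread calls the function, so __syncthreads() is never divergent.
    float r = someFunction(data, size, buffer);
    if (threadIdx.x == 0)
        *out = r;                              // store the result in global memory
}

int main()
{
    const int size = 256;
    float h_data[size];
    for (int i = 0; i < size; ++i) h_data[i] = 1.0f;

    float *d_data, *d_buffer, *d_out;
    cudaMalloc(&d_data,   size * sizeof(float));
    cudaMalloc(&d_buffer, size * sizeof(float));
    cudaMalloc(&d_out,    sizeof(float));
    cudaMemcpy(d_data, h_data, size * sizeof(float), cudaMemcpyHostToDevice);

    someKernel<<<1, 128>>>(d_data, size, d_buffer, d_out);

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("result = %f (expected %f)\n", h_out, 2.0f * size);

    cudaFree(d_data); cudaFree(d_buffer); cudaFree(d_out);
    return 0;
}

The important point is that every thread of the block calls someFunction(), so neither __syncthreads() is reached from divergent code, and the shared variable broadcasts the finalised result to all threads.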

Thanks for the help guys. I think the problem was down to my dodgy coding style; this kernel was one of the first I had written and was a bit of a hack-and-slash approach.