I just updated to CUDA 4.1.28, updating the following products/drivers:
CUDA toolkit 4.1.28
dev driver for Windows 7 64-bit, 286.16
GPU Computing SDK 4.1.28
Parallel Nsight 2.1
Since updating, one of my kernels keeps failing with garbage output; it does not display any errors, and the CUDA runtime does not report any either. The rest of my kernels run fine; it is just this one that is causing an issue, and it had run flawlessly in CUDA 4.0.17.
I did some testing and was able to get the kernel to produce the correct result by enabling the device debug flag, -G0.
The code is currently of a sensitive nature. I will try to narrow down the problem and post some code if I can. In the meantime, does anyone have a suggestion?
Compiling your kernel for SM 1.2 or 1.3 will still use the older Open64-based compiler, whereas SM 2.0 and 2.1 will use the new LLVM-based compiler (starting with CUDA 4.1).
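To compare the two back ends, you can build the same source for both target architectures; these are standard nvcc options (the file name kernel.cu is just a placeholder):

```shell
# CUDA 4.1: SM 1.x targets go through the older Open64-based back end
nvcc -arch=sm_13 -o app_sm13 kernel.cu

# CUDA 4.1: SM 2.x targets go through the new LLVM-based back end
nvcc -arch=sm_20 -o app_sm20 kernel.cu

# Device-debug build (disables device code optimisation), useful for
# checking whether an optimisation pass is involved in the miscompile
nvcc -G -arch=sm_20 -o app_debug kernel.cu
```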
I wonder if NVIDIA provides any option to also use Open64 for SM 2.x?
They are entirely separate front ends. I would suggest filing a bug against the compiler, attaching a small self-contained repro case. Thank you for your help.
I have now actually fixed the problem, and it works with both compilers.
Previously I was doing the following, which worked with Open64 but not LLVM:
....
if (threadIdx.x == 0)
    result = someFunction(data, size, buffer); // result and data are stored in global memory
else
    someFunction(data, size, buffer);
....
__device__ float someFunction(float *data, int size, float *buffer)
{
    // big calculation using the global data input
    for (int i = threadIdx.x; i < size; i += blockDim.x)
        buffer[i] = data[i] * ...
    __syncthreads();

    // finalise the result in one thread
    if (threadIdx.x == 0)
    {
        float result = ...
        return result; // threads return from different branches here
    }
    else
    {
        return 0;
    }
}
After reviewing the code I realised it was ugly and not the correct approach, so I changed it to the following, which now works with both the LLVM and Open64 compilers:
....
result = someFunction(data, size, buffer); // result and data are stored in global memory
....

__device__ float someFunction(float *data, int size, float *buffer)
{
    // big calculation using the global data input
    for (int i = threadIdx.x; i < size; i += blockDim.x)
        buffer[i] = data[i] * ...
    __syncthreads();

    __shared__ float result;
    // finalise the result in one thread
    if (threadIdx.x == 0)
    {
        result = ...
    }
    __syncthreads(); // every thread reaches this barrier before reading result
    return result;
}
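For anyone hitting the same issue, here is a minimal self-contained sketch of the corrected pattern; the actual calculation (a multiply and a block-wide sum) and all host-side names are placeholders, not the original code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ float someFunction(float *data, int size, float *buffer)
{
    // All threads of the block cooperate on the element-wise work.
    for (int i = threadIdx.x; i < size; i += blockDim.x)
        buffer[i] = data[i] * 2.0f;   // placeholder for the real calculation
    __syncthreads();

    __shared__ float result;
    // One thread finalises the result...
    if (threadIdx.x == 0)
    {
        float sum = 0.0f;
        for (int i = 0; i < size; ++i)
            sum += buffer[i];
        result = sum;
    }
    __syncthreads();   // ...and every thread waits before reading it
    return result;     // uniform return value: no divergent control flow
                       // surrounds either barrier
}

__global__ void kernel(float *data, int size, float *buffer, float *out)
{
    float r = someFunction(data, size, buffer); // every thread makes the call
    if (threadIdx.x == 0)
        *out = r;
}

int main()
{
    const int n = 256;
    float h[n], *d_data, *d_buf, *d_out, h_out;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_buf,  n * sizeof(float));
    cudaMalloc(&d_out,  sizeof(float));
    cudaMemcpy(d_data, h, n * sizeof(float), cudaMemcpyHostToDevice);
    kernel<<<1, 128>>>(d_data, n, d_buf, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", h_out);
    return 0;
}
```

The key point is that `__syncthreads()` is only defined when every thread of the block reaches the same barrier, which the original divergent call site did not guarantee to the compiler.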
Thanks for the help, guys. I think the problem was my dodgy coding style all along; this was one of the first kernels I wrote, and it was a bit of a hack-and-slash effort.