I’ve spent about a day trying to understand what is wrong with my kernel function, but I still have no idea.
Initially I had a fairly long and complicated function. I tried to localize the error but couldn’t. Changes in different parts of it sometimes made the error disappear, but there was no logic to it.
After several hours of relaunching, I’ve cut out everything except the parts that keep the kernel crashing. Here it is:
#include <stdio.h>

__global__ void Kernel(float** nww, unsigned nww_w, unsigned nww_h, float tau)
{
    unsigned c = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    unsigned r = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
    if ((c < nww_w) && (r < nww_h)) {
        float actw = 0;
        double lambda[2];
        double bestF = 10;
        double fdw(0);
        for (int i = 0; i < 2; ++i) {
            int k = (i ? 0 : 1);
            int rhoNeg = 0;
            double dw = (rhoNeg ? -tau : tau) * lambda[k];
            double f = i ?
                dw / (2*tau):
                dw / (2*tau);
            fdw = dw;
            bestF = f;
        }
        if (bestF < 10) {
            actw = float(fdw);
        }
        nww[r][c] = 2*actw;
    }
}
void TestKernel()
{
    cudaSetDevice(0);
    //=============== Init matrix ====================
    int nHeight = 10;
    int nWidth = 10;
    float* pMatrixLinear = NULL;
    float** pMatrix = NULL;
    cudaMalloc((void**)&pMatrixLinear, nHeight*nWidth*sizeof(float));
    cudaMemset(pMatrixLinear, 0, nHeight*nWidth*sizeof(float));
    cudaMalloc((void**)&pMatrix, nHeight*sizeof(float*));
    float** tmpArray = new float*[nHeight];
    // Just assign the row pointers into the linear memory
    for (int i = 0; i < nHeight; i++) {
        tmpArray[i] = pMatrixLinear + i*nWidth;
    }
    // Copy these pointers to device memory
    cudaMemcpy(pMatrix, tmpArray, nHeight*sizeof(float*), cudaMemcpyHostToDevice);
    delete[] tmpArray;
    //=============== End of Init matrix ====================
    dim3 blockSize(16, 16, 1);
    dim3 gridSize(1, 1, 1);
    float tau = 0.5;
    Kernel<<<gridSize, blockSize>>>(pMatrix, nWidth, nHeight, tau);
    cudaError err;
    err = cudaGetLastError();
    if (cudaSuccess != err) {
        printf("UpdatePrimalKernel failed \n");
        exit(-1);
    }
    err = cudaThreadSynchronize();
    if (cudaSuccess != err) {
        printf("cudaThreadSynchronize() failed \n");
        // Print through "%s" so the error text is never treated as a format string
        printf("%s\n", cudaGetErrorString(err));
        exit(-1);
    }
    cudaFree(pMatrixLinear);
    cudaFree(pMatrix);
}
Output:
cudaThreadSynchronize() failed
unknown error
Press any key to continue . . .
I’ve attached the .cu file for convenience. I’m using CUDA Toolkit 4.0.17 on Windows 7 64-bit, with a Quadro 5000 and a GeForce GTX 480 (tried on both).
Here is the nvcc command line:
echo "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\\bin\nvcc.exe" -gencode=arch=compute_10,code=\"sm_10,compute_10\" -gencode=arch=compute_20,code=\"sm_20,compute_20\" --machine 64 -ccbin "c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -Xcompiler "/EHsc /W3 /nologo /Od /Zi /MDd " -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\\include" -maxrregcount=0 --compile -o "x64\Debug/TestClasses.vcproj.obj" TestClasses.vcproj
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\\bin\nvcc.exe" -gencode=arch=compute_10,code=\"sm_10,compute_10\" -gencode=arch=compute_20,code=\"sm_20,compute_20\" --machine 64 -ccbin "c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -Xcompiler "/EHsc /W3 /nologo /Od /Zi /MDd " -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\\include" -maxrregcount=0 --compile -o "x64\Debug/TestClasses.vcproj.obj" "c:\Users\sirotenko\Documents\Visual Studio 2008\Projects\TestClasses\TestClasses\TestClasses.vcproj"
These are almost the default settings.
After all the cuts this code makes no sense; it just keeps producing the error.
What I’ve found about this code:
- Almost any change to the kernel makes the error disappear, but I can’t see an explanation for this.
- If I change the internal variables from double to float, there is no error.
- If I change the nww variable from an array of row pointers to a linear array, there is no error.
- If I compile it with GPU debug information, there is no error.
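To make the linear-array change mentioned above concrete, here is roughly what that variant looks like. It is only a sketch: the names KernelLinear and TestKernelLinear are made up for illustration, the kernel body is otherwise copied unchanged (including the uninitialized lambda), and I use cudaDeviceSynchronize here where the original code calls cudaThreadSynchronize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of Kernel: nww is one contiguous buffer indexed as
// r * nww_w + c instead of a device-side table of row pointers.
__global__ void KernelLinear(float* nww, unsigned nww_w, unsigned nww_h, float tau)
{
    unsigned c = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    unsigned r = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
    if ((c < nww_w) && (r < nww_h)) {
        float actw = 0;
        double lambda[2];  // left uninitialized, exactly as in the original kernel
        double bestF = 10;
        double fdw(0);
        for (int i = 0; i < 2; ++i) {
            int k = (i ? 0 : 1);
            int rhoNeg = 0;
            double dw = (rhoNeg ? -tau : tau) * lambda[k];
            double f = i ?
                dw / (2*tau):
                dw / (2*tau);
            fdw = dw;
            bestF = f;
        }
        if (bestF < 10) {
            actw = float(fdw);
        }
        nww[r * nww_w + c] = 2*actw;  // the only change to the kernel body
    }
}

// Hypothetical host wrapper: allocates the buffer, launches the kernel and
// returns the first error reported by the runtime (cudaSuccess if none).
cudaError_t TestKernelLinear()
{
    const int nHeight = 10;
    const int nWidth = 10;
    float* pMatrixLinear = NULL;

    cudaError_t err = cudaMalloc((void**)&pMatrixLinear, nHeight*nWidth*sizeof(float));
    if (err != cudaSuccess) return err;
    err = cudaMemset(pMatrixLinear, 0, nHeight*nWidth*sizeof(float));
    if (err != cudaSuccess) { cudaFree(pMatrixLinear); return err; }

    dim3 blockSize(16, 16, 1);
    dim3 gridSize(1, 1, 1);
    KernelLinear<<<gridSize, blockSize>>>(pMatrixLinear, nWidth, nHeight, 0.5f);

    err = cudaGetLastError();           // launch errors
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();  // errors raised while the kernel runs
    }
    cudaFree(pMatrixLinear);
    return err;
}
```

The pMatrix pointer table and the tmpArray copy are no longer needed in this form, which is the only difference on the host side.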
kernel.cu (1.87 KB)