Crash of a kernel that takes an array of device pointers as a parameter ("unknown error" after kernel launch)

I’ve spent about a day trying to understand what is wrong with my kernel function, but I still have no idea.

Initially I had a quite long and complicated function. I tried to localize the error, but couldn’t: changes in different parts of it sometimes made the error disappear, with no discernible logic to it.

After several hours of relaunching I cut out everything except the parts that keep the kernel crashing. Here it is:

#include <stdio.h>

__global__ void Kernel(float** nww, unsigned nww_w, unsigned nww_h, float tau)
{
	unsigned c = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
	unsigned r = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
	if ((c < nww_w) && (r < nww_h)) {
		float  actw = 0;
		double lambda[2];
		double bestF = 10;
		double fdw(0);
		for (int i = 0; i < 2; ++i) {
			int k = (i ? 0 : 1);
			int rhoNeg = 0;
			double dw = (rhoNeg ? -tau : tau) * lambda[k];
			double f = i ?
				dw / (2*tau):
				dw / (2*tau);
			fdw = dw;
			bestF = f;
		}
		if (bestF < 10) {
			actw = float(fdw);
		}
		nww[r][c] = 2*actw;
	}
}

void TestKernel()
{
	cudaSetDevice(0);

	//===============  Init matrix ====================
	int nHeight = 10;
	int nWidth = 10;
	float* pMatrixLinear = NULL;
	float** pMatrix = NULL;
	cudaMalloc((void**)&pMatrixLinear, nHeight*nWidth*sizeof(float));
	cudaMemset(pMatrixLinear, 0, nHeight*nWidth*sizeof(float));
	cudaMalloc((void**)&pMatrix, nHeight*sizeof(float*));
	float** tmpArray = new float*[nHeight];
	//Just assign the pointers to linear memory
	for (int i = 0; i < nHeight; i++) {
		tmpArray[i] = pMatrixLinear + i*nWidth;
	}
	//Copy these pointers to device memory
	cudaMemcpy(pMatrix, tmpArray, nHeight*sizeof(float*), cudaMemcpyHostToDevice);
	delete[] tmpArray;
	//===============  End of Init matrix ====================

	dim3 blockSize(16, 16, 1);
	dim3 gridSize(1, 1, 1);
	float tau = 0.5f;
	Kernel<<<gridSize, blockSize>>>(pMatrix, nWidth, nHeight, tau);
	cudaError err;
	err = cudaGetLastError();
	if (cudaSuccess != err) {
		printf("UpdatePrimalKernel failed \n");
		exit(-1);
	}
	err = cudaThreadSynchronize();
	if (cudaSuccess != err) {
		printf("cudaThreadSynchronize() failed \n");
		printf("%s\n", cudaGetErrorString(err)); // pass the error text as an argument, not as the format string
		exit(-1);
	}
	cudaFree(pMatrixLinear);
	cudaFree(pMatrix);
}

Output:

cudaThreadSynchronize() failed

unknown error

Press any key to continue . . .

I’ve attached the .cu file for convenience. I’m using CUDA Toolkit 4.0.17 on Win 7 64-bit, with a Quadro 5000 and a GeForce GTX 480 (tried on both).

Here is the command line of nvcc:

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\\bin\nvcc.exe"    -gencode=arch=compute_10,code=\"sm_10,compute_10\" -gencode=arch=compute_20,code=\"sm_20,compute_20\"  --machine 64 -ccbin "c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin"    -Xcompiler "/EHsc /W3 /nologo /Od /Zi   /MDd  "  -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\\include" -maxrregcount=0  --compile -o "x64\Debug/TestClasses.vcproj.obj" "c:\Users\sirotenko\Documents\Visual Studio 2008\Projects\TestClasses\TestClasses\TestClasses.vcproj"

It’s almost default.

After all these cuts the code no longer makes any sense; it just keeps producing the error.

What I’ve found about this code:

  • Almost any change to the kernel makes the error go away, but I can’t see why.

  • If I change the internal variables from double to float - no error.

  • If I change the nww variable from an array of pointers to a linear array - no error.

  • If I compile it with GPU debug information - no error.

kernel.cu (1.87 KB)
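To show what I mean by the linear-array change: if nww is passed as a single float* and indexed manually, the kernel runs without the error. A sketch (loop body abbreviated, only the parameter and the store change):

```cuda
// Sketch of the "linear array" variant: nww is one contiguous buffer,
// indexed as r*nww_w + c instead of going through a table of row pointers.
__global__ void KernelLinear(float* nww, unsigned nww_w, unsigned nww_h, float tau)
{
	unsigned c = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
	unsigned r = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
	if ((c < nww_w) && (r < nww_h)) {
		float actw = 0;
		/* ... same computation as in Kernel above ... */
		nww[r * nww_w + c] = 2*actw;  // store via linear index
	}
}
```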

I can reproduce locally. Try using cuda-memcheck:

========= CUDA-MEMCHECK
calling kernel.
error error info unspecified launch failure error code 4
cudaThreadSynchronize() failed
unspecified launch failure
========= Invalid global read of size 8
========= at 0x000000f0 in Kernel
========= by thread (0,2,0) in block (0,0,0)
========= Address 0x103fffc30 is out of bounds

========= ERROR SUMMARY: 1 error

Maybe that helps.

Yes, I’ve tried this too. All I can see is that there is some problem with a read from global memory, but I still have no idea why.

It looks like some optimization issue. If I add the volatile keyword like this:

volatile double lambda[2];

it also runs without errors. But still no answer as to why.

The issue you are seeing is due to a compiler bug. I have filed a bug report for this. Based on the analysis so far, the workaround you identified (declaring lambda volatile) is not robust. The addition of “volatile” just changes the code enough to side-step the real issue. A workaround that addresses the root cause is to declare the index variable “k” as volatile. I will caution that workarounds using “volatile” may have significant performance impact. But at least this should allow forward progress with your real application. Sorry for the inconvenience.

Please let me know if declaring the index variable volatile does not fix the problem in the context of your full application.
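Concretely, in your repro the loop would become (only the declaration of “k” changes):

```cuda
for (int i = 0; i < 2; ++i) {
	// volatile prevents the compiler from merging the computation of k
	// with the calculation of the address &lambda[k]
	volatile int k = (i ? 0 : 1);
	int rhoNeg = 0;
	double dw = (rhoNeg ? -tau : tau) * lambda[k];
	/* rest of the loop body unchanged */
}
```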

Thank you for your reply.

Declaring k volatile didn’t help for the bigger kernel, nor did removing all ternary operators from it.

So I’m going to find another way to implement it. It would be helpful if you could give me an idea of which constructs I should avoid in kernels until this bug is fixed.

The specific issue in your repro case is that portions of the computation of “k” from “i” are getting incorrectly combined with the computation of the address &lambda[k] from “k”. Declaring “k” as volatile prevents this merging by ensuring the code first finishes the computation of “k”, then computes the address &lambda[k] from “k”. The optimization in question is applied by PTXAS. One thing you could try as a workaround is to lower PTXAS optimization level from the default of -O3 to -O1 (this fixes the problem in your repro case). To do so, use nvcc commandline arguments -Xptxas -O1. Obviously there could be a potentially significant drop in performance due to this change.
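For example (paths and file names abbreviated here relative to your posted command line; the other flags stay as they are):

```shell
# Add "-Xptxas -O1" to the existing nvcc invocation to lower the
# PTXAS optimization level from the default -O3 to -O1.
nvcc -Xptxas -O1 -gencode=arch=compute_20,code=\"sm_20,compute_20\" --machine 64 --compile -o TestClasses.obj TestClasses.cu
```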

I caution that use of component-level controls such as this one is generally unsupported. They are useful for experiments and debugging of compiler issues but I would strongly advise against their use in production code.

Just to let you know: this switch works for my big kernel.