Crash of kernel function with an array of device pointers as a parameter Unknown error after kernel

sirotenko · October 24, 2011, 4:19pm

I’ve spent about a day trying to understand what is wrong with my kernel function, but have no idea.

Initially I had quite long and complicated function. I was trying to localize error, but I couldn’t. Changes in different parts of it sometimes lead to no error. But there were no logic in it.

After several hours of relaunching I’ve cut of everithing except those parts which keeps this kernel crash. Here it is:

#include <stdio.h>

__global__ void Kernel(float** nww, unsigned nww_w, unsigned nww_h, float tau) 

{

	unsigned c = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

	unsigned r = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;

	if ((c < nww_w) && (r < nww_h)) {

		float  actw = 0;

		double lambda[2];

		double bestF = 10;

		double fdw(0);

		for (int i = 0; i < 2; ++i)  {

			int k = (i ? 0 : 1);

			int rhoNeg = 0;

			double dw = (rhoNeg ? -tau : tau) * lambda[k];

			double f = i ? 

				dw / (2*tau):

				dw / (2*tau);

			fdw = dw; 

			bestF = f;					

		}

		if (bestF < 10)		{

			actw = float(fdw);

		}

		nww[r][c] = 2*actw;

	}

}

void TestKernel()

{

	cudaSetDevice(0);

	//===============  Init matrix ====================

	int nHeight = 10;

	int nWidth = 10;

	float* pMatrixLinear = NULL;

	float** pMatrix = NULL;

	cudaMalloc((void**)&pMatrixLinear,nHeight*nWidth*sizeof(float));

	cudaMemset(pMatrixLinear, 0, nHeight*nWidth*sizeof(float));

	cudaMalloc((void**)&pMatrix,nHeight*sizeof(float*));

	float** tmpArray = new float*[nHeight];

	//Just assign the pointers to linear memory

	for (int i=0; i<nHeight; i++)	{

		tmpArray[i] = pMatrixLinear + i*nWidth;

	}

	//Copy this pointers to dev memory

	cudaMemcpy(pMatrix, tmpArray, nHeight*sizeof(float*), cudaMemcpyHostToDevice);

	delete[] tmpArray;

	//===============  End of Init matrix ====================

	dim3 blockSize(16, 16, 1);

	dim3 gridSize(1, 1,1);

	float tau = 0.5;

	Kernel<<<gridSize, blockSize>>>(pMatrix, nWidth, nHeight, tau);

	cudaError err;

	err = cudaGetLastError();   

	if(cudaSuccess != err){

		printf("UpdatePrimalKernel failed \n");

		exit(-1);

	}

	err = cudaThreadSynchronize();

	if(cudaSuccess != err){

		printf("cudaThreadSynchronize() failed \n");

		printf(cudaGetErrorString( err));

		printf("\n");

		exit(-1);

	}

	cudaFree(pMatrixLinear);

	cudaFree(pMatrix);

}

Output:

cudaThreadSynchronize() failed

unknown error

Press any key to continue . . .

I’ve attached this cu file for convenience. I’m using CUDA Toolkit ver 4.0.17, Win 7 64, Quadro 5000 and GeForce GTX 480 (tried on both).

Here is the command line of nvcc:

echo "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\\bin\nvcc.exe"    -gencode=arch=compute_10,code=\"sm_10,compute_10\" -gencode=arch=compute_20,code=\"sm_20,compute_20\"  --machine 64 -ccbin "c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin"    -Xcompiler "/EHsc /W3 /nologo /Od /Zi   /MDd  "  -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\\include" -maxrregcount=0  --compile -o "x64\Debug/TestClasses.vcproj.obj" TestClasses.vcproj 

           "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\\bin\nvcc.exe"    -gencode=arch=compute_10,code=\"sm_10,compute_10\" -gencode=arch=compute_20,code=\"sm_20,compute_20\"  --machine 64 -ccbin "c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin"    -Xcompiler "/EHsc /W3 /nologo /Od /Zi   /MDd  "  -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\\include" -maxrregcount=0  --compile -o "x64\Debug/TestClasses.vcproj.obj" "c:\Users\sirotenko\Documents\Visual Studio 2008\Projects\TestClasses\TestClasses\TestClasses.vcproj"

It’s almost default.

After all cuts this code have no sense, it’s just keep giving the error.

What I’ve found about this code:

Almost every change in kernel leads to no error. But I don’t see the explanation of this.
If I change internal variables from double to float - no error.
If I change nww variable from vector of pointers to linear array - no error.
If I compile it with GPU debug information - no error.
kernel.cu (1.87 KB)

cuuuuuda · October 24, 2011, 8:41pm

I can reproduce locally. Try using cuda-memcheck:

========= CUDA-MEMCHECK
calling kernel.
error error info unspecified launch failure error code 4
cudaThreadSynchronize() failed
unspecified launch failure
========= Invalid global read of size 8
========= at 0x000000f0 in Kernel
========= by thread (0,2,0) in block (0,0,0)
========= Address 0x103fffc30 is out of bounds

========= ERROR SUMMARY: 1 error

Maybe it helps.

sirotenko · October 25, 2011, 6:29am

Yes, I’ve tried this also. All I can see is that there is some problem with read from gloabl memory, but still have no idea why.

sirotenko · October 25, 2011, 6:46am

Looks like it some optimization issue. If I add the volatile keyword like this:

volatile double lambda[2];

it works also without errors. But still no answer why.

njuffa · October 25, 2011, 11:33pm

The issue you are seeing is due to a compiler bug. I have filed a bug report for this. Based on the analysis so far, the workaround you identified (declaring lambda volatile) is not robust. The addition of “volatile” just changes the code enough to side-step the real issue. A workaround that addresses the root cause is to declare the index variable “k” as volatile. I will caution that workarounds using “volatile” may have significant performance impact. But at least this should allow forward progress with your real application. Sorry for the inconvenience.

Please let me know if declaring the index variable volatile does not fix the problem in the context of your full application.

sirotenko · October 26, 2011, 1:49pm

Thank you for your reply.

Setting k as volatile didn’t help for bigger kernel. As well as removing all ternary operators from the kernel.

So I’m going to find other way to implement it. It would be helpfull if you give an idea which constucts in kernel should I avoid, until this bug will be fixed.

njuffa · October 26, 2011, 5:18pm

The specific issue in your repro case is that portions of the computation of “k” from “i” are getting incorrectly combined with the computation of the address &lambda[k] from “k”. Declaring “k” as volatile prevents this merging by ensuring the code first finishes the computation of “k”, then computes the address &lambda[k] from “k”. The optimization in question is applied by PTXAS. One thing you could try as a workaround is to lower PTXAS optimization level from the default of -O3 to -O1 (this fixes the problem in your repro case). To do so, use nvcc commandline arguments -Xptxas -O1. Obviously there could be a potentially significant drop in performance due to this change.

I caution that use of component-level controls such as this one is generally unsupported. They are useful for experiments and debugging of compiler issues but I would strongly advise against their use in production code.

sirotenko · October 28, 2011, 7:36am

The specific issue in your repro case is that portions of the computation of “k” from “i” are getting incorrectly combined with the computation of the address &lambda[k] from “k”. Declaring “k” as volatile prevents this merging by ensuring the code first finishes the computation of “k”, then computes the address &lambda[k] from “k”. The optimization in question is applied by PTXAS. One thing you could try as a workaround is to lower PTXAS optimization level from the default of -O3 to -O1 (this fixes the problem in your repro case). To do so, use nvcc commandline arguments -Xptxas -O1. Obviously there could be a potentially significant drop in performance due to this change.

I caution that use of component-level controls such as this one is generally unsupported. They are useful for experiments and debugging of compiler issues but I would strongly advise against their use in production code.

Just to let you know. This switch is works for my big kernel.

Topic		Replies	Views
Crash in kernel function CUDA Programming and Performance	5	729	August 1, 2011
Strange crashes in __device__ function CUDA Programming and Performance	4	1051	August 1, 2016
1080 does not support doubles? CUDA Programming and Performance	5	552	October 30, 2018
Cuda code performance CUDA Programming and Performance	14	3101	December 16, 2014
CUDA 1.1 Bug - Compiler crash (ptxas) w/repro CUDA Programming and Performance	16	8598	May 19, 2008
Illegal memory access crash CUDA Programming and Performance	15	4371	January 30, 2022
Silent kernel failure CUDA Programming and Performance	25	8248	May 18, 2020
Using unified memory causes system crash CUDA Programming and Performance	28	5788	February 4, 2019
Cuda application crashes works fine for small data and crashes for big data CUDA Developer Tools	0	360	December 8, 2020
Always got this warning when nvprof cuda file "This can happen if device ran out of memory or if a device kernel was stopped due to an assertion" on just HellowWorld GPU CUDA Programming and Performance	9	2552	January 31, 2019

Crash of kernel function with an array of device pointers as a parameter Unknown error after kernel

Related topics