CUDA grid launch failed error

Madeeks · March 11, 2011, 4:34pm

Hello, I’m trying to debug a kernel in my application. Execution stops correctly at a breakpoint placed midway through the kernel, but when I let it continue, the Parallel Nsight debugger reports the error “CUDA grid launch failed” and execution ends abruptly (forcing me to reboot the target machine).

Here is the kernel:

__global__ void proj_Kernel_for(prec_type *p, struct sba_crsm idxij, prec_type *hx, const int cnp, const int pnp, const int mnp, prec_type d_adata[])
{
	prec_type *pb, paj[PROJ_CNP], pbi[PROJ_PNP], pxij[PROJ_MNP];
	int g_idx = threadIdx.x + blockIdx.x * blockDim.x;
	int n, m, nvis;
	int strides, stage, count;
	int t_idx;
	unsigned int val, i=0, j;
	extern __shared__ unsigned int row_idx[];

	n=idxij.nr;
	m=idxij.nc;
	nvis = idxij.rowptr[n];
	pb=p+m*cnp;

	strides = (n+1)/blockDim.x;
	//Load shared array rowptr
	for (stage=0; stage<=strides; ++stage){
		t_idx = threadIdx.x + stage * blockDim.x;
		if (t_idx < n+1)
			row_idx[t_idx] = idxij.rowptr[t_idx];
	}
	__syncthreads();

	if (g_idx < nvis) //Breakpoint is placed in this line
	{
		j = idxij.colidx[g_idx];
		val = idxij.val[g_idx];
		for (count=0; count<n+1; ++count){
			if (row_idx[count] <= val)
				i = count;
		}
		for (count=0; count<cnp; ++count){
			paj[count] = p[j*cnp+count];
		}
		for (count=0; count<pnp; ++count){
			pbi[count] = pb[i*pnp+count];
		}
	}
	__syncthreads();

	if (g_idx < nvis)
	{
		imgproj(j, i, paj, pbi, pxij, d_adata);
	}
	__syncthreads();

	if (g_idx < nvis)
	{
		for (count=0; count<mnp; ++count){
			hx[val*mnp+count] = pxij[count];
		}
	}
}
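
As a cross-check for the row lookup in the kernel (the loop over row_idx that sets i), here is a minimal host-side sketch that computes the same index on the CPU. The function name rowOf and the use of std::vector are assumptions made for illustration and are not part of the original code. It only assumes that rowptr is non-decreasing, as in a compressed-row layout.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical host-side check (not part of the original code).
// Returns the last index r with rowptr[r] <= val, which is what the kernel's
// linear scan over row_idx computes, or -1 when val lies outside
// [rowptr.front(), rowptr.back()) and the scan would give an unusable row.
inline int rowOf(const std::vector<unsigned int>& rowptr, unsigned int val)
{
    if (rowptr.size() < 2 || val < rowptr.front() || val >= rowptr.back())
        return -1;
    // upper_bound yields the first entry strictly greater than val;
    // the entry just before it is the last one that is <= val.
    auto it = std::upper_bound(rowptr.begin(), rowptr.end(), val);
    return static_cast<int>(it - rowptr.begin()) - 1;
}
```

Running this over a host copy of the rowptr and val arrays before the launch should show whether any i would fall outside the range that pb[i*pnp+count] can safely index.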

imgproj is a device function defined in an #included header. Here it is:

__device__ __forceinline__ void imgproj(int j, int i, prec_type rt[], prec_type xyz[], prec_type m[], prec_type adata[])
{
  prec_type *qr0, *r0=adata, *a=adata+28;
  prec_type t1, t2, t3, t5, t10, t16, t22, t24, t30, t34, t39, t44, t52, t58, t61;

  qr0=r0+j*4;

  t1 = pow(rt[0], (prec_type)0.2e1);
  t2 = pow(rt[1], (prec_type)0.2e1);
  t3 = pow(rt[2], (prec_type)0.2e1);
  t5 = sqrt(0.1e1 - t1 - t2 - t3);
  t10 = t5 * qr0[1] + qr0[0] * rt[0] + rt[1] * qr0[3] - rt[2] * qr0[2];
  t16 = t5 * qr0[2] + qr0[0] * rt[1] + rt[2] * qr0[1] - rt[0] * qr0[3];
  t22 = t5 * qr0[3] + qr0[0] * rt[2] + rt[0] * qr0[2] - rt[1] * qr0[1];
  t24 = -t10 * xyz[0] - t16 * xyz[1] - t22 * xyz[2];
  t30 = t5 * qr0[0] - rt[0] * qr0[1] - rt[1] * qr0[2] - rt[2] * qr0[3];
  t34 = t30 * xyz[0] + t16 * xyz[2] - t22 * xyz[1];
  t39 = t30 * xyz[1] + t22 * xyz[0] - t10 * xyz[2];
  t44 = t30 * xyz[2] + t10 * xyz[1] - t16 * xyz[0];
  t52 = -t24 * t16 + t30 * t39 - t44 * t10 + t34 * t22 + rt[4];
  t58 = -t24 * t22 + t30 * t44 - t34 * t16 + t39 * t10 + rt[5];
  t61 = 0.1e1 / t58;
  m[0] = (a[0] * (-t24 * t10 + t30 * t34 - t39 * t22 + t44 * t16 + rt[3]) + a[1] * t52 + a[2] * t58) * t61;
  m[1] = (a[3] * t52 + a[4] * t58) * t61;
}

Finally, here is the launch configuration of the kernel:

    dim3 proj_block(32);
    dim3 proj_grid((nvis/32)+1);

    cudaEventRecord(for_start, 0);
    proj_Kernel_for<<<proj_grid, proj_block, (n+1)*sizeof(int)>>>(d_p, d_idxij, d_hx, cnp, pnp, mnp, d_adata);
    HANDLE_ERROR ( cudaThreadSynchronize() );
    cudaEventRecord(for_stop, 0);
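
To make the sizing in this launch easier to check, here is a minimal host-side sketch of the arithmetic behind the grid size and the dynamic shared-memory request. The helper names blocksFor, rowPtrBytes and fitsInSharedMemory are hypothetical and not part of the original code. The 48 KB default is an assumption based on the per-block shared-memory limit of compute capability 2.x devices such as the GTX 470.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not in the original code): smallest number of blocks
// such that blocks * threadsPerBlock >= totalThreads (ceiling division).
inline unsigned int blocksFor(unsigned int totalThreads, unsigned int threadsPerBlock)
{
    assert(threadsPerBlock > 0);
    return (totalThreads + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: bytes of dynamic shared memory needed to hold the
// n+1 row pointers that the kernel copies into row_idx.
inline std::size_t rowPtrBytes(unsigned int n)
{
    return (static_cast<std::size_t>(n) + 1) * sizeof(unsigned int);
}

// Hypothetical helper: true when the row pointers fit in one block's shared
// memory. The default limit of 48 KB is an assumption for compute
// capability 2.x; other devices may differ.
inline bool fitsInSharedMemory(unsigned int n, std::size_t limitBytes = 48 * 1024)
{
    return rowPtrBytes(n) <= limitBytes;
}
```

With these helpers the launch would use proj_grid(blocksFor(nvis, 32)) and pass rowPtrBytes(n) as the third launch parameter. The original (nvis/32)+1 is also sufficient; it only launches one spare block when nvis is a multiple of 32, and the g_idx < nvis guards keep that block idle.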

EDIT: I forgot to point out that prec_type is a data type equivalent to float in this case.

I’m quite sure the issue is related to the kernel, as I verified through host debugging that everything before it is correct.

I run this application on a target machine with a GeForce GTX 470, 8 GB of RAM, an Intel Core 2 Quad Q9550 CPU and Windows 7 Ultimate 64-bit.

Interestingly, I first encountered this error before noticing that I was launching the kernel with an obviously wrong configuration (the grid did not have enough blocks, and therefore threads, for the computation). After fixing that, I was able to hit a breakpoint inside the device function. I then made a change to that function and the problem appeared again, and this time it persisted even after I reverted the change and restored the function to its previously “correct” state.

I tried various solutions, but I always ended up with the “grid launch failed” error, which confuses me: if I can break inside the kernel, the grid should already have been launched, right?

Can anyone give me some additional insight into this issue? I’ve been stuck on it for almost two days and I’m running out of ideas.

One last thing: I’m a beginner developer and this is my first “serious” CUDA C program, so if you notice any trivial mistake, please let me know, as I clearly wasn’t able to spot it or work it out on my own.

Thanks for any help provided,

Madeeks
