CUDA grid launch failed error

Hello, I’m trying to debug a kernel in my application. Execution stops correctly at a breakpoint placed midway through the kernel, but when I let it continue, the Parallel Nsight debugger reports the error “CUDA grid launch failed” and execution ends abruptly (forcing me to reboot the target machine).

Here is the kernel:

__global__ void proj_Kernel_for(prec_type *p, struct sba_crsm idxij, prec_type *hx, const int cnp, const int pnp, const int mnp, prec_type d_adata[])
{
	prec_type *pb, paj[PROJ_CNP], pbi[PROJ_PNP], pxij[PROJ_MNP];
	int g_idx = threadIdx.x + blockIdx.x * blockDim.x;
	int n, m, nvis;
	int strides, stage, count;
	int t_idx;
	unsigned int val, i=0, j;
	extern __shared__ unsigned int row_idx[];

	nvis = idxij.rowptr[n];

	strides = (n+1)/blockDim.x;

	//Load shared array rowptr
	for (stage=0; stage<=strides; ++stage){
		t_idx = threadIdx.x + stage * blockDim.x;
		if (t_idx < n+1)
			row_idx[t_idx] = idxij.rowptr[t_idx];
	}

	if (g_idx < nvis) //Breakpoint is placed in this line
	{
		j = idxij.colidx[g_idx];
		val = idxij.val[g_idx];

		for (count=0; count<n+1; ++count){
			if (row_idx[count] <= val)
				i = count;
		}

		for (count=0; count<cnp; ++count){
			paj[count] = p[j*cnp+count];
		}

		for (count=0; count<pnp; ++count){
			pbi[count] = pb[i*pnp+count];
		}
	}

	if (g_idx < nvis)
	{
		imgproj(j, i, paj, pbi, pxij, d_adata);
	}

	if (g_idx < nvis)
	{
		for (count=0; count<mnp; ++count){
			hx[val*mnp+count] = pxij[count];
		}
	}
}

The imgproj function is a device function defined in an #include-d header. Here it is:

__device__ __forceinline__ void imgproj(int j, int i, prec_type rt[], prec_type xyz[], prec_type m[], prec_type adata[])
{
  prec_type *qr0, *r0=adata, *a=adata+28;
  prec_type t1, t2, t3, t5, t10, t16, t22, t24, t30, t34, t39, t44, t52, t58, t61;

  t1 = pow(rt[0], (prec_type)0.2e1);
  t2 = pow(rt[1], (prec_type)0.2e1);
  t3 = pow(rt[2], (prec_type)0.2e1);
  t5 = sqrt(0.1e1 - t1 - t2 - t3);
  t10 = t5 * qr0[1] + qr0[0] * rt[0] + rt[1] * qr0[3] - rt[2] * qr0[2];
  t16 = t5 * qr0[2] + qr0[0] * rt[1] + rt[2] * qr0[1] - rt[0] * qr0[3];
  t22 = t5 * qr0[3] + qr0[0] * rt[2] + rt[0] * qr0[2] - rt[1] * qr0[1];
  t24 = -t10 * xyz[0] - t16 * xyz[1] - t22 * xyz[2];
  t30 = t5 * qr0[0] - rt[0] * qr0[1] - rt[1] * qr0[2] - rt[2] * qr0[3];
  t34 = t30 * xyz[0] + t16 * xyz[2] - t22 * xyz[1];
  t39 = t30 * xyz[1] + t22 * xyz[0] - t10 * xyz[2];
  t44 = t30 * xyz[2] + t10 * xyz[1] - t16 * xyz[0];
  t52 = -t24 * t16 + t30 * t39 - t44 * t10 + t34 * t22 + rt[4];
  t58 = -t24 * t22 + t30 * t44 - t34 * t16 + t39 * t10 + rt[5];
  t61 = 0.1e1 / t58;

  m[0] = (a[0] * (-t24 * t10 + t30 * t34 - t39 * t22 + t44 * t16 + rt[3]) + a[1] * t52 + a[2] * t58) * t61;
  m[1] = (a[3] * t52 + a[4] * t58) * t61;
}

Finally, here’s a code snippet of the launch configuration of the kernel:

    dim3 proj_block(32);
    dim3 proj_grid((nvis/32)+1);

    cudaEventRecord(for_start, 0);
    proj_Kernel_for<<<proj_grid, proj_block, (n+1)*sizeof(int)>>>(d_p, d_idxij, d_hx, cnp, pnp, mnp, d_adata);
    HANDLE_ERROR ( cudaThreadSynchronize() );
    cudaEventRecord(for_stop, 0);
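
For reference, the sizing arithmetic in the launch above can be written out as two small host-side helpers. This is only a sketch in plain C++ (which also compiles as host code in a .cu file): blocks_for and row_idx_bytes are hypothetical names that are not part of my code, and the unsigned int element type is taken from the row_idx declaration in the kernel.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not in the original code): smallest number of blocks
// of `threads_per_block` threads needed to cover `work_items` threads.
// Uses ceiling division, so no extra block is launched when work_items is
// an exact multiple of threads_per_block.
inline int blocks_for(int work_items, int threads_per_block)
{
    return (work_items + threads_per_block - 1) / threads_per_block;
}

// Hypothetical helper (not in the original code): bytes of dynamic shared
// memory needed to hold the n+1 row pointers as unsigned int, matching
// `extern __shared__ unsigned int row_idx[]` in the kernel.
inline std::size_t row_idx_bytes(int n)
{
    return static_cast<std::size_t>(n + 1) * sizeof(unsigned int);
}
```

With these, the grid would be declared as dim3 proj_grid(blocks_for(nvis, 32)) and the third launch parameter would be row_idx_bytes(n). The (nvis/32)+1 form I use is also sufficient: when nvis is a multiple of 32 it launches one extra block, but every thread in that block fails the g_idx < nvis test and does no projection work.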

EDIT: I forgot to point out that prec_type is a data type equivalent to float in this case.

I’m quite sure the issue is in the kernel itself, as I verified through host debugging that everything before the launch is correct.

I run this application on a target machine with a GeForce GTX 470, 8GB of RAM, an Intel Core 2 Quad Q9550 CPU and Windows 7 Ultimate 64-bit.

Interestingly, I first encountered this error before noticing that I was launching the kernel with an obviously wrong configuration (there weren’t enough blocks and total threads in the grid for the computation). After fixing that, I was able to hit a breakpoint inside the device function correctly. I then made a change to that function and the problem appeared again, and this time it persisted even after I reverted the change and restored the function to its previously “correct” state.

I tried various solutions, but I always ended up with the “grid launch failed” error. This confuses me: if I can break inside the kernel, the grid should already have been launched, right?

Can anyone give me some additional insight into this issue? I’ve been stuck on it for almost 2 days and I’m running out of ideas.

One last thing: I’m a beginner developer and this is my first “serious” CUDA C program, so if you notice any trivial mistake, please let me know, as I clearly wasn’t able to notice/work it out on my own.

Thanks for any help provided,