cudaErrorLaunchFailure from custom kernel

cheer37 · October 15, 2014, 2:35pm

This is a module that calculates the integral image with NPP.
It returns cudaErrorLaunchFailure, and I don’t know why.
Please help me.

Host Code

void CreateIntegralImage64ByNpp(BYTE* pbImage, DWORD* pdwIntImage, __int64* pn64SqrIntImage, int nIntH, int nIntW)
{
	NppStatus xStatus;
	Npp8u* p8uImage;
	Npp32f* p32fIntImage;
	Npp64f* p64fSqrIntImage;
	NppiSize xSize;
	xSize.height = nIntH - 1;
	xSize.width = nIntW - 1;
	size_t nPitch1, nPitch2, nPitch3;
	cudaMallocPitch(&p8uImage, &nPitch1, (nIntW - 1) * sizeof(Npp8u), nIntH - 1);
	cudaMemcpy2D(p8uImage, nPitch1, pbImage, nIntW - 1, (nIntW - 1) * sizeof(Npp8u), nIntH - 1, cudaMemcpyDeviceToDevice);
	cudaMallocPitch(&p32fIntImage, &nPitch2, nIntW * sizeof(Npp32f), nIntH);
	cudaMallocPitch(&p64fSqrIntImage, &nPitch3, nIntW * sizeof(Npp64f), nIntH);
	xStatus = nppiSqrIntegral_8u32f64f_C1R(p8uImage, nPitch1, p32fIntImage, nPitch2, p64fSqrIntImage, nPitch3, xSize, 0, 0);
	cudaDeviceSynchronize();
	cudaError_t error = cudaGetLastError();
	dim3 Block(32,32);
	dim3 Grid((nIntW + Block.x - 1)/Block.x, (nIntH + Block.y - 1)/Block.y);
	convertValue<<<Grid, Block>>>(pdwIntImage, pn64SqrIntImage, p32fIntImage, nPitch2, p64fSqrIntImage, nPitch3, nIntH, nIntW);

	cudaDeviceSynchronize();
	error = cudaGetLastError(); // returns cudaErrorLaunchFailure here
	cudaFree(p8uImage);
	cudaFree(p32fIntImage);
	cudaFree(p64fSqrIntImage);
}

Device Code

__global__ void convertValue(DWORD* pdwIntImage, __int64* pn64SqrIntImage, Npp32f* pn32fIntImage, int nPitch1, Npp64f* pn64fSqrIntImage, int nPitch2, int nIntH, int nIntW)
{
	int nX = blockIdx.x * blockDim.x + threadIdx.x;
	int nY = blockIdx.y * blockDim.y + threadIdx.y;
	if ((nX > nIntW - 1) || (nY > nIntH - 1))
	{
		return;
	}
	pdwIntImage[nY * nIntW + nX] = (DWORD)pn32fIntImage[nPitch1 * nY + nX];
	pn64SqrIntImage[nY * nIntW + nX] = (__int64)pn64fSqrIntImage[nPitch2 * nY + nX];
}

Robert_Crovella · October 16, 2014, 1:40am

What happens when you run it with cuda-memcheck?

cheer37 · October 16, 2014, 2:32am

I have never used cuda-memcheck.

njuffa · October 16, 2014, 2:00pm

The most likely cause of the kernel launch failure is a bug in your code, such as an out-of-bounds memory access. cuda-memcheck is a very handy debugging tool that helps to detect out-of-bounds memory accesses and race conditions, so this seems like an excellent opportunity to familiarize yourself with it. In the simplest possible invocation you just use

cuda-memcheck [executable name] [executable command line arguments]

cheer37 · October 16, 2014, 3:31pm

Thank you all.
I ran cuda-memcheck on my application,
but it gives no useful information beyond the launch failure already reported by cudaGetLastError.
It says nothing about the suspect kernel, convertValue.
What can I do next?
Any further help would be much appreciated.

cbuchner1 · October 16, 2014, 3:49pm

cuda-memcheck should be used on debug binaries, so it can give meaningful information like the line number (and source file name) of any detected failures.

Typically this means enabling debug output with -g and disabling optimization with -O0. With nvcc, device code additionally needs -G (or -lineinfo) for cuda-memcheck to report source lines inside kernels.

cbuchner1 · October 16, 2014, 3:51pm

Does your CUDA device support block sizes of 1024 threads? What compute capability are you targeting?

cheer37 · October 16, 2014, 4:13pm

Thanks, cbuchner1.
My GPU is a GeForce GTX 750, which supports 1024 threads per block; its compute capability is 5.0.

cheer37 · October 20, 2014, 3:39pm

Hi all.
I tested the code above with cuda-memcheck. It reports errors at the kernel line pdwIntImage[nY * nIntW + nX] = (DWORD)pn32fIntImage[nPitch1 * nY + nX]
and at the cudaMemcpy2D call.
The first error message is “Invalid global read of size 4”,
and the second is “Program hit cudaErrorInvalidPitchValue (error 12) due to “invalid pitch argument” on CUDA API call to cudaMemcpy2D”.
For the first error, I think the read (DWORD)pn32fIntImage[nPitch1 * nY + nX] is invalid for some values of nX and nY.
I can make no sense of the second error.
Why do these errors happen?
Please help me.