cudaErrorLaunchFailure from custom kernel

This is a module that computes an integral image with NPP.
But it returns cudaErrorLaunchFailure, and I don't know why.
Please help me.

Host Code

void CreateIntegralImage64ByNpp(BYTE* pbImage, DWORD* pdwIntImage, __int64* pn64SqrIntImage, int nIntH, int nIntW)
{
	NppStatus xStatus;
	Npp8u* p8uImage;
	Npp32f* p32fIntImage;
	Npp64f* p64fSqrIntImage;
	NppiSize xSize;
	xSize.height = nIntH - 1;
	xSize.width = nIntW - 1;
	size_t nPitch1, nPitch2, nPitch3;
	cudaMallocPitch((void**)&p8uImage, &nPitch1, (nIntW - 1) * sizeof(Npp8u), nIntH - 1);
	cudaMemcpy2D(p8uImage, nPitch1, pbImage, nIntW - 1, (nIntW - 1) * sizeof(Npp8u), nIntH - 1, cudaMemcpyDeviceToDevice);
	cudaMallocPitch((void**)&p32fIntImage, &nPitch2, nIntW * sizeof(Npp32f), nIntH);
	cudaMallocPitch((void**)&p64fSqrIntImage, &nPitch3, nIntW * sizeof(Npp64f), nIntH);
	xStatus = nppiSqrIntegral_8u32f64f_C1R(p8uImage, nPitch1, p32fIntImage, nPitch2, p64fSqrIntImage, nPitch3, xSize, 0, 0);
	cudaError_t error = cudaGetLastError();
	dim3 Block(32, 32);
	dim3 Grid((nIntW + Block.x - 1) / Block.x, (nIntH + Block.y - 1) / Block.y);
	convertValue<<<Grid, Block>>>(pdwIntImage, pn64SqrIntImage, p32fIntImage, nPitch2, p64fSqrIntImage, nPitch3, nIntH, nIntW);

	error = cudaGetLastError(); // returns cudaErrorLaunchFailure here
}

Device Code

__global__ void convertValue(DWORD* pdwIntImage, __int64* pn64SqrIntImage, Npp32f* pn32fIntImage, int nPitch1, Npp64f* pn64fSqrIntImage, int nPitch2, int nIntH, int nIntW)
{
	int nX = blockIdx.x * blockDim.x + threadIdx.x;
	int nY = blockIdx.y * blockDim.y + threadIdx.y;
	if ((nX > nIntW - 1) || (nY > nIntH - 1))
		return;
	pdwIntImage[nY * nIntW + nX] = (DWORD)pn32fIntImage[nPitch1 * nY + nX];
	pn64SqrIntImage[nY * nIntW + nX] = (__int64)pn64fSqrIntImage[nPitch2 * nY + nX];
}

What happens when you run it with cuda-memcheck?

I have never used cuda-memcheck.

The most likely cause of the kernel launch failure is a bug in your code, such as an out-of-bounds memory access. cuda-memcheck is a very handy debugging tool that helps to detect out-of-bounds memory accesses and race conditions, so this seems like an excellent opportunity to familiarize yourself with it. In the simplest possible invocation you just use

cuda-memcheck [executable name] [executable command line arguments]
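For instance, if the program were called myApp (the binary name and arguments are placeholders):

```shell
# cuda-memcheck wraps the normal program invocation; any out-of-bounds
# access is reported with the name of the offending kernel.
cuda-memcheck ./myApp input.bmp
```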

Thank you all.
I tried cuda-memcheck on my application.
But it only gives unhelpful information, such as the cudaErrorLaunchFailure from cudaGetLastError.
There is nothing about the suspect kernel convertValue.
What can I do then?
Please help.

cuda-memcheck should be used on debug binaries, so it can give meaningful information such as the line number (and source file name) of any detected failure.

Typically this means generating debug info with -g (host code) and -G (device code) and disabling optimization with -O0.
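Assuming the application is built with nvcc, a typical build-and-check sequence might look like this (the file name is a placeholder; older CUDA toolkits shipped NPP's image routines as -lnppi):

```shell
# Build with host (-g) and device (-G) debug info so cuda-memcheck can
# map failures back to source lines. -G also disables device optimizations.
nvcc -g -G -O0 -o myApp myApp.cu -lnppi

# Run under cuda-memcheck: out-of-bounds accesses are then reported
# with kernel name, file, and line number.
cuda-memcheck ./myApp
```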

Does your CUDA device support block sizes of 1024 threads? What compute capability do you target?

Thanks cbuchner1.
My GPU is a GeForce GTX 750, so it supports 1024 threads per block, and its compute capability is 5.0.

Hi all.
In the above code, the error occurs at the line pdwIntImage[nY * nIntW + nX] = (DWORD)pn32fIntImage[nPitch1 * nY + nX]
and at the cudaMemcpy2D call in the host code.
I tested it with cuda-memcheck.
The first error message is "Invalid global read of size 4",
and the second is "Program hit cudaErrorInvalidPitchValue (error 12) due to "invalid pitch argument" on CUDA API call to cudaMemcpy2D".
So I think that (DWORD)pn32fIntImage[nPitch1 * nY + nX] is invalid for every pair of nX and nY, which would explain the first error,
but I cannot make sense of the second error.
Why do these errors happen?
Please help me.