NPP - nppiFilter_8u_C1R returns NPP_CUDA_KERNEL_EXECUTION_ERROR; debug options?

(Installation is ok: Example cuda programs work fine)

I am trying to perform simple linear filtering on a 512x512 uint8 image with a 3x3 kernel holding a single unity coefficient (just ramping up).

I have double checked the input args, and I believe they are in order.

    eStatusNPP = nppiFilter_8u_C1R(
        pSrc,                  // unsigned char* to 512x512 array, cast appropriately
        nSrcStep,              // source line step, in bytes
        pDst, nDstStep,        // destination pointer and step
        oSizeROI,              // {512, 512}
        pKernel, oKernelSize,  // 3x3 kernel of Npp32s coefficients
        oAnchor, nDivisor);

However, the execution fails with a return value of NPP_CUDA_KERNEL_EXECUTION_ERROR (-3).
At times the display driver crashes and I have to restart Windows :(
I would have expected the library to bail out more gracefully.
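For reference, here is the arithmetic I understand nppiFilter_8u_C1R to perform, a sum of products with the coefficients taken in reverse order, divided by nDivisor, written as a host-side sketch (the names and layout here are my own, not NPP's):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Host-side reference of the filter arithmetic (a sketch, not NPP code):
// dst(x,y) = (sum of coeff * src over the kernel window) / divisor,
// with the coefficients taken in reverse (180-degree flipped) order.
// Border pixels outside the valid window are left as-is.
std::vector<uint8_t> filterRef(const std::vector<uint8_t>& src, int w, int h,
                               const int* k, int kw, int kh,
                               int ax, int ay, int divisor) {
    std::vector<uint8_t> dst(src);              // copy; only interior rewritten
    for (int y = ay; y < h - (kh - 1 - ay); ++y)
        for (int x = ax; x < w - (kw - 1 - ax); ++x) {
            int sum = 0;
            for (int j = 0; j < kh; ++j)
                for (int i = 0; i < kw; ++i)    // reversed coefficient order
                    sum += k[(kh - 1 - j) * kw + (kw - 1 - i)]
                         * src[(y + j - ay) * w + (x + i - ax)];
            dst[y * w + x] = static_cast<uint8_t>(sum / divisor);
        }
    return dst;
}
```

With a 3x3 kernel whose only nonzero tap is a central 1 and divisor 1, the output should equal the input away from the border, which is the sanity check I am using.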


You say the sample programs work fine, right? So you’ve been able to compile and run the BoxFilter sample we’ve shipped with NPP? If yes, is that what you’ve based your nppiFilter_8u_C1R experiment on?


I believe I discounted the issue of padding. I did not notice anything explicit in the documentation and (incorrectly) assumed the library would ‘take care’ of it. I did have a lingering question about what kind of padding would be performed, one I hoped my experiments would answer.

I’ll give it a spin again taking care of padding and report if it’s not the issue.

Aside: I guess having to restart the computer after a crash comes with the territory of sharing a GPU with the actual display.
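To make sure I get the arithmetic right this time, this is the padding layout I intend to use, as a host-side sketch (it assumes tightly packed rows, i.e. no pitch, which I now suspect may itself be wrong):

```cpp
#include <cassert>

// Sketch of the padding arithmetic (assumes tightly packed rows, no pitch):
// a WxH image filtered with a kxk centered kernel needs p = (k-1)/2 pixels
// of border on every side, so the padded buffer is (W+2p) x (H+2p) and the
// first real image pixel sits p rows down and p pixels in.
struct PadLayout { int paddedW, paddedH, offset; };

PadLayout padLayout(int w, int h, int k) {
    int p = (k - 1) / 2;                 // border needed on each side
    return { w + 2 * p, h + 2 * p,
             p * (w + 2 * p) + p };      // linear index of image pixel (0,0)
}
```

For my 5x5 image and 3x3 kernel this gives a 7x7 padded buffer with the image starting at offset 8.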

Turns out I totally missed the concept of copying the data to the device :| Since my starting point was the documentation for nppiFilter_8u_C1R, I (again, incorrectly) assumed the library would copy the data to the device.

Digging through the code of the BoxFilter example made me realize I need to use the npp namespace functions to set up the data. It would have been nice to have some documentation for the npp namespace.

Assume I am trying to filter a 5x5 image with a 3x3 kernel.

The following setup code for nppiFilter_8u_C1R fails with KERNEL_EXECUTION:

    //imDims     = {5,5}
    //kernelDims = {3,3}
    Npp8u* pSrc = nppiMalloc_8u_C1(inputDims[0],inputDims[1], (int*)inputDims); // 7x7, padded buffer
    Npp8u* pDst = nppiMalloc_8u_C1(imDims[0],imDims[1], (int*)imDims);
    mxAssert((pSrc != 0), "Could not allocate source memory on GPU");
    mxAssert((pDst != 0), "Could not allocate destination memory on GPU");

    //Coefficients are expected to be stored in reverse order.
    //Does this need to be copied to constant memory? How?
    const Npp32s* pKernel = static_cast<Npp32s*>(mxGetData(prhs[1]));

    //Advance source pointer beyond the padding to the actual start of image
    pSrc = pSrc
         + padSize*(imDims[0]+2*padSize) //top rows
         + padSize;                      //left side pad

    //The step is the distance from one raster line to the next
    Npp32s nSrcStep = static_cast<Npp32s>(imDims[0]) + 2*padSize;

    //Destination does not have padding
    Npp32s nDstStep = static_cast<Npp32s>(imDims[0]);

    NppiSize oSizeROI;
    oSizeROI.width  = imDims[0]; //int
    oSizeROI.height = imDims[1];

    NppiSize oKernelSize;
    oKernelSize.width  = kernelDims[0];
    oKernelSize.height = kernelDims[1];

    //Must be centered, i.e. 0.5 * (width - 1) in present implementation
    NppiPoint oAnchor;
    oAnchor.x = static_cast<int>(0.5*(kernelDims[0]-1)); //TTD: is this the 'correct' way?

    //The factor by which the convolved summation from the Filter operation should be divided.
    //If equal to the sum of coefficients, this will keep the maximum result value within full scale.
    Npp32s nDivisor = 1;


Npp8u* pSrc = nppiMalloc_8u_C1(inputDims[0],inputDims[1], (int*)inputDims);

Ok, I don’t understand why you’d cast inputDims (which is a struct) to an (int *)? The third parameter of nppiMalloc is used to return the line stride (in bytes) of the 2D region allocated.

cudaMemcpy(pSrc, mxGetData(prhs[0]), inputDims[0]*inputDims[1], cudaMemcpyHostToDevice);

This looks to me like you’re trying to do a 2D memcpy; why aren’t you using cudaMemcpy2D? Given that the line stride is usually larger than pixelSize * width, your copy will not land the rows at the right offsets.
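To illustrate: cudaMemcpy2D copies one row of `width` bytes at a time, stepping by the pitch in each buffer. A host-side sketch of that row logic (the real call also takes a cudaMemcpyKind argument and operates on device memory):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Host simulation of what cudaMemcpy2D does: copy `width` bytes per row,
// advancing by `dpitch` in the destination and `spitch` in the source.
// A flat memcpy of width*height bytes into a pitched buffer would instead
// run rows together and leave the tail of the buffer untouched.
void memcpy2DRef(uint8_t* dst, size_t dpitch, const uint8_t* src,
                 size_t spitch, size_t width, size_t height) {
    for (size_t row = 0; row < height; ++row)
        std::memcpy(dst + row * dpitch, src + row * spitch, width);
}
```

After this, pixel (x, y) of a 4-wide image in a pitch-8 buffer sits at byte y*8 + x, which is exactly what the NPP step parameters describe.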

Npp32s nSrcStep = static_cast<Npp32s> (imDims[0]) + 2*padSize;

Where does padSize come from? Wouldn’t you want to use the stride that nppiMalloc actually returned, rather than computing your own padding?
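In other words, take the stride from nppiMalloc’s third out-parameter and derive all addressing from it. For an 8-bit single-channel image that is simply the following (a sketch; the step value in the test is a stand-in for whatever nppiMalloc returns):

```cpp
#include <cassert>
#include <cstddef>

// Addressing in a pitched buffer: nppiMalloc_8u_C1 returns the step (pitch)
// in bytes through its third parameter, and every row starts `stepBytes`
// after the previous one. That step, not the image width, is what should be
// passed as nSrcStep.
size_t pixelOffset(int x, int y, int stepBytes) {
    return static_cast<size_t>(y) * stepBytes + x;   // Npp8u is 1 byte/pixel
}
```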

const Npp32s* pKernel = static_cast<Npp32s*>(mxGetData(prhs[1]));

Not sure what mxGetData(prhs[1]) does, but it should return a device pointer for the kernel data. If it does not return a device pointer, you’ll have to allocate device memory and copy the kernel data up to the device, just like you’re doing for the image data.
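Host-side sketch of preparing the coefficients: per the comment in your own code they are taken in reverse order, so flip them before uploading (the upload itself, a cudaMalloc plus cudaMemcpy, is omitted here so the sketch stays host-only):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// The kernel pointer NPP receives must be a *device* pointer. Host-side prep:
// reverse the row-major coefficient order (equivalent to a 180-degree flip),
// then upload the result with cudaMalloc + cudaMemcpy before calling
// nppiFilter_8u_C1R.
std::vector<int> reverseKernel(const std::vector<int>& k) {
    std::vector<int> r(k);
    std::reverse(r.begin(), r.end());
    return r;
}
```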


I was testing the NPP boxFilter.cpp sample. I replaced the original image-load and copy-to-GPU steps with a cudaMalloc’d pointer that contained my image data, of size 512x512, uint8 type. After that, I allocated memory for the result with cudaMalloc (uint8 type), sized using the formula rows/cols - mask.width/height + 1.

The pitch in this case for the source and destination images would be equal to their first dimensions, i.e., the rows of the source image (equal to 512) and the rows of the destination image (508 for mask size 5, using the above formula for full-image filtering).

This code runs fine, but when I replace the 5x5 mask with a 3x3 or 7x7 mask I get NPP error -1. The documentation doesn’t describe the requirements of NPP functions in terms of image dimensions. I need help regarding this: I have data as uint8 type on the device and I’m interested in doing some image processing on it, but NPP is giving errors.
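For concreteness, the output-size bookkeeping I’m using is the following (a sketch of my own formula, not NPP code):

```cpp
#include <cassert>

// "Valid"-region output size for full-image filtering, per dimension:
// out = in - mask + 1 (no padding, the kernel never leaves the image).
int validSize(int in, int mask) { return in - mask + 1; }
```

So for a 512-pixel dimension I expect 510, 508, and 506 output pixels for mask sizes 3, 5, and 7 respectively; only the mask-5 case works for me.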

Could someone also help me with the concept of the anchor point: what is the valid range (is it {-1,-1} to {1,1}?) and what is the centre point ({0,0}?)? And how does this affect image processing?

I’m using NPP v1.0 with CUDA toolkit 2.3.

I also tried changing the ROI for ‘Lena.pgm’ in the original boxFilter.cpp, and it seems to me that the reference point is bottom-left: as I increased the ROI in steps, the processed image grew from the bottom-left instead of the top-left. But I always thought the device pointer returned by the NPP image class pointed to the image’s top-left pixel. That is what I have for the data I want to process with the NPP functions: it points to the first element of the matrix.
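My current working assumption, which I’d like someone to confirm, is that the anchor is given in kernel coordinates: {0,0} is the kernel’s top-left tap, valid values run from 0 to size-1 in each dimension (not {-1,-1} to {1,1}), and a centered anchor for an odd-sized kernel is (size-1)/2. In code:

```cpp
#include <cassert>

// Anchor in kernel coordinates (my understanding of the NPP convention):
// {0,0} is the kernel's top-left tap, valid values run 0..size-1, and the
// centered anchor for an odd-sized kernel is (size-1)/2 per dimension.
// Note both oAnchor.x and oAnchor.y have to be set.
int centeredAnchor(int kernelSize) { return (kernelSize - 1) / 2; }
```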