NPP - nppiFilter_8u_C1R returns KERNEL_EXECUTION Debug options?

blueHash · February 6, 2010, 8:35pm

(Installation is ok: Example cuda programs work fine)

I am trying to perform simple linear filtering on a 512x512 uint8 image with a 3x3 single unity kernel (just ramping up).

I have double checked the input args, and I believe they are in order.

    eStatusNPP = nppiFilter_8u_C1R(
        pSrc,                 // unsigned char* to 512x512 array, cast appropriately
        nSrcStep,         
        pDst,                
        nDstStep,       
        oSizeROI,        
        pKernel,
        oKernelSize,
        oAnchor,
        nDivisor
        );

However, the execution fails with a return value of NPP_CUDA_KERNEL_EXECUTION_ERROR (-3).
At times the display driver crashes and I have to restart windows :(
I would have expected the library to bail out more gracefully.

Frank_Jargstorff · February 9, 2010, 5:10am

Hi,

you say the sample programs work fine–right? So you’ve been able to compile and run the BoxFilter sample we’ve shipped with NPP? If yes, is that what you’ve based your nppiFilter_8u_C1R experiment on?

–Frank

blueHash · February 10, 2010, 9:13pm

Hi,

I’ve thought a little more about your post. The Filter primitives are essentially based on the “mathematical convention” that convolutions are computed with the kernel’s direction inversed (s’(k) = sum s_(k+i)*k_(n-i) where n would be the kernel’s size, s’ the convoluted signal, s the original siganal i the summation index).

Therefore NPP (and IPP) have the kernel extend upward and to the left of the anchor point, not down and to the right as for example the BoxFilter would be. What this means is, that you’d have to provide “border pixels” at the top and left of the ROI, not at the bottem and right as in the BoxFilter sample.

Here’s what you should try: Offset the image pointer you’re passing to the primitive by h lines and w rows, for a filter of h x w size. You do that by computing p’ = p + h * nStepSize + w. Note that this code would only work for byte-size pixels, as the nStepSize is always given in bytes. If that’s not your case, you’d have to convert p’s type to (char *) before doing the line math and then back to its original type before doing the column math.

Hope that helps.

–Frank

I believe I discounted the issue of padding. I did not notice anything explicit in the document and (incorrectly) assumed the library would ‘take care’ of it. I did have a lingering question on what kind of padding would be performed, a question I hoped my experiments would answer.

I’ll give it a spin again taking care of pading and report if its not the issue.

Aside: I guess having to restart the comp after a crash comes with the domain of sharing a GPU for actual display.

blueHash · February 14, 2010, 6:15pm

Turns out I totally missed the concept of copying the data to the device :| Since my starting point was the doc for nppiFilter_8u_C1R, I (again, incorrectly) assumed the library would copy the data to the device.

Digging through the code of the boxfilter example made me realize I need to use the npp namespace functions to set up the data. It would have been nice to have some documentation of the npp namespace.

blueHash · February 14, 2010, 8:12pm

Assume I am trying to filter a 5x5 image with a 3x3 kernel.

The following setup code to nppiFilter_8u_C1R fails with KERNEL_EXECUTION

[codebox]

//inputDims={7,7}

//imDims   ={5,5}

//kernelDims= {3,3}

Npp8u* pSrc = nppiMalloc_8u_C1(inputDims[0],inputDims[1], (int*)inputDims);

Npp8u* pDst = nppiMalloc_8u_C1(imDims[0],imDims[1], (int*)imDims);

mxAssert((pSrc!= 0),"Could not allocate source memory on GPU");

mxAssert((pDst!= 0),"Could not allocate destination memory on GPU");

// 7x7, padded buffer

cudaMemcpy(pSrc,mxGetData(prhs[0]),inputDims[0]*inputDims[1]

,cudaMemcpyHostToDevice);

//Coeffcients are expected to be stored in reverse order.

//Does this need to be copied to constant memory? How?

const Npp32s* pKernel = static_cast<Npp32s*>(mxGetData(prhs[1]));

//Advance source pointer beyond the padding to the actual start of image

pSrc= pSrc + 

    padSize*(imDims[0]+2*padSize) //top rows

    + padSize;                    //left side pad

//the number of rows is the distance

//from one raster line to the next     

Npp32s nSrcStep = static_cast<Npp32s> (imDims[0]) + 2*padSize;

//Destination does not have padding

Npp32s nDstStep = static_cast<Npp32s> (imDims[0]);

NppiSize oSizeROI;

oSizeROI.width  = imDims[0]; //int

oSizeROI.height = imDims[1];

NppiSize oKernelSize;

oKernelSize.width  = kernelDims[0];

oKernelSize.height = kernelDims[1];

//must be centered, i.e. 0.5 * (width - 1) in present implementation

NppiPoint oAnchor; 

oAnchor.x=static_cast<int>(0.5*(kernelDims[0]-1)); //TTD: is this the 'correct' way?

oAnchor.y=static_cast<int>(0.5*(kernelDims[1]-1));

//The factor by which the convolved summation from the Filter operation should be divided.

//If equal to the sum of coefficients, this will keep the maximum result value within full scale.

Npp32s nDivisor =1;

[/codebox]

Frank_Jargstorff · March 19, 2010, 11:57pm

Npp8u* pSrc = nppiMalloc_8u_C1(inputDims[0],inputDims[1], (int*)inputDims);

Ok, I don’t understand why you’d cast inputDims (which is a struct) to an (int *)? The third parameter of the nppiMalloc is used to return the line-stride of the 2D region allocated.

cudaMemcpy(pSrc,mxGetData(prhs[0]),inputDims[0]*inputDims[1] ,cudaMemcpyHostToDevice);

This looks to me like you’re trying to do a 2D memcopy, why aren’t you using cudaMemcpy2D? Given that the line stride is usually larger than pixelSize * width, your copy will not copy all the relevant data.

Npp32s nSrcStep = static_cast<Npp32s> (imDims[0]) + 2*padSize;

Where does padSize come from? Wouldn’t you want to use the padding based on what nppiMalloc used for padding?

const Npp32s* pKernel = static_cast<Npp32s*>(mxGetData(prhs[1]));

Not sure what mxGetData(prhs[1]) does, but it should return a device pointer for the kernel data. If it does not return a device pointer, you’ll have to allocated device memory and copy the kernel data up to the device, just like you’re doing for the image data.

Jaideep · April 25, 2010, 5:12am

Hi,

I was testing NPP boxFilter.cpp. I replaced the original image load and copy to GPU memory steps with cudaMalloc pointer that contained my image data of size 512x512 uint8 type. After that , I allocated memory for result using the formula – rows/cols - mask.width/height + 1 using cudaMalloc (uint8 type).

The pitch in this case for the Source and Destination images would be equal to their first dimensions , ie. , rows of Src Image (equal to 512) and rows of Destination Image(508 for mask size 5 using the above formula for full image filtering).

This code runs fine — but when I replace 5x5 mask with a 3x3 or 7x7 mask – I get NPP error -1. The documentation doesn’t describe the requirements of NPP functions in terms of Image Dimensions. Need help regarding this – I have data as uint8 type on device and I’m interested in doing some Image processing stuff on this data — but NPP is giving errors.

Could someone also help me with the concept of Anchor pointer – what is the valid range (is it {-1,-1} to {1,1}) and what is the centre point ({0,0}??). And how this effects image processing.

I’m using NPP v1.0 with CUDA toolkit 2.3.

I also tried replacing the ROI for ‘Lena.pgm’ in original boxFilter.cpp and it seems to me that the reference point is bottom-left – as on increasing the ROI in steps – the processed image was growing from bottom-left in place of top-left. But , I always thought the device pointer returned by NPP image class was pointing to image top-left pixel. This is what I would have for the data I want to process using NPP functions. It points to the first element of the matrix.

Thanks

Topic		Replies	Views
NPP Box Filtering using NPP CUDA Programming and Performance	1	5853	May 5, 2010
A critical problem with nppiFilter CUDA Programming and Performance	6	7454	February 21, 2013
Problem about Median Filter CUDA Programming and Performance	6	1984	March 26, 2018
cuda beginner - howto simple filter CUDA Programming and Performance	1	4845	November 8, 2010
CUDA 4.0 NPP giving wrong answers : NPP bug possibly ? CUDA Programming and Performance	3	1362	April 20, 2011
NPP_TEXTURE_BIND_ERROR error with Canny edge detector.. CUDA Programming and Performance	14	11038	August 23, 2011
Help using NPP Morphological Functions CUDA Programming and Performance	5	1909	November 7, 2018
Sobel Filter using NPP CUDA Programming and Performance	2	1513	March 4, 2011
CUDA SDK Boxfilter examlpe how to use boxfilter functions? CUDA Programming and Performance	1	1433	November 17, 2021
NPP Integral Image CUDA Programming and Performance	5	3347	June 22, 2011

NPP - nppiFilter_8u_C1R returns KERNEL_EXECUTION Debug options?

Related topics