A critical problem with nppiFilter

Hi everybody

I have a problem with nppiFilter_8u_C1R. With 2-dimensional kernels it returns noise. With a 1-dimensional kernel, the source image has to be square; otherwise the destination image is shifted incorrectly. I suspect that either my graphics card (GeForce 8400 GS) or my memory allocation method is causing this.
I use nppiMalloc and cudaMemcpy to allocate memory on the device and copy data from the host.

Thank you in advance for your kind attention.

I think I need a little more information in order to help. Could you post code or the input and result images?

I’m not the original poster, but am having the same problem. Here’s some of my code.

// Print the filter parameters
printf("Step sizes: %d, %d\n", paddedImg.pitch(), oDeviceDst.pitch());
printf("ROI: %d, %d\n", imgSz.width, imgSz.height);
printf("Kernel Size: %d, %d\n", kernelSize.width, kernelSize.height);
printf("Kernel Anchor: %d, %d\n", kernelAnchor.x, kernelAnchor.y);
printf("Divisor: %d\n", intKernel->divisor);

// Copy kernel to GPU
Npp32s *d_kernel;
int step;
d_kernel = nppiMalloc_32s_C1(kernelSize.width, kernelSize.height, &step);
printf("Kernel step: %d\n", step);
cudaMemcpy2D(d_kernel, step, intKernel->val, kernelSize.width * sizeof(Npp32s),
             kernelSize.width * sizeof(Npp32s), kernelSize.height, cudaMemcpyHostToDevice);
printf("cudaMemcpy2d error status: %s\n", cudaGetErrorString(cudaGetLastError()));

// Copy the kernel back to the host and print it to check the upload
Npp32s *cpuKernel = (Npp32s*)malloc(kernelSize.width * kernelSize.height * sizeof(Npp32s));
cudaMemcpy2D(cpuKernel, kernelSize.width * sizeof(Npp32s), d_kernel, step,
             kernelSize.width * sizeof(Npp32s), kernelSize.height, cudaMemcpyDeviceToHost);
printf("cudaMemcpy2d error status: %s\n", cudaGetErrorString(cudaGetLastError()));
for (int y = 0; y < kernelSize.height; y++) {
    for (int x = 0; x < kernelSize.width; x++) {
        printf("%3d ", cpuKernel[y * kernelSize.width + x]);
    }
    printf("\n");
}

// Run the filter
eStatusNPP = nppiFilter_8u_C1R(paddedImg.data(widthOffset, heightOffset), paddedImg.pitch(),
                               oDeviceDst.data(), oDeviceDst.pitch(),
                               imgSz, d_kernel, kernelSize, kernelAnchor, intKernel->divisor);
printf("nppiFilter error status: %d\n", eStatusNPP);

This is the output from the printf statements in the code running on the 512x512 Lena image that comes with NPP. I’ll attach the output image I get as well.

Step sizes: 768, 768
ROI: 512, 512
Kernel Size: 3, 3
Kernel Anchor: 1, 1
Divisor: 9
Kernel step: 256
cudaMemcpy2d error status: no error
cudaMemcpy2d error status: no error
1 1 1
1 1 1
1 1 1
nppiFilter error status: 0
Saved image: …/…/data/Lena_unsharpMask.pgm

Looking at the code, I see one issue: you’re using a 2D memory allocation and copy for your 3x3 kernel. When you look at nppiFilter_8u_C1R(), you’ll see that there is no parameter for “kernel step”, i.e. the primitive has no way of knowing how far apart successive rows of the kernel are. This is because the function expects the kernel weights to be stored as a densely packed array. E.g. if your kernel is made up of

w00, w01, w02
w10, w11, w12
w20, w21, w22

the array you’re passing to nppiFilter_8u_C1R() should look like this:

| w00 | w01 | w02 | w10 | w11 | w12 | w20 | w21 | w22 |

and you would pass the pointer to the first element to your function. I would recommend replacing the kernel device memory allocation, which is currently a 2D alloc, with a simple cudaMalloc for the 9 * 4 = 36 bytes. Since your host kernel data is currently stored in an image, you’ll have to come up with some way of reformatting it into a contiguous chunk of data. You can use cudaMemcpy2D to do this by passing the size of a densely packed row (i.e. 3 * 4 = 12 bytes) as the destination step.

For a quick hack, you could simply replace the line:

cudaMemcpy2D(d_kernel, step, intKernel->val, kernelSize.width * sizeof(Npp32s), ...);

with:

cudaMemcpy2D(d_kernel, 12, intKernel->val, kernelSize.width * sizeof(Npp32s), ...);

Can you give that a shot and let me know what happens?
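
To make the layout difference concrete, here is a minimal host-only sketch of the pitched versus densely packed kernel. It does not call CUDA or NPP; copyRows is a hypothetical stand-in for the row-by-row copy that cudaMemcpy2D performs, and readAsDense and packKernel are illustrative names that do not exist in either library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical host-side stand-in for a cudaMemcpy2D-style copy:
// copies `rows` rows of `rowBytes` bytes, stepping by the given pitches (in bytes).
void copyRows(void* dst, std::size_t dstPitch,
              const void* src, std::size_t srcPitch,
              std::size_t rowBytes, std::size_t rows) {
    assert(dstPitch >= rowBytes && srcPitch >= rowBytes);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t y = 0; y < rows; ++y) {
        std::memcpy(d + y * dstPitch, s + y * srcPitch, rowBytes);
    }
}

// What a consumer sees if it assumes dense packing: the first w*h values
// in memory, regardless of the pitch the buffer was written with.
std::vector<std::int32_t> readAsDense(const void* buf, int w, int h) {
    std::vector<std::int32_t> out(static_cast<std::size_t>(w) * h);
    std::memcpy(out.data(), buf, out.size() * sizeof(std::int32_t));
    return out;
}

// Repack a pitched kernel into the dense layout by using the packed row
// size (w * sizeof(int32_t)) as the destination pitch.
std::vector<std::int32_t> packKernel(const void* pitched, std::size_t pitchBytes,
                                     int w, int h) {
    std::vector<std::int32_t> out(static_cast<std::size_t>(w) * h);
    const std::size_t rowBytes = static_cast<std::size_t>(w) * sizeof(std::int32_t);
    copyRows(out.data(), rowBytes, pitched, pitchBytes, rowBytes,
             static_cast<std::size_t>(h));
    return out;
}
```

With the 256-byte kernel step from the output above, readAsDense on a 3x3 kernel returns the first row of weights followed by whatever sits in the row padding, which would be consistent with the noisy result. packKernel returns all nine weights in order, which is the layout described above.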

I tried both the temporary solution you gave and using a cudaMalloc call and both work great. Thank you.


I’ve been trying to use the NPPI filter functions on an image that is read as binary data into an array. I get status = 0, which is great, but the output image looks cut off. My image is 986x400.
My code looks like this:

extern "C" void twoDseg(int frameWid,
                        int frameHei,
                        int frames,
                        int fileLen,
                        unsigned char *oneframe_buffer)
{
    // Assumption: width and height are the frame dimensions passed in
    int width = frameWid;
    int height = frameHei;

    // Upload the source frame into pitched device memory
    int sp = 0;
    Npp8u *pSI = nppiMalloc_8u_C1(width, height, &sp);
    cudaMemcpy2D(pSI, sp, oneframe_buffer, width * sizeof(unsigned char),
                 width * sizeof(unsigned char), height, cudaMemcpyHostToDevice);

    // Allocate the destination image
    int dp = 0;
    Npp8u *pDI = nppiMalloc_8u_C1(width, height, &dp);

    NppiSize mask = {5, 5};
    NppiPoint anchor = {0, 0};
    NppiSize ROI = {width - mask.width / 2, height - mask.height / 2};

    NppStatus eStatusNPP;
    eStatusNPP = nppiFilterBox_8u_C1R(pSI, sp, pDI, dp, ROI, mask, anchor);
    printf("eStatusNPP: %i \n", eStatusNPP);

    // Copy the results back to CPU
    cudaMemcpy2D(oneframe_buffer, width * sizeof(unsigned char), pDI, dp,
                 width * sizeof(unsigned char), height, cudaMemcpyDeviceToHost);
}

Can you please help me use this Gaussian filter on my image?

Thank you,
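
One possible explanation for the cut-off look is the ROI arithmetic, sketched below on the CPU only. This is not confirmed against NPP: it assumes that output pixel (x, y) averages the source pixels from (x - anchor.x, y - anchor.y) across the whole mask, and it uses truncating integer division, which may differ from NPP's rounding. Size, Point, validRoi and boxFilterRef are hypothetical names for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Size  { int width, height; };
struct Point { int x, y; };

// Largest ROI for which every mask tap stays inside the image.
Size validRoi(Size image, Size mask) {
    assert(mask.width >= 1 && mask.height >= 1);
    assert(image.width >= mask.width && image.height >= mask.height);
    return {image.width - mask.width + 1, image.height - mask.height + 1};
}

// CPU reference box filter for 8-bit, single-channel data.
// src and dst point at the first ROI pixel; steps are in bytes.
// Assumption: output (x, y) averages source pixels starting at
// (x - anchor.x, y - anchor.y) and spanning the mask.
void boxFilterRef(const std::uint8_t* src, int srcStep,
                  std::uint8_t* dst, int dstStep,
                  Size roi, Size mask, Point anchor) {
    const int area = mask.width * mask.height;
    for (int y = 0; y < roi.height; ++y) {
        for (int x = 0; x < roi.width; ++x) {
            int sum = 0;
            for (int j = 0; j < mask.height; ++j) {
                for (int i = 0; i < mask.width; ++i) {
                    const std::ptrdiff_t sy = y - anchor.y + j;
                    const std::ptrdiff_t sx = x - anchor.x + i;
                    sum += src[sy * srcStep + sx];
                }
            }
            // Truncating division; NPP's rounding may differ.
            dst[static_cast<std::ptrdiff_t>(y) * dstStep + x] =
                static_cast<std::uint8_t>(sum / area);
        }
    }
}
```

Under these assumptions, a 986x400 frame with a 5x5 mask has a valid ROI of 982x396, whereas width - mask.width/2 and height - mask.height/2 give 984x398, so the last taps would fall outside the source image. Destination pixels outside the ROI are never written, so copying the full 986x400 frame back would include an unfiltered strip at the right and bottom. With a centred anchor such as {2, 2}, the source and destination pointers would need to be offset by the anchor so the mask stays inside the image.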