nppiFilterGaussAdvanced_8u_C1R produces discolored band on the right

Hi,
I applied gaussian convolution to an image filled with constant value and got obvious error in the rightmost few dozen pixels. Below is the code and attached is the result.
I am using Orin 36.3.0
Thank you
nppiFilterGaussAdvanced_8u_C1R_bad

/*
g++ -I /usr/local/include/opencv4 \
    -L /usr/local/lib \
    -I /usr/local/cuda/targets/aarch64-linux/include \
    -L /usr/local/cuda/targets/aarch64-linux/lib \
    -o testNpp testNpp.cpp \
    -lopencv_core -lopencv_cudaimgproc -lopencv_imgcodecs -lopencv_imgproc \
    -lnppif -lnppc -lnppisu

width=500 height=500 outputFileName=out.png ./testNpp

*/
#include <npp.h>
#include <opencv2/opencv.hpp> //OpenCV only for cv::write, upload and download

int main(int argc, char ** argv)
{
    const char * outputFileName = getenv("outputFileName");
    if(!outputFileName)
    {
        printf("no outputFileName\n");
        return -1;
    }
    
    int width = 512;
    int height = 512;
    const char * temp = getenv("width");
    if(temp)
    {
        width = strtol(temp, NULL, 10);
    }
    temp = getenv("height");
    if(temp)
    {
        height = strtol(temp, NULL, 10);
    }
    printf("width %d height %d\n", width, height);
    
    int kernelSize {5};
    cv::cuda::GpuMat gpuInput;
    gpuInput.create(height, width, CV_8UC1);
    gpuInput.setTo(cv::Scalar(0x80));
    
    if(getenv("withCross"))
    {
        //Let's make a cross
        cv::cuda::GpuMat rect1(gpuInput, cv::Rect (0, height/2, width, 10));
        rect1.setTo(cv::Scalar(0));
        cv::cuda::GpuMat rect2(gpuInput, cv::Rect (width/2, 0, 10, height));
        rect2.setTo(cv::Scalar(255));
    }
    
    double sigma {3.0};
    cv::cuda::GpuMat gpuKernel;
    cv::Mat uf(1, kernelSize + 2, CV_32F, 0.0);//should be 2 pixel wider than the kernel.
    uf.at<float>(0, 1 + kernelSize/2) = 1.0;
    std::cout << uf << std::endl;
    cv::Mat kf(1, kernelSize, CV_32F, 0.0);
    cv::GaussianBlur(uf, kf, cv::Size(kernelSize, 1), sigma, 1);
    cv::Mat kernel = kf(cv::Rect(1, 0, kernelSize, 1));//Exclude starting and ending 0
    std::cout << kernel << std::endl;
    cv::cuda::GpuMat gpuResult;
    
    gpuKernel.upload(kernel);
    
    gpuResult.create(height, width, CV_8UC1);

    NppiPoint oSrcOffset {0,0};
    NppiSize roi { width, height};
    NppiPoint anchor { kernelSize/2, kernelSize/2 };
    printf("%p %d %p %d %p\n", gpuInput.data, (int)gpuInput.step, gpuResult.data, (int)gpuResult.step, gpuKernel.data);
    NppStatus nppStatus = nppiFilterGaussAdvanced_8u_C1R(
        gpuInput.data, gpuInput.step, 
        gpuResult.data, gpuResult.step, 
        roi, kernelSize, (const Npp32f *)gpuKernel.data);
    if(nppStatus)
    {
        printf("nppiFilterGaussAdvanced_8u_C1R failed: %d\n", nppStatus);
        return -1;
    }
    
    cv::Mat output;
    gpuResult.download(output);
    if(!cv::imwrite(outputFileName, output))
    {
        printf("cv::imwrite %s failed\n", outputFileName);
        return -1;
    }
    return 0;
}
  1. I won’t be able to debug your openCV code or the interface from openCV to npp
  2. You don’t seem to be handling borders correctly
  3. You may not have a proper kernel setup. openCV gpuMat usually creates a pitched allocation, this is not what you want.
  4. try running your code with compute-sanitizer (or cuda-memcheck if you are running on an older GPU like I am - but compute-sanitizer should be correct for your Orin) to get an idea if there is something obviously wrong with what you are doing.

Here is a non-OpenCV example that seems to work correctly, for me.

$ cat t40.cpp
#include <npp.h>
#include <iostream>
#include <cstdio>
#include "lodepng.h"

// for png encoding of output images
void encodeOneStep(const char* filename, const unsigned char* image, unsigned width, unsigned height) {
  /*Encode the image*/
  unsigned error = lodepng_encode32_file(filename, image, width, height);

  /*if there's an error, display it*/
  if(error) printf("error %u: %s\n", error, lodepng_error_text(error));
}
template <typename T>
void write_u8(const char *fname, int width, int height, T *d){
    unsigned char out_im[width*height*4];
    for (int i = 0; i < width*height; i++) {
	out_im[i*4+0] = d[i];    
	out_im[i*4+1] = d[i];    
	out_im[i*4+2] = d[i];    
	out_im[i*4+3] = 255;}    
    encodeOneStep(fname, out_im, width, height);
}

int main(int argc, char ** argv)
{
    const char * outputFileName = "out.png";
    
    int width = 500;
    int height = 500;
    printf("width %d height %d\n", width, height);
    const int kernelSize {5};
    Npp8u *gpuInput;
    cudaMalloc(&gpuInput, width*height*sizeof(Npp8u));
    cudaMemset(gpuInput, 0x040, width*height*sizeof(Npp8u)); 
    
    Npp8u *gpuResult;
    cudaMalloc(&gpuResult, (width-kernelSize)*(height-kernelSize)*sizeof(Npp8u));
    cudaMemset(gpuResult, 0x0, (width-kernelSize)*(height-kernelSize)*sizeof(Npp8u)); 
    Npp32f *gpuKernel;
    cudaMalloc(&gpuKernel, kernelSize*sizeof(Npp32f));
    Npp32f gpuKernel_h[kernelSize] = {1, 4, 7, 4, 1};
    for (int i = 0; i < kernelSize; i++) gpuKernel_h[i] /= 17.0f;
    NppiSize roi { width-kernelSize, height-kernelSize};
    cudaMemcpy(gpuKernel, gpuKernel_h, kernelSize*sizeof(Npp32f), cudaMemcpyHostToDevice);
    NppStatus nppStatus = nppiFilterGaussAdvanced_8u_C1R(
        gpuInput+(kernelSize/2)*(width+1), width, 
        gpuResult, width-kernelSize,
        roi, kernelSize, gpuKernel);
    if(nppStatus)
    {
        printf("nppiFilterGaussAdvanced_8u_C1R failed: %d\n", nppStatus);
        return -1;
    }
    Npp8u *gpuResult_h = new Npp8u[(width-kernelSize)*(height-kernelSize)];
    cudaMemcpy(gpuResult_h, gpuResult, (width-kernelSize)*(height-kernelSize)*sizeof(Npp8u), cudaMemcpyDeviceToHost);
    write_u8(outputFileName, width-kernelSize, height-kernelSize, gpuResult_h);
    return 0;
}
$ nvcc -o t40 t40.cpp lodepng.cpp -lnppif
$ cuda-memcheck ./t40
========= CUDA-MEMCHECK
width 500 height 500
========= ERROR SUMMARY: 0 errors
$

out (1)

You’ll note the output image is 495x495, rather than 500x500. You can get the lodepng stuff here.

It appears that openCV is not the problem - it allows assigning any step/pitch, for example:

gpuInput.step = gpuInput.cols;
gpuResult.step = gpuResult.cols;

and then nppiFilterGaussAdvanced_8u_C1R works OK, just like in your example.

The requirement to have pitch padded to 256 bytes is coming from Nvidia - on Orin we are supposed to use NvBufSurfaceAllocate, which always pads pitch to 256. OpenCV works fine with that pitch, but not nppiFilterGaussAdvanced, even if has parameters “nSrcStep” and “nDstStep”, which suggests that step can be larger than width. I found that I can, indeed, provide nSrcStep = 512 and get correct result, but nDstStep has to be 495, otherwise the discolored band appears.

It is interesting that there are no examples or discussions on the Web about NPP APIs. It appears that I am the first person using it after 15 years of its existence.

I don’t think there is any requirement that pitch be padded to 256 in NPP. If you want to provide an example that uses NPP and CUDA C++ and nothing else, that you think is misbehaving, I’ll take a look as time permits. Defects are always possible.

There are various questions about NPP here on these forums.
You can put the following into a chrome search bar:

site:https://forums.developer.nvidia.com/ npp

No, I meant that Orin has a requirement that image rows are padded to 256 for some operations. So, if NPP cannot handle padded rows, then I will need to include an additional step or two to convert, which is wasteful.

Yes, I googled everywhere, but prior to your answer above there were no working examples of nppiFilterGaussAdvanced_8u_C1R and no explanation that it cannot handle padded rows.

NPP can handle “padded rows” (i.e. a pitched allocation) on both the input and output arrays/data. The place where I indicated care is needed is for the filter kernel definition. That is not expected to be padded in any way. If your kernel width is 5, then you are expected to provide a data array resident in device memory that has a contiguous allocation of 5 elements. If you “pad” the end of those 5 elements, it’s not going to cause any issues, but other types of arrangement might.