NPP library functions nppiResize_8u_C3R and nppiBGRToLab_8u_C3R differ from cv::resize() output

I have used two functions: nppiResize_8u_C3R and nppiBGRToLab_8u_C3R. For both functions, the output I get differs from that of the corresponding OpenCV function.

I believe that this is due to differences in the way the math is implemented for both these functions.

If anyone has an idea about this, please let me know. I would be very thankful.

Thanks,
Madhav

Hi Madhav,

I am currently working on your issue. Could you provide some sample code that I can run to reproduce your output?

Since you did not specify your CUDA version, system, function arguments, etc., please add those too.

I also changed the title of this thread to something more representative.

  • Fabian

Graphics - Quadro M2200
Processor - 2.80 GHz x 8
CUDA Version - 9.0

Target Platform - NVIDIA Drive PX2

The resize factor is one half in both the OpenCV and the GPU resize call. The GPU resize uses NPPI_INTER_LINEAR and the OpenCV resize uses INTER_LINEAR.

cv::Mat input = cv::imread("input.jpg", CV_LOAD_IMAGE_COLOR);

cv::Mat dst;

This is the CPU function:
// fx and fy are passed as 0 so that INTER_LINEAR lands in the interpolation argument.
cv::resize(input, dst, cv::Size(input.cols / 2, input.rows / 2), 0, 0, cv::INTER_LINEAR);

This is the GPU function:
nppiResize_8u_C3R((const Npp8u *)device_input, nSrcStep, oSrcSize, oSrcRectROI, (Npp8u *)device_output, nDstStep, oDstSize, oDstRectROI, NPPI_INTER_LINEAR);
input.jpeg

Hi Madhav,

Thanks very much for the information. I am still missing your OpenCV version.

Please describe the differences in your outputs.

  • Fabian

The OpenCV version is 3.4.0.

I have attached two files: one is the reference output from the OpenCV resize function, and the other is the GPU output from the NPP resize function.

Thanks

Madhav
refrenceoutput.txt (2.23 MB)
gpuoutput_resize.txt (2.22 MB)

Hi Madhav,

I just want to let you know that we are still working on your issue.

In addition, let me emphasize that with the new nppiResize functions you need to make sure that pSrc and pDst point to pixel (0,0) of the source and destination images respectively. Looking at your particular case, it is also important that the image sizes are set to the full sizes of the source and destination images.

This behavior is different from the way the old nppiResize worked.

That should shed some light on things for the moment. I will come back as soon as I have more input for you.

  • Fabian

Hi Madhav,

I am finally done testing. Sorry it took so long; we had planned releases to take care of.

In general terms, we are aware that our implementation does not match the OpenCV one. That is to be expected, since there is no unique solution for resizing an image.
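To illustrate why two "linear" resizers need not agree, here is a minimal host-side sketch in plain C++. It is not the NPP or OpenCV implementation: `resize_linear` and its `center_aligned` flag are hypothetical names, and the two coordinate mappings are assumptions chosen only to show that the choice of sampling position changes the result.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Bilinear sample of a row-major grayscale image at a real-valued position.
// Coordinates are clamped to the image border.
static float sample_bilinear(const std::vector<float>& img, int w, int h, float x, float y)
{
    x = std::min(std::max(x, 0.0f), static_cast<float>(w - 1));
    y = std::min(std::max(y, 0.0f), static_cast<float>(h - 1));
    const int x0 = static_cast<int>(std::floor(x));
    const int y0 = static_cast<int>(std::floor(y));
    const int x1 = std::min(x0 + 1, w - 1);
    const int y1 = std::min(y0 + 1, h - 1);
    const float fx = x - x0;
    const float fy = y - y0;
    const float top    = img[y0 * w + x0] * (1.0f - fx) + img[y0 * w + x1] * fx;
    const float bottom = img[y1 * w + x0] * (1.0f - fx) + img[y1 * w + x1] * fx;
    return top * (1.0f - fy) + bottom * fy;
}

// Hypothetical helper: bilinear resize with one of two sampling conventions.
//   center_aligned == true : src = (dst + 0.5) * scale - 0.5  (pixel centres line up)
//   center_aligned == false: src = dst * scale                (top-left samples line up)
std::vector<float> resize_linear(const std::vector<float>& src, int sw, int sh,
                                 int dw, int dh, bool center_aligned)
{
    assert(sw > 0 && sh > 0 && dw > 0 && dh > 0);
    assert(src.size() == static_cast<std::size_t>(sw) * sh);
    const float scale_x = static_cast<float>(sw) / dw;
    const float scale_y = static_cast<float>(sh) / dh;
    std::vector<float> dst(static_cast<std::size_t>(dw) * dh);
    for (int dy = 0; dy < dh; ++dy) {
        for (int dx = 0; dx < dw; ++dx) {
            const float sx = center_aligned ? (dx + 0.5f) * scale_x - 0.5f : dx * scale_x;
            const float sy = center_aligned ? (dy + 0.5f) * scale_y - 0.5f : dy * scale_y;
            dst[dy * dw + dx] = sample_bilinear(src, sw, sh, sx, sy);
        }
    }
    return dst;
}
```

For a 4x4 black image with a single white pixel (255) at (1,1), shrinking to 2x2 gives 63.75 in the top-left output pixel with the centre-aligned mapping. The corner-aligned mapping samples only source pixels (0,0), (2,0), (0,2) and (2,2), so the white pixel disappears completely.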

When down-sizing images using NPP you will get the best quality result by using NPPI_INTER_SUPER super sampling interpolation mode.

In fact NPP doesn’t guarantee good results using linear interpolation once the downsize scale factor goes beyond a factor of 3 or 4. However, NPPI_INTER_SUPER will reject a resizing call unless both axes are down-scaling.

As already mentioned, NPP does not guarantee quality results with NPPI_INTER_LINEAR once the image is shrunk too far.
I tested with images from 3x3 up to 16x16 pixels; the most extreme case has 8 black pixels and 1 white pixel. In that case a down-scaling factor of 2 is enough to demonstrate the problem.

Original 3x3 black & white image:

Resizing using OpenCV INTER_LINEAR:

Resizing using CUDA NPPI_INTER_LINEAR:

Resizing using CUDA NPPI_INTER_SUPER:

NPP already recommends NPPI_INTER_SUPER for down-scaling, however. The results are excellent and even outperform the OpenCV implementation. Be aware that I did not change the OpenCV interpolation method, so this comparison is somewhat unfair.
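As a rough intuition for why an averaging filter copes better with isolated pixels, here is a sketch of plain box (area) averaging for integer down-scaling factors. This is an assumption about the general idea behind super sampling, not NPP's actual NPPI_INTER_SUPER kernel, and `downscale_box` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: shrink a row-major grayscale image by an integer factor.
// Each destination pixel is the mean of the factor x factor source block it covers,
// so every source pixel contributes to exactly one output pixel.
std::vector<float> downscale_box(const std::vector<float>& src, int sw, int sh, int factor)
{
    assert(factor > 0 && sw > 0 && sh > 0);
    assert(sw % factor == 0 && sh % factor == 0);
    assert(src.size() == static_cast<std::size_t>(sw) * sh);
    const int dw = sw / factor;
    const int dh = sh / factor;
    std::vector<float> dst(static_cast<std::size_t>(dw) * dh, 0.0f);
    for (int dy = 0; dy < dh; ++dy) {
        for (int dx = 0; dx < dw; ++dx) {
            float sum = 0.0f;
            for (int by = 0; by < factor; ++by)
                for (int bx = 0; bx < factor; ++bx)
                    sum += src[(dy * factor + by) * sw + (dx * factor + bx)];
            dst[dy * dw + dx] = sum / static_cast<float>(factor * factor);
        }
    }
    return dst;
}
```

With a 4x4 black image holding one white pixel (255), a factor of 2 leaves 63.75 in the block that contains the white pixel, and a factor of 4 leaves 15.9375 in the single output pixel. The pixel is dimmed but never dropped, whatever its position.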

Of course, as I already mentioned, both axes must be down-scaling or NPP will reject the call. NPPI_INTER_SUPER therefore cannot be used when one dimension is up-scaling and the other is down-scaling.
This is indeed a limitation, but NPP is behaving as expected.
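The constraint can be checked on the host before the resize call is issued. The sketch below uses plain integers rather than NPP types so that it stands alone. `can_use_super_sampling` is a hypothetical helper, and treating an unchanged dimension as "not down-scaling" is an assumption, since nothing above says how NPP handles equal sizes.

```cpp
#include <cassert>

// Hypothetical helper, not part of NPP: returns true only when BOTH axes shrink,
// which is the condition described above for NPPI_INTER_SUPER to be accepted.
// Assumption: a dimension that stays the same size does not count as down-scaling.
bool can_use_super_sampling(int srcW, int srcH, int dstW, int dstH)
{
    assert(srcW > 0 && srcH > 0 && dstW > 0 && dstH > 0);
    return dstW < srcW && dstH < srcH;
}
```

A caller could use the result to choose between NPPI_INTER_SUPER and another mode such as NPPI_INTER_CUBIC or NPPI_INTER_LANCZOS for the mixed up/down case.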

You could also try NPPI_INTER_CUBIC (which is not recommended for extreme down-scaling either, but would do better here) or NPPI_INTER_LANCZOS; either would likely give better results.

  • Fabian

Hi,

It seems I’m having a related problem. When down-sizing with nppiResize_32f_C1R (decimating in both x and y), the only filter that yields anything other than a linear-interpolation-looking result is NPPI_INTER_SUPER. In fact, the ONLY filter that makes any difference at all is NPPI_INTER_SUPER.

I’d like to see the results with NPPI_INTER_LANCZOS (which, like all the other alternatives, behaves exactly like linear interpolation). These modes return errors:

Source file: 1920 x 1080, destination 480 x 270, 32 bit float grayscale (1 channel)

    // NPPI_INTER_CUBIC2P_CATMULLROM = error 22
    // NPPI_INTER_CUBIC2P_B05C03 = error 22
    // NPPI_INTER_LANCZOS3_ADVANCED = error 22
    // NPPI_INTER_CUBIC2P_BSPLINE = error 22
    // NPPI_SMOOTH_EDGE = error 22

Any ideas?

Thanks,

NPP Library Version 10.2.0
CUDA Driver Version: 10.1
CUDA Runtime Version: 10.1
Device 0: <GeForce GTX 1070 >, Compute SM 6.1 detected

Hi,

I have a 32bit float RGB 2D array, interleaved, RGBRGBRGBRGB…

The code below compiles and runs with no errors.

I am using nppiResize_32f_C3R. Can someone please take a look at the code and let me know what’s amiss? The section for single-channel grayscale / B&W works perfectly. The RGB section produces garbage. Thanks:

// NPPI_INTER_SUPER will reject a resizing call unless BOTH x/y axes are reduced in size.
// nppiMalloc & nppiFree links with -lnppisu library

// 2D pitched allocations

#include <Exceptions.h>
#include <cuda_runtime.h>
#include <npp.h>
#include <nppi.h>
#include <nppdefs.h>
#include <iostream>

#define CUDA_CALL(call) do { cudaError_t cuda_error = call; if(cuda_error != cudaSuccess) { std::cerr << "CUDA Error: " << cudaGetErrorString(cuda_error) << ", " << __FILE__ << ", line " << __LINE__ << std::endl; return(NULL);} } while(0)

float* decimate_cuda(float* readbuff, uint32_t nSrcH, uint32_t nSrcW, uint32_t nDstH, uint32_t nDstW, uint8_t byteperpixel)
{
if (byteperpixel == 1){ // source : byteperpixel == 1, Grayscale / B&W, 1 x 32 bit float, YYYY…
size_t srcStep;
size_t dstStep;
// rows = height; columns = width

    NppiSize oSrcSize = {nSrcW, nSrcH};
    NppiRect oSrcROI = {0, 0, nSrcW, nSrcH};
    float *devSrc;
    CUDA_CALL(cudaMallocPitch((void**)&devSrc, &srcStep, nSrcW * sizeof(float), nSrcH));
    CUDA_CALL(cudaMemcpy2D((void**)devSrc, srcStep,(void**)readbuff, nSrcW * sizeof(Npp32f), nSrcW * sizeof(Npp32f), nSrcH, cudaMemcpyHostToDevice));
    
    NppiSize oDstSize = {nDstW, nDstH};      
    NppiRect oDstROI = {0, 0, nDstW, nDstH};
    float *devDst;
    CUDA_CALL(cudaMallocPitch((void**)&devDst, &dstStep, nDstW * sizeof(float), nDstH));
    
    NppStatus result = nppiResize_32f_C1R(devSrc,       // Y floats
                                          srcStep,   // nSrcW * 3 for RGB, // stride / pitch
                                          oSrcSize,
                                          oSrcROI,
                                          devDst,
                                          dstStep,   // nDstW * 3 for RGB, // stride / pitch
                                          oDstSize,
                                          oDstROI,
                                          NPPI_INTER_SUPER);
    if (result != NPP_SUCCESS) {
        std::cerr << "Unable to run decimate_cuda, error " << result << std::endl;
    }
    
    Npp64s                 writesize;
    Npp32f                 *hostDst;
    writesize = (Npp64s)   nDstW * nDstH;                       // Y
    if(NULL == (hostDst = (Npp32f *)malloc(writesize * sizeof(Npp32f)))){
        printf("Error : Unable to allocate hostDst in decimate_cuda, exiting...\n");
        exit(1);
    }

    CUDA_CALL(cudaMemcpy2D(hostDst, nDstW * sizeof(Npp32f),(void**)devDst, dstStep, nDstW * sizeof(Npp32f),nDstH, cudaMemcpyDeviceToHost));

    // nppiFree(devSrc);
    // nppiFree(devDst);
    
    CUDA_CALL(cudaFree(devSrc));
    CUDA_CALL(cudaFree(devDst));
    
    return(hostDst);
}                       // source : byteperpixel == 1, Grayscale / B&W, 1 x 32 bit float, YYYY...
else if (byteperpixel == 3){ // source : byteperpixel = 3 x 32bit float interleaved RGBRGBRGB...
    size_t  srcStep; 
    size_t  dstStep;
    // rows = height; columns = width
    
    NppiSize oSrcSize = {nSrcW, nSrcH};
    NppiRect oSrcROI = {0, 0, nSrcW, nSrcH};
    float *devSrc;
    CUDA_CALL(cudaMallocPitch((void**)&devSrc, &srcStep, 3 * nSrcW * sizeof(float), nSrcH));
    CUDA_CALL(cudaMemcpy2D((void**)devSrc, srcStep, (void**)readbuff, 3 * nSrcW * sizeof(Npp32f), nSrcW * sizeof(Npp32f), nSrcH, cudaMemcpyHostToDevice));
    
    NppiSize oDstSize = {nDstW, nDstH};      
    NppiRect oDstROI = {0, 0, nDstW, nDstH}; 
    float *devDst;
    CUDA_CALL(cudaMallocPitch((void**)&devDst, &dstStep, 3 * nDstW * sizeof(float), nDstH));
    
    NppStatus result = nppiResize_32f_C3R(devSrc,       // RGB floats
                                          srcStep,   // nSrcW * 3 for RGB, // stride / pitch
                                          oSrcSize,
                                          oSrcROI,
                                          devDst,
                                          dstStep,   // nDstW * 3 for RGB, // stride / pitch
                                          oDstSize,
                                          oDstROI,
                                          NPPI_INTER_SUPER);
    if (result != NPP_SUCCESS) {
        std::cerr << "Unable to run decimate_cuda, error " << result << std::endl;
    }
    
    Npp64s                 writesize;
    Npp32f                 *hostDst;
    writesize = (Npp64s)   nDstW * nDstH * 3;                       // RGB
    if(NULL == (hostDst = (Npp32f *)malloc(writesize * sizeof(Npp32f)))){
        printf("Error : Unable to allocate hostDst in decimate_cuda, exiting...\n");
        exit(1);
    }

    CUDA_CALL(cudaMemcpy2D((void**)hostDst, nDstW * sizeof(Npp32f), (void**)devDst, dstStep, nDstW * sizeof(Npp32f),nDstH, cudaMemcpyDeviceToHost));
    
    // nppiFree(devSrc);
    // nppiFree(devDst);
    CUDA_CALL(cudaFree(devSrc));
    CUDA_CALL(cudaFree(devDst));
    
    return(hostDst);
}                       // source : byteperpixel == 3; 3 x 32bit float interleaved RGBRGBRGB...

return(0);

}

Here’s an update. The pitch / stride handling was the main source of confusion in the bad code. I got help on Stack Overflow, and the corrected code is below.

All filters now run. However, when down-sizing, the results all look the same, with the exception of NPPI_INTER_SUPER. I’d like to see the results of NPPI_INTER_LANCZOS and NPPI_INTER_LANCZOS3_ADVANCED. Any ideas would be appreciated. Thanks for reading…

#include <cuda_runtime.h>
#include <npp.h>
#include <nppi.h>
#include <nppdefs.h>
#include <iostream>
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#define CUDA_CALL(call) do { cudaError_t cuda_error = call; if(cuda_error != cudaSuccess) { std::cerr << "CUDA Error: " << cudaGetErrorString(cuda_error) << ", " << __FILE__ << ", line " << __LINE__ << std::endl; return(NULL);} } while(0)
using namespace std;
float* decimate_cuda(float* readbuff, uint32_t nSrcH, uint32_t nSrcW, uint32_t nDstH, uint32_t nDstW, uint8_t byteperpixel)
{
if (byteperpixel == 1){ // source : Grayscale, 1 x 32f
size_t srcStep;
size_t dstStep;

        NppiSize oSrcSize = {nSrcW, nSrcH};
        NppiRect oSrcROI = {0, 0, nSrcW, nSrcH};
        float *devSrc;
        CUDA_CALL(cudaMallocPitch((void**)&devSrc, &srcStep, nSrcW * sizeof(float), nSrcH));
        CUDA_CALL(cudaMemcpy2D(devSrc, srcStep,readbuff, nSrcW * sizeof(Npp32f), nSrcW * sizeof(Npp32f), nSrcH, cudaMemcpyHostToDevice));

        NppiSize oDstSize = {nDstW, nDstH};
        NppiRect oDstROI = {0, 0, nDstW, nDstH};
        float *devDst;
        CUDA_CALL(cudaMallocPitch((void**)&devDst, &dstStep, nDstW * sizeof(float), nDstH));

        NppStatus result = nppiResize_32f_C1R(devSrc,srcStep,oSrcSize,oSrcROI,devDst,dstStep,oDstSize,oDstROI,NPPI_INTER_SUPER);
        if (result != NPP_SUCCESS) {
            std::cerr << "Unable to run decimate_cuda, error " << result << std::endl;
        }

        Npp64s                 writesize;
        Npp32f                 *hostDst;
        writesize = (Npp64s)   nDstW * nDstH;         // Y
        if(NULL == (hostDst = (Npp32f *)malloc(writesize * sizeof(Npp32f)))){
            printf("Error : Unable to allocate hostDst in decimate_cuda, exiting...\n");
            exit(1);
        }

        CUDA_CALL(cudaMemcpy2D(hostDst, nDstW * sizeof(Npp32f),devDst, dstStep, nDstW * sizeof(Npp32f),nDstH, cudaMemcpyDeviceToHost));
        CUDA_CALL(cudaFree(devSrc));
        CUDA_CALL(cudaFree(devDst));
        return(hostDst);
    }                            // source : Grayscale 1 x 32f, YYYY...
    else if (byteperpixel == 3){ // source : 3 x 32f interleaved RGBRGBRGB...
        size_t  srcStep;
        size_t  dstStep;
        // rows = height; columns = width

        NppiSize oSrcSize = {nSrcW, nSrcH};
        NppiRect oSrcROI = {0, 0, nSrcW, nSrcH};
        float *devSrc;
        CUDA_CALL(cudaMallocPitch((void**)&devSrc, &srcStep, 3 * nSrcW * sizeof(float), nSrcH));
        CUDA_CALL(cudaMemcpy2D(devSrc, srcStep,readbuff, 3 * nSrcW * sizeof(Npp32f), 3*nSrcW * sizeof(Npp32f), nSrcH, cudaMemcpyHostToDevice));

        NppiSize oDstSize = {nDstW, nDstH};
        NppiRect oDstROI = {0, 0, nDstW, nDstH};
        float *devDst;
        CUDA_CALL(cudaMallocPitch((void**)&devDst, &dstStep, 3 * nDstW * sizeof(float), nDstH));

        NppStatus result = nppiResize_32f_C3R(devSrc,srcStep,oSrcSize,oSrcROI,devDst,dstStep,oDstSize,oDstROI,NPPI_INTER_SUPER);
        if (result != NPP_SUCCESS) {
            std::cerr << "Unable to run decimate_cuda, error " << result << std::endl;
        }

        Npp64s                 writesize;
        Npp32f                 *hostDst;
        writesize = (Npp64s)   nDstW * nDstH * 3;          // RGB
        if(NULL == (hostDst = (Npp32f *)malloc(writesize * sizeof(Npp32f)))){
            printf("Error : Unable to allocate hostDst in decimate_cuda, exiting...\n");
            exit(1);
        }

        CUDA_CALL(cudaMemcpy2D(hostDst, nDstW*3 * sizeof(Npp32f), devDst, dstStep, nDstW*3 * sizeof(Npp32f),nDstH, cudaMemcpyDeviceToHost));

        CUDA_CALL(cudaFree(devSrc));
        CUDA_CALL(cudaFree(devDst));
        return(hostDst);
    }        // source - 3 x 32f, interleaved RGBRGBRGB...

    return(0);
}

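int main(){
    uint32_t nSrcH = 480;
    uint32_t nSrcW = 640;
    uint8_t byteperpixel = 3;
    float *readbuff = (float *)malloc(nSrcW * nSrcH * byteperpixel * sizeof(float));
    if (readbuff == NULL) return 1;
    // Fill every pixel with the constant RGB triple (1, 2, 3).
    for (uint32_t i = 0; i < nSrcH * nSrcW; i++){
        readbuff[i*3+0] = 1.0f;
        readbuff[i*3+1] = 2.0f;
        readbuff[i*3+2] = 3.0f;
    }
    uint32_t nDstW = nSrcW/2;
    uint32_t nDstH = nSrcH/2;
    float *res = decimate_cuda(readbuff, nSrcH, nSrcW, nDstH, nDstW, byteperpixel);
    if (res == NULL) return 1;
    // A constant image should stay constant after resizing, so each output triple is expected to be (1, 2, 3).
    for (uint32_t i = 0; i < nDstH * nDstW * byteperpixel; i++)
        if (res[i] != ((i%3)+1.0f)) {std::cout << "error at: " << i << std::endl; return 0;}
    free(res);
    free(readbuff);
    return 0;
}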
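The pitch / stride fix in the corrected code comes down to one point: the width argument of cudaMemcpy2D is a byte count for a whole row, so for interleaved RGB it must include the factor of 3. The host-only sketch below mimics that copy with memcpy so it can be run without a GPU. `copy_2d` and `floats_written_per_row` are hypothetical helpers, and the 64-byte row padding is an arbitrary stand-in for whatever pitch cudaMallocPitch would return.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-side stand-in for a 2D pitched copy. The parameters follow the order of
// cudaMemcpy2D (dst, dpitch, src, spitch, width in BYTES, height), minus the copy kind.
void copy_2d(void* dst, std::size_t dpitch, const void* src, std::size_t spitch,
             std::size_t widthBytes, std::size_t height)
{
    assert(widthBytes <= dpitch && widthBytes <= spitch);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(static_cast<char*>(dst) + row * dpitch,
                    static_cast<const char*>(src) + row * spitch,
                    widthBytes);
    }
}

// Copies a tightly packed interleaved float image into a padded buffer and
// returns how many floats of the first destination row were actually written.
std::size_t floats_written_per_row(std::size_t width, std::size_t height,
                                   std::size_t channels, std::size_t widthBytes)
{
    const std::size_t rowBytes = channels * width * sizeof(float);
    const std::size_t dpitch = ((rowBytes + 63) / 64) * 64;  // assumed padding, illustration only
    std::vector<float> src(channels * width * height, 1.0f);
    std::vector<float> dst(dpitch / sizeof(float) * height, -1.0f);  // -1 marks "never written"
    copy_2d(dst.data(), dpitch, src.data(), rowBytes, widthBytes, height);
    std::size_t written = 0;
    for (std::size_t i = 0; i < dpitch / sizeof(float); ++i)
        if (dst[i] == 1.0f) ++written;
    return written;
}
```

For a 4-pixel-wide RGB row, passing `3 * 4 * sizeof(float)` as the width writes all 12 floats, while passing `4 * sizeof(float)` (the single-channel width used in the first version of the RGB branch) writes only the first 4 and leaves the rest of the row untouched.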