NPP library functions nppiResize_8u_C3R and nppiBGRToLab_8u_C3R differ from cv::resize() output

madhav.chamle · October 25, 2018, 3:28am

I have used two functions, nppiResize_8u_C3R and nppiBGRToLab_8u_C3R. For both functions, the output I get differs from that of the corresponding OpenCV function.

I believe that this is due to differences in the way the math is implemented for both these functions.

If anyone has an idea about this, please let me know. I would be very thankful.

Thanks,
Madhav

FabianWeise · October 26, 2018, 11:42am

Hi Madhav,

I am currently working on your issue and would like you to provide some sample code that I can run to reproduce your output.

Since you did not specify your CUDA version, system, function arguments, etc., please add that information too.

Moreover, I changed the title of this thread to something more representative.

Fabian

madhav.chamle · October 29, 2018, 3:49am

Graphics - Quadro M2200
Processor - 2.80 GHz x 8
CUDA Version - 9.0

Target Platform - NVIDIA Drive PX2

The resize factor is one half in both the OpenCV and the GPU resize call. The GPU resize uses NPPI_INTER_LINEAR and the OpenCV resize uses INTER_LINEAR interpolation.

cv::Mat input = cv::imread("input.jpg", CV_LOAD_IMAGE_COLOR);

cv::Mat dst;

This is the CPU function:
cv::resize(input, dst, cv::Size(input.cols / 2, input.rows / 2), 0, 0, cv::INTER_LINEAR);

This is the GPU function:
nppiResize_8u_C3R((const Npp8u *)device_input, nSrcStep, oSrcSize, oSrcRectROI, (Npp8u *)device_output, nDstStep, oDstSize, oDstRectROI, NPPI_INTER_LINEAR);

FabianWeise · October 29, 2018, 2:52pm

Hi Madhav,

Thanks a lot for the information. I am still missing your OpenCV version.

Please describe the differences in your outputs.

Fabian
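
One way to describe such differences is to report how many samples differ and by how much. The sketch below is a minimal, hypothetical helper (`compareImages` and `DiffStats` are illustrative names, not part of OpenCV or NPP). It assumes both outputs have already been copied into tightly packed 8-bit host buffers of equal size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Summary of how two equally sized 8-bit images differ.
struct DiffStats {
    std::size_t differing = 0; // number of samples whose values differ
    int maxAbsDiff = 0;        // largest absolute per-sample difference
};

// Hypothetical helper: compares two tightly packed 8-bit buffers sample by sample.
DiffStats compareImages(const std::vector<std::uint8_t>& a,
                        const std::vector<std::uint8_t>& b) {
    assert(a.size() == b.size()); // both images must have the same layout
    DiffStats s;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const int d = std::abs(int(a[i]) - int(b[i]));
        if (d != 0) ++s.differing;
        if (d > s.maxAbsDiff) s.maxAbsDiff = d;
    }
    return s;
}
```

Reporting these two numbers together with the image size would typically make it easier to tell small rounding differences apart from a different sampling scheme.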

madhav.chamle · October 30, 2018, 6:58am

The OpenCV version is 3.4.0.

I have attached two files: one is the reference output from the OpenCV resize function, and the other is the GPU output from the GPU function.

Thanks

Madhav
refrenceoutput.txt (2.23 MB)
gpuoutput_resize.txt (2.22 MB)

FabianWeise · November 27, 2018, 9:13am

Hi Madhav,

I just want to let you know that we are still working on your issue.

In addition, let me emphasize that with our new nppiResize function you need to make sure that pSrc and pDst point to pixel (0,0) of the corresponding source and destination images. Also, for your particular case, it is important that the image sizes are set to the full size of the source and destination images, respectively.

This behavior is different from the way the old nppiResize worked.
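
To illustrate the relationship between the base pointer, the full image size and the ROI, here is a host-only sketch. `Size` and `Rect` are stand-in structs with the same fields as NppiSize and NppiRect, and `roiInsideImage` / `roiByteOffset` are hypothetical helpers, not NPP functions. The assumption is that the caller passes the pointer to pixel (0,0) unchanged and lets the ROI rectangle carry the position, instead of pre-offsetting the pointer.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in types with the same fields as NppiSize / NppiRect (illustration only).
struct Size { int width; int height; };
struct Rect { int x; int y; int width; int height; };

// True when the ROI lies completely inside the full image.
bool roiInsideImage(const Size& full, const Rect& roi) {
    return roi.x >= 0 && roi.y >= 0 && roi.width > 0 && roi.height > 0 &&
           roi.x + roi.width <= full.width &&
           roi.y + roi.height <= full.height;
}

// Byte offset of the ROI's first pixel relative to pixel (0,0).
// stepBytes is the row pitch; bytesPerPixel covers all channels of one pixel.
std::size_t roiByteOffset(const Size& full, const Rect& roi,
                          std::size_t stepBytes, std::size_t bytesPerPixel) {
    assert(roiInsideImage(full, roi));
    return std::size_t(roi.y) * stepBytes + std::size_t(roi.x) * bytesPerPixel;
}
```

If the pointer were advanced by this offset and the ROI were also non-zero, the region would effectively be shifted twice, which is one plausible way to end up with unexpected output.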

That should shed some light on the matter for the moment, and I will come back as soon as I have more input for you.

Fabian

FabianWeise · December 10, 2018, 11:11pm

Hi Madhav,

Finally, I am done testing. Sorry it took so long; we had planned releases in the meantime.

In general terms, we are aware that our implementation does not match the OpenCV one, and that is to be expected, since there is no unique solution for resizing.
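
One reason there is no unique solution is that the mapping from destination to source coordinates can be defined in more than one way. The CPU-only sketch below is not the NPP or OpenCV code; it is a minimal illustration with hypothetical names (`Image`, `bilinear`, `resizeLinear`) that applies the same bilinear formula with two different coordinate conventions. To my knowledge OpenCV's INTER_LINEAR uses the center-aligned form; which convention a given NPP version uses is not stated in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Single-channel float image stored row by row without padding.
struct Image {
    int w, h;
    std::vector<float> px;
    float at(int x, int y) const {
        x = std::max(0, std::min(x, w - 1)); // replicate border pixels
        y = std::max(0, std::min(y, h - 1));
        return px[y * w + x];
    }
};

// Bilinear sample at a fractional source position (sx, sy).
float bilinear(const Image& src, float sx, float sy) {
    const int x0 = int(std::floor(sx)), y0 = int(std::floor(sy));
    const float fx = sx - x0, fy = sy - y0;
    const float top = (1 - fx) * src.at(x0, y0) + fx * src.at(x0 + 1, y0);
    const float bot = (1 - fx) * src.at(x0, y0 + 1) + fx * src.at(x0 + 1, y0 + 1);
    return (1 - fy) * top + fy * bot;
}

// Resize with bilinear interpolation. centerAligned selects the mapping:
//   true : sx = (dx + 0.5) * scale - 0.5   (pixel centers line up)
//   false: sx = dx * scale                 (top-left corners line up)
Image resizeLinear(const Image& src, int dw, int dh, bool centerAligned) {
    assert(dw > 0 && dh > 0);
    Image dst{dw, dh, std::vector<float>(dw * dh)};
    const float scaleX = float(src.w) / dw, scaleY = float(src.h) / dh;
    for (int y = 0; y < dh; ++y)
        for (int x = 0; x < dw; ++x) {
            const float sx = centerAligned ? (x + 0.5f) * scaleX - 0.5f : x * scaleX;
            const float sy = centerAligned ? (y + 0.5f) * scaleY - 0.5f : y * scaleY;
            dst.px[y * dw + x] = bilinear(src, sx, sy);
        }
    return dst;
}
```

For a 4x4 black image with a single white (255) pixel at (1,1), halving with the center-aligned mapping yields 63.75 in the top-left output pixel, while the corner-aligned mapping samples only source pixels (0,0), (2,0), (0,2) and (2,2) and returns an all-black result.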

When down-sizing images using NPP you will get the best quality result by using NPPI_INTER_SUPER super sampling interpolation mode.

In fact NPP doesn’t guarantee good results using linear interpolation once the downsize scale factor goes beyond a factor of 3 or 4. However, NPPI_INTER_SUPER will reject a resizing call unless both axes are down-scaling.

As already mentioned, NPP does not guarantee quality results when using NPPI_INTER_LINEAR once the down-scaling factor gets too small.
I have tested with images from 3x3 up to 16x16 pixels, in the most extreme case with 8 black pixels and 1 white pixel. In this extreme case, a down-scaling factor of 2 is enough to demonstrate that.
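
The reason linear interpolation degrades at larger down-scaling factors can be shown on a single row: each output sample reads only its two nearest source samples, so everything in between is ignored. The sketch below is a CPU illustration under that assumption, with hypothetical function names; `downscaleBox` is a simple block average used as a stand-in for super sampling, not the NPP implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Down-scale one row by an integer factor with linear interpolation.
// Each output sample reads only its two nearest source samples.
std::vector<float> downscaleLinear(const std::vector<float>& src, int factor) {
    const int len = int(src.size());
    assert(factor >= 1 && len % factor == 0);
    std::vector<float> dst(len / factor);
    for (int d = 0; d < len / factor; ++d) {
        const float s = (d + 0.5f) * factor - 0.5f; // center-aligned source position
        const int i0 = int(std::floor(s));
        const int i1 = std::min(i0 + 1, len - 1);
        const float f = s - i0;
        dst[d] = (1 - f) * src[i0] + f * src[i1];
    }
    return dst;
}

// Down-scale one row by averaging every block of `factor` source samples.
std::vector<float> downscaleBox(const std::vector<float>& src, int factor) {
    const int len = int(src.size());
    assert(factor >= 1 && len % factor == 0);
    std::vector<float> dst(len / factor);
    for (int d = 0; d < len / factor; ++d) {
        float sum = 0.0f;
        for (int k = 0; k < factor; ++k) sum += src[d * factor + k];
        dst[d] = sum / factor;
    }
    return dst;
}
```

With a row of eight samples whose first sample is 255 and the rest 0, a factor of 4 makes the linear version read samples 1, 2, 5 and 6 only, so the bright sample disappears, while the block average keeps a quarter of its intensity in the first output sample.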

Original 3x3 black & white image:
External Media

Resizing using OpenCV INTER_LINEAR:
External Media

Resizing using CUDA NPPI_INTER_LINEAR:
External Media

Resizing using CUDA NPPI_INTER_SUPER:
External Media

But NPP already recommends using NPPI_INTER_SUPER for down-scaling. Results from doing this are excellent and even out-perform the OpenCV implementation. Be aware that I did not change the OpenCV interpolation method, so this comparison is somewhat unfair.

Of course, as I already mentioned, both axes must be down-scaling or NPP will reject the call. Therefore NPPI_INTER_SUPER cannot be used when one dimension is up-scaling and the other is down-scaling.
This is indeed a limitation, but NPP is behaving as expected.
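
A caller can check this condition before issuing the resize. The following is a minimal sketch with hypothetical names: `Mode` is a stand-in enum rather than the NPP constants, and the strict "smaller than" comparison is an assumption, since the thread does not say whether equal sizes are accepted. Falling back to linear is only one possible choice.

```cpp
#include <cassert>

// Hypothetical stand-in for the NPP interpolation constants named in this thread.
enum class Mode { Linear, Super };

// True only when both axes shrink (assumed condition for super sampling).
bool bothAxesDownscale(int srcW, int srcH, int dstW, int dstH) {
    assert(srcW > 0 && srcH > 0 && dstW > 0 && dstH > 0);
    return dstW < srcW && dstH < srcH;
}

// Pick super sampling when it is allowed, otherwise fall back to linear.
Mode chooseMode(int srcW, int srcH, int dstW, int dstH) {
    return bothAxesDownscale(srcW, srcH, dstW, dstH) ? Mode::Super : Mode::Linear;
}
```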

You could also try NPPI_INTER_CUBIC (which is not recommended for extreme down-scaling either, but would do better here) or NPPI_INTER_LANCZOS and would likely get better results.

Fabian

chrisl4a2j · September 29, 2019, 2:29am

Hi,

It seems I’m having a related problem. When down-sizing using nppiResize_32f_C1R (decimating in both x and y), the only filter whose result does not look like plain linear interpolation is NPPI_INTER_SUPER. In fact, the ONLY filter that yields any difference at all is NPPI_INTER_SUPER.

I’d like to see the results with NPPI_INTER_LANCZOS (which, like all the other alternatives, behaves exactly like linear interpolation). The following modes fail with errors:

Source file: 1920 x 1080, destination 480 x 270, 32 bit float grayscale (1 channel)

    // NPPI_INTER_CUBIC2P_CATMULLROM = error 22
    // NPPI_INTER_CUBIC2P_B05C03 = error 22
    // NPPI_INTER_LANCZOS3_ADVANCED = error 22
    // NPPI_INTER_CUBIC2P_BSPLINE = error 22
    // NPPI_SMOOTH_EDGE = error 22

Any ideas?

Thanks,

NPP Library Version 10.2.0
CUDA Driver Version: 10.1
CUDA Runtime Version: 10.1
Device 0: <GeForce GTX 1070 >, Compute SM 6.1 detected

chrisl4a2j · September 29, 2019, 10:18pm

Hi,

I have a 32bit float RGB 2D array, interleaved, RGBRGBRGBRGB…

Compiles & runs with no errors.

I am using nppiResize_32f_C3R. Can someone please take a look at the code below and let me know what’s amiss? The section for single-channel grayscale/B&W works perfectly. The RGB code produces mumbo-jumbo. Thanks:

// NPPI_INTER_SUPER will reject a resizing call unless BOTH x/y axes are reduced in size.
// nppiMalloc & nppiFree links with -lnppisu library

// 2D pitched allocations

#include <Exceptions.h>
#include <cuda_runtime.h>
#include <npp.h>
#include <nppi.h>
#include <nppdefs.h>

#define CUDA_CALL(call) do { cudaError_t cuda_error = call; if(cuda_error != cudaSuccess) { std::cerr << "CUDA Error: " << cudaGetErrorString(cuda_error) << ", " << __FILE__ << ", line " << __LINE__ << std::endl; return(NULL);} } while(0)

float* decimate_cuda(float* readbuff, uint32_t nSrcH, uint32_t nSrcW, uint32_t nDstH, uint32_t nDstW, uint8_t byteperpixel)
{
if (byteperpixel == 1){ // source : byteperpixel == 1, Grayscale / B&W, 1 x 32 bit float, YYYY…
    size_t srcStep;
    size_t dstStep;
    // rows = height; columns = width

    NppiSize oSrcSize = {nSrcW, nSrcH};
    NppiRect oSrcROI = {0, 0, nSrcW, nSrcH};
    float *devSrc;
    CUDA_CALL(cudaMallocPitch((void**)&devSrc, &srcStep, nSrcW * sizeof(float), nSrcH));
    CUDA_CALL(cudaMemcpy2D((void**)devSrc, srcStep,(void**)readbuff, nSrcW * sizeof(Npp32f), nSrcW * sizeof(Npp32f), nSrcH, cudaMemcpyHostToDevice));
    
    NppiSize oDstSize = {nDstW, nDstH};      
    NppiRect oDstROI = {0, 0, nDstW, nDstH};
    float *devDst;
    CUDA_CALL(cudaMallocPitch((void**)&devDst, &dstStep, nDstW * sizeof(float), nDstH));
    
    NppStatus result = nppiResize_32f_C1R(devSrc,       // Y floats
                                          srcStep,   // nSrcW * 3 for RGB, // stride / pitch
                                          oSrcSize,
                                          oSrcROI,
                                          devDst,
                                          dstStep,   // nDstW * 3 for RGB, // stride / pitch
                                          oDstSize,
                                          oDstROI,
                                          NPPI_INTER_SUPER);
    if (result != NPP_SUCCESS) {
        std::cerr << "Unable to run decimate_cuda, error " << result << std::endl;
    }
    
    Npp64s                 writesize;
    Npp32f                 *hostDst;
    writesize = (Npp64s)   nDstW * nDstH;                       // Y
    if(NULL == (hostDst = (Npp32f *)malloc(writesize * sizeof(Npp32f)))){
        printf("Error : Unable to allocate hostDst in decimate_cuda, exiting...\n");
        exit(1);
    }

    CUDA_CALL(cudaMemcpy2D(hostDst, nDstW * sizeof(Npp32f),(void**)devDst, dstStep, nDstW * sizeof(Npp32f),nDstH, cudaMemcpyDeviceToHost));

    // nppiFree(devSrc);
    // nppiFree(devDst);
    
    CUDA_CALL(cudaFree(devSrc));
    CUDA_CALL(cudaFree(devDst));
    
    return(hostDst);
}                       // source : byteperpixel == 1, Grayscale / B&W, 1 x 32 bit float, YYYY...
else if (byteperpixel == 3){ // source : byteperpixel = 3 x 32bit float interleaved RGBRGBRGB...
    size_t  srcStep; 
    size_t  dstStep;
    // rows = height; columns = width
    
    NppiSize oSrcSize = {nSrcW, nSrcH};
    NppiRect oSrcROI = {0, 0, nSrcW, nSrcH};
    float *devSrc;
    CUDA_CALL(cudaMallocPitch((void**)&devSrc, &srcStep, 3 * nSrcW * sizeof(float), nSrcH));
    CUDA_CALL(cudaMemcpy2D((void**)devSrc, srcStep, (void**)readbuff, 3 * nSrcW * sizeof(Npp32f), nSrcW * sizeof(Npp32f), nSrcH, cudaMemcpyHostToDevice));
    
    NppiSize oDstSize = {nDstW, nDstH};      
    NppiRect oDstROI = {0, 0, nDstW, nDstH}; 
    float *devDst;
    CUDA_CALL(cudaMallocPitch((void**)&devDst, &dstStep, 3 * nDstW * sizeof(float), nDstH));
    
    NppStatus result = nppiResize_32f_C3R(devSrc,       // RGB floats
                                          srcStep,   // nSrcW * 3 for RGB, // stride / pitch
                                          oSrcSize,
                                          oSrcROI,
                                          devDst,
                                          dstStep,   // nDstW * 3 for RGB, // stride / pitch
                                          oDstSize,
                                          oDstROI,
                                          NPPI_INTER_SUPER);
    if (result != NPP_SUCCESS) {
        std::cerr << "Unable to run decimate_cuda, error " << result << std::endl;
    }
    
    Npp64s                 writesize;
    Npp32f                 *hostDst;
    writesize = (Npp64s)   nDstW * nDstH * 3;                       // RGB
    if(NULL == (hostDst = (Npp32f *)malloc(writesize * sizeof(Npp32f)))){
        printf("Error : Unable to allocate hostDst in decimate_cuda, exiting...\n");
        exit(1);
    }

    CUDA_CALL(cudaMemcpy2D((void**)hostDst, nDstW * sizeof(Npp32f), (void**)devDst, dstStep, nDstW * sizeof(Npp32f),nDstH, cudaMemcpyDeviceToHost));
    
    // nppiFree(devSrc);
    // nppiFree(devDst);
    CUDA_CALL(cudaFree(devSrc));
    CUDA_CALL(cudaFree(devDst));
    
    return(hostDst);
}                       // source : byteperpixel == 3; 3 x 32bit float interleaved RGBRGBRGB...

return(0);

}

chrisl4a2j · September 30, 2019, 9:36pm

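Here’s an update. The pitch / stride was the main source of confusion in the bad code, and I got help on Stack Overflow with the correct code below. In the RGB branch, the width-in-bytes argument of the host-to-device cudaMemcpy2D, and both the host pitch and the width of the device-to-host copy, have to be 3 * width * sizeof(Npp32f) rather than width * sizeof(Npp32f).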
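
The effect of the width-in-bytes argument can be reproduced on the host alone. The sketch below is a plain C++ illustration, not CUDA code: `copy2D` is a hypothetical helper that takes its arguments in the same order as cudaMemcpy2D (destination, destination pitch, source, source pitch, width in bytes, height) and copies row by row with memcpy, and `rowBytes` is an illustrative helper for the size of one interleaved row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-side illustration of a pitched 2D copy: copies `height` rows of
// `widthBytes` bytes each; rows start `srcPitch` / `dstPitch` bytes apart.
void copy2D(void* dst, std::size_t dstPitch, const void* src, std::size_t srcPitch,
            std::size_t widthBytes, std::size_t height) {
    assert(widthBytes <= dstPitch && widthBytes <= srcPitch);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < height; ++row)
        std::memcpy(d + row * dstPitch, s + row * srcPitch, widthBytes);
}

// Bytes occupied by one row of interleaved float pixels (no padding).
std::size_t rowBytes(std::size_t width, std::size_t channels) {
    return width * channels * sizeof(float);
}
```

For an interleaved RGB image, calling `copy2D` with `rowBytes(width, 3)` as both the source pitch and the width copies every channel of every row. Passing `rowBytes(width, 1)` as the width instead copies only the first third of each row, which matches the scrambled output described above.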

All filters now run. However, when down-sizing, every filter seems to produce the same result, with the exception of NPPI_INTER_SUPER. I’d like to see the results of NPPI_INTER_LANCZOS and NPPI_INTER_LANCZOS3_ADVANCED. Any ideas would be appreciated. Thanks for reading…

#include <cuda_runtime.h>
#include <npp.h>
#include <nppi.h>
#include <nppdefs.h>
#include <iostream>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#define CUDA_CALL(call) do { cudaError_t cuda_error = call; if(cuda_error != cudaSuccess) { std::cerr << "CUDA Error: " << cudaGetErrorString(cuda_error) << ", " << __FILE__ << ", line " << __LINE__ << std::endl; return(NULL);} } while(0)
using namespace std;
float* decimate_cuda(float* readbuff, uint32_t nSrcH, uint32_t nSrcW, uint32_t nDstH, uint32_t nDstW, uint8_t byteperpixel)
{
    if (byteperpixel == 1){ // source : Grayscale, 1 x 32f
        size_t srcStep;
        size_t dstStep;

        NppiSize oSrcSize = {nSrcW, nSrcH};
        NppiRect oSrcROI = {0, 0, nSrcW, nSrcH};
        float *devSrc;
        CUDA_CALL(cudaMallocPitch((void**)&devSrc, &srcStep, nSrcW * sizeof(float), nSrcH));
        CUDA_CALL(cudaMemcpy2D(devSrc, srcStep,readbuff, nSrcW * sizeof(Npp32f), nSrcW * sizeof(Npp32f), nSrcH, cudaMemcpyHostToDevice));

        NppiSize oDstSize = {nDstW, nDstH};
        NppiRect oDstROI = {0, 0, nDstW, nDstH};
        float *devDst;
        CUDA_CALL(cudaMallocPitch((void**)&devDst, &dstStep, nDstW * sizeof(float), nDstH));

        NppStatus result = nppiResize_32f_C1R(devSrc,srcStep,oSrcSize,oSrcROI,devDst,dstStep,oDstSize,oDstROI,NPPI_INTER_SUPER);
        if (result != NPP_SUCCESS) {
            std::cerr << "Unable to run decimate_cuda, error " << result << std::endl;
        }

        Npp64s                 writesize;
        Npp32f                 *hostDst;
        writesize = (Npp64s)   nDstW * nDstH;         // Y
        if(NULL == (hostDst = (Npp32f *)malloc(writesize * sizeof(Npp32f)))){
            printf("Error : Unable to allocate hostDst in decimate_cuda, exiting...\n");
            exit(1);
        }

        CUDA_CALL(cudaMemcpy2D(hostDst, nDstW * sizeof(Npp32f),devDst, dstStep, nDstW * sizeof(Npp32f),nDstH, cudaMemcpyDeviceToHost));
        CUDA_CALL(cudaFree(devSrc));
        CUDA_CALL(cudaFree(devDst));
        return(hostDst);
    }                            // source : Grayscale 1 x 32f, YYYY...
    else if (byteperpixel == 3){ // source : 3 x 32f interleaved RGBRGBRGB...
        size_t  srcStep;
        size_t  dstStep;
        // rows = height; columns = width

        NppiSize oSrcSize = {nSrcW, nSrcH};
        NppiRect oSrcROI = {0, 0, nSrcW, nSrcH};
        float *devSrc;
        CUDA_CALL(cudaMallocPitch((void**)&devSrc, &srcStep, 3 * nSrcW * sizeof(float), nSrcH));
        CUDA_CALL(cudaMemcpy2D(devSrc, srcStep,readbuff, 3 * nSrcW * sizeof(Npp32f), 3*nSrcW * sizeof(Npp32f), nSrcH, cudaMemcpyHostToDevice));

        NppiSize oDstSize = {nDstW, nDstH};
        NppiRect oDstROI = {0, 0, nDstW, nDstH};
        float *devDst;
        CUDA_CALL(cudaMallocPitch((void**)&devDst, &dstStep, 3 * nDstW * sizeof(float), nDstH));

        NppStatus result = nppiResize_32f_C3R(devSrc,srcStep,oSrcSize,oSrcROI,devDst,dstStep,oDstSize,oDstROI,NPPI_INTER_SUPER);
        if (result != NPP_SUCCESS) {
            std::cerr << "Unable to run decimate_cuda, error " << result << std::endl;
        }

        Npp64s                 writesize;
        Npp32f                 *hostDst;
        writesize = (Npp64s)   nDstW * nDstH * 3;          // RGB
        if(NULL == (hostDst = (Npp32f *)malloc(writesize * sizeof(Npp32f)))){
            printf("Error : Unable to allocate hostDst in decimate_cuda, exiting...\n");
            exit(1);
        }

        CUDA_CALL(cudaMemcpy2D(hostDst, nDstW*3 * sizeof(Npp32f), devDst, dstStep, nDstW*3 * sizeof(Npp32f),nDstH, cudaMemcpyDeviceToHost));

        CUDA_CALL(cudaFree(devSrc));
        CUDA_CALL(cudaFree(devDst));
        return(hostDst);
    }        // source - 3 x 32f, interleaved RGBRGBRGB...

    return(0);
}

int main(){
    uint32_t nSrcH = 480;
    uint32_t nSrcW = 640;
    uint8_t byteperpixel = 3;
    float *readbuff = (float *)malloc(nSrcW * nSrcH * byteperpixel * sizeof(float));
    // fill every pixel with the constant RGB value (1, 2, 3)
    for (int i = 0; i < nSrcH * nSrcW; i++){
        readbuff[i*3+0] = 1.0f;
        readbuff[i*3+1] = 2.0f;
        readbuff[i*3+2] = 3.0f;
    }
    uint32_t nDstW = nSrcW/2;
    uint32_t nDstH = nSrcH/2;
    float *res = decimate_cuda(readbuff, nSrcH, nSrcW, nDstH, nDstW, byteperpixel);
    // a constant image must stay constant after resizing
    for (int i = 0; i < nDstH * nDstW * byteperpixel; i++) if (res[i] != ((i%3)+1.0f)) {std::cout << "error at: " << i << std::endl; return 0;}
    return 0;
}
