npp nppiResize_8u_C1R gives unexpected result

shmulikh · December 31, 2019, 3:19pm

Hi,

I am trying to use nppiResize_8u_C1R on a Jetson Xavier with JetPack 4.3.
I went over the manuals and wrote a simple example, but I cannot get the expected result.
The output image looks like white noise.
The function returns without an error.
Please see the code below.
Can anyone please help?

Thanks,
Shmulik

cv::Mat cvImageLeft = cv::imread(strLeftFileName, cv::IMREAD_GRAYSCALE);
cv::Mat cvOut(cvImageLeft.rows/2,cvImageLeft.cols/2,cvImageLeft.type());
       
Npp8u * 	pSrc = cvImageLeft.data ;
int 	nSrcStep = cvImageLeft.cols; 
NppiSize 	oSrcSize;
oSrcSize.width = cvImageLeft.cols; 
oSrcSize.height = cvImageLeft.rows; 
NppiRect 	oSrcRectROI;
oSrcRectROI.width = cvImageLeft.cols;
oSrcRectROI.height = cvImageLeft.rows;
		
Npp8u * 	pDst = cvOut.data;
int 	nDstStep = cvOut.cols;
NppiSize  oDstSize;
oDstSize.width = cvOut.cols;  
oDstSize.height = cvOut.rows;  
NppiRect 	oDstRectROI;
oDstRectROI.width = cvOut.cols;
oDstRectROI.height = cvOut.rows;
int 	eInterpolation = 1; // my guess bilinear
NppStatus status;
		
status = nppiResize_8u_C1R( pSrc,  nSrcStep, oSrcSize,  oSrcRectROI, 
                            pDst,  nDstStep, oDstSize,  oDstRectROI,  eInterpolation);
								 
if(status == NPP_SUCCESS)						 
    cv::imwrite("resize.png", cvOut);
else
    throw std::runtime_error("NPP NOT SUCCESS");
		
return 0;

Robert_Crovella · December 31, 2019, 3:55pm

Please use the code formatting tools available to you. Edit your posting above, and look at the top toolbar above the edit window. Select the text that is actually code, then press the </> button to wrap it in a code marker.

There shouldn’t be any reason to guess. Use the documented enum/define value, for example NPPI_INTER_LINEAR.

https://docs.nvidia.com/cuda/npp/group__image__resize.html

NPP, like most CUDA libraries, expects its input data to be in GPU memory (and it places its output data in GPU memory as well). A cv::Mat, AFAIK, is host memory. You cannot pass data pointers obtained from a cv::Mat directly to an NPP function (you probably could if it were a cv::cuda::GpuMat). Therefore you will need to copy the input data to the device before attempting the resize, and copy the resize results back to host memory afterwards.
There may be any number of other problems as well, such as a broken CUDA install. You should run your code with cuda-memcheck, and make sure that it reports no errors, before assuming that you have coded things correctly. The error checking from NPP by itself is insufficient, because NPP, like many CUDA libraries, may launch its work asynchronously. That means a runtime error (here, from the use of incorrect pointers) will not necessarily be evident when the function returns: the function may return control to the host thread before the operation has completed, or perhaps even started.
There are CUDA sample codes that include NPP resizing; jpegNPP is one of them.

shmulikh · December 31, 2019, 4:30pm

Thanks for the prompt reply.

My new code now uses dynamic allocation, but when I try to copy to or from the allocated buffers I get a segmentation fault. I assume I should use some copy function other than memcpy, but I can’t find out which one.

Also, how can I synchronize the operations?

Shmulik.

cv::Mat cvImageLeft = cv::imread(strLeftFileName, cv::IMREAD_GRAYSCALE);
#ifdef MY_DYN_ALLOC
	int 	nSrcStep;
	Npp8u * pSrc = nppiMalloc_8u_C1(cvImageLeft.cols, cvImageLeft.rows, &nSrcStep);
	fprintf(stdout, "Before memcpy 1\n");
	memcpy(pSrc, cvImageLeft.data, cvImageLeft.rows*cvImageLeft.cols);
	fprintf(stdout, "After  memcpy 1\n");
#else
	int 	nSrcStep = cvImageLeft.cols;
	Npp8u * pSrc = cvImageLeft.data;
#endif

	NppiSize oSrcSize = {cvImageLeft.cols, cvImageLeft.rows};
	NppiRect oSrcRectROI = {cvImageLeft.cols, cvImageLeft.rows};

	// output file is scaled in 1/2 in x and y axis
	cv::Mat cvOut(cvImageLeft.rows/2,cvImageLeft.cols/2,cvImageLeft.type());
#ifdef MY_DYN_ALLOC
	int 	nDstStep = cvOut.cols;
	Npp8u * pDst = cvOut.data;
#else
	int 	nDstStep;
	Npp8u * pDst = nppiMalloc_8u_C1(cvImageLeft.cols/2, cvImageLeft.rows/2, &nDstStep);
#endif

	NppiSize  oDstSize = {cvOut.cols, cvOut.rows};
	NppiRect oDstRectROI = {cvOut.cols, cvOut.rows};

	int 	eInterpolation = NPPI_INTER_LINEAR;
	NppStatus status;
		
	status = nppiResize_8u_C1R(pSrc, nSrcStep, oSrcSize, oSrcRectROI, 
                                   pDst, nDstStep, oDstSize, oDstRectROI, 
                                   eInterpolation);
								 
	if(status == NPP_SUCCESS)
	{						 
#ifdef MY_DYN_ALLOC
		fprintf(stdout, "Before memcpy 2\n");
		memcpy(cvOut.data,pDst,cvImageLeft.cols/2*cvImageLeft.rows/2);
		fprintf(stdout, "After  memcpy 2\n");
		nppiFree(pDst);
		nppiFree(pSrc);
#endif
		cv::imwrite("resize.png", cvOut);
	}
	else
		throw std::runtime_error("NPP NOT SUCCESS");

	return 0;

Robert_Crovella · December 31, 2019, 4:54pm

Did you look at the sample code I suggested?
In order to use NPP and most CUDA libraries effectively, it’s necessary to have some working knowledge of CUDA. The copy operation you are looking for is cudaMemcpy.

In ordinary usage, cudaMemcpy is a synchronizing operation. The act of copying the results from device to host will force the previously issued device activity to complete, before the copy operation commences.

As an aside, also note that in a Jetson environment, host and device memory are physically unified. It’s often more efficient to skip the device memory allocations and the copy operations altogether, and just do your memory allocation using e.g. cudaHostAlloc. The pointers returned by cudaHostAlloc can be used directly by NPP. However, this isn’t going to be as useful or helpful if you are starting with an allocation created by cv::Mat.

If you were to use a destination memory allocation from cudaHostAlloc, you would indeed need to synchronize before expecting the results to be valid. In that case, cudaDeviceSynchronize() is one possible choice.

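Also note that between your two usages of MY_DYN_ALLOC, the sense of if and else is reversed: the source uses nppiMalloc when the macro is defined, but the destination uses nppiMalloc when it is not. As written, it looks broken to me.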
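
One more thing that may be worth checking in the snippet above: the ROI is initialized with only two values, {cols, rows}. NppiRect has four fields in the order x, y, width, height, so under C++ aggregate initialization a two-value initializer fills x and y and leaves width and height at zero. The host-only sketch below illustrates that rule; Rect and fullImageRoi are hypothetical stand-ins that only mirror the NppiRect field order, so nothing here needs the NPP headers or a GPU.

```cpp
#include <cassert>

// Hypothetical stand-in mirroring the field order of NppiRect: x, y, width, height.
struct Rect {
    int x;
    int y;
    int width;
    int height;
};

// What the two-value initializer from the snippet actually produces:
// the values land in x and y, while width and height are zero-initialized.
Rect twoValueInit(int cols, int rows) {
    Rect r = {cols, rows};
    return r;
}

// A ROI covering the whole image, with its origin at the top-left corner.
Rect fullImageRoi(int cols, int rows) {
    assert(cols > 0 && rows > 0);
    Rect r = {0, 0, cols, rows};
    return r;
}
```

For a 640x480 image (an example size, not one from this thread), twoValueInit(640, 480) describes an empty rectangle at offset (640, 480), whereas fullImageRoi(640, 480) describes the full image starting at (0, 0).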

shmulikh · January 1, 2020, 8:27am

Finally got it working.
Code is posted below.

Thanks Robert.

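// Headers this listing relies on. helper_cuda.h ships with the CUDA samples and is
// assumed here to be the source of checkCudaErrors and gpuGetMaxGflopsDeviceId.
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include <cuda_runtime.h>
#include <npp.h>
#include <helper_cuda.h>
#include <opencv2/opencv.hpp>

inline int findCudaDevice()
{
    cudaDeviceProp deviceProp;
    int devID = 0;

    // Pick the device with the highest Gflops/s
    devID = gpuGetMaxGflopsDeviceId();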
    checkCudaErrors(cudaSetDevice(devID));
    checkCudaErrors(cudaGetDeviceProperties(&deviceProp, devID));
    printf("GPU Device %d: \"%s\" with compute capability %d.%d\n\n", devID, deviceProp.name, deviceProp.major, deviceProp.minor);

    return devID;
}    


inline int cudaDeviceInit()
{
    int deviceCount;
    checkCudaErrors(cudaGetDeviceCount(&deviceCount));

    if (deviceCount == 0)
    {
        std::cerr << "CUDA error: no devices supporting CUDA." << std::endl;
        exit(EXIT_FAILURE);
    }

    int dev = findCudaDevice();

    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, dev);
    std::cerr << "cudaSetDevice GPU" << dev << " = " << deviceProp.name << std::endl;

    checkCudaErrors(cudaSetDevice(dev));

    return dev;
}

int main(int argc, char *argv[])
{

       // Load the input images
        cv::Mat cvImageLeft = cv::imread(strLeftFileName, cv::IMREAD_GRAYSCALE);
        if (cvImageLeft.empty())
        {
            throw std::runtime_error("Can't open '" + strLeftFileName + "'");
        }


        // initialize CUDA device
        int devID = cudaDeviceInit();
        if ( devID != 0) 
				throw std::runtime_error("cudaDeviceInit fail ");
		
		cudaError_t cudaRet ;
		 
		int 	nSrcStep;
		
		// need to alloc cuda memory for source
		Npp8u * pSrc = nppiMalloc_8u_C1(cvImageLeft.cols, cvImageLeft.rows, &nSrcStep);
		
		printf("nSrcStep %d \n", nSrcStep);
		
		
		
		// Copy the image from host to GPU. nppiMalloc may pad each row, so nSrcStep can be
		// larger than the image width; copy row by row (cudaMemcpy2D is an alternative)
		for(int i=0; i< cvImageLeft.rows ; i++)
		{
			cudaRet = cudaMemcpy(pSrc + i*nSrcStep, cvImageLeft.ptr(i), cvImageLeft.cols, cudaMemcpyHostToDevice);
			// check every row, not only the last one
			if (cudaRet != cudaSuccess)
				throw std::runtime_error("cudaMemcpyHostToDevice fail ");
		}
		

		// Need to define input {width height}
		NppiSize oSrcSize = {cvImageLeft.cols, cvImageLeft.rows};
		
		// Need to define input ROI  {upper left x, upper left y, ROI width, ROI height} 
		NppiRect oSrcRectROI = {0, 0, cvImageLeft.cols, cvImageLeft.rows};

        // output file is scaled in 1/2 in x and y axis
		cv::Mat cvOut(cvImageLeft.rows/2,cvImageLeft.cols/2,cvImageLeft.type());

		int 	nDstStep;
		
		// need to alloc cuda memory for destination
		Npp8u * pDst = nppiMalloc_8u_C1(cvImageLeft.cols/2, cvImageLeft.rows/2, &nDstStep);
		
		printf("nDstStep %d \n", nDstStep);

		// Need to define output {width height}
		NppiSize oDstSize = {cvOut.cols, cvOut.rows};
		
		// Need to define output ROI  {upper left x, upper left y, ROI width, ROI height} 
		NppiRect oDstRectROI = {0, 0, cvOut.cols, cvOut.rows};

		int eInterpolation = NPPI_INTER_LINEAR;
		NppStatus status;
		
		status = nppiResize_8u_C1R(pSrc, nSrcStep, oSrcSize, oSrcRectROI, 
                                   pDst, nDstStep, oDstSize, oDstRectROI,
                                   eInterpolation);

		if(status == NPP_SUCCESS)
		{ 
			
			// Copy the image from GPU to host. nDstStep can be larger than the image width
			// because nppiMalloc may pad each row; copy row by row (cudaMemcpy2D is an alternative)
			for(int i=0; i< cvOut.rows ; i++)
			{
				cudaRet = cudaMemcpy(cvOut.ptr(i), pDst + i*nDstStep, cvOut.cols, cudaMemcpyDeviceToHost);
				// check every row, not only the last one
				if (cudaRet != cudaSuccess)
					throw std::runtime_error("cudaMemcpyDeviceToHost fail ");
			}
			
			nppiFree(pDst);
			nppiFree(pSrc);
			cv::imwrite("resize.png", cvOut);
		}
		else
			throw std::runtime_error("NPP NOT SUCCESS");
		
		return 0;
}
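
A note on the row-by-row copies above: nppiMalloc_8u_C1 returns a row step (pitch) that can be larger than the image width, which is why one flat copy of rows*cols bytes does not line up with the device buffer. The host-only sketch below shows the same pitched copy pattern with plain memcpy, so it can be run without a GPU. The names roundUp, copyPitched, and roundTrip, and any alignment value passed in, are illustrative assumptions and not values taken from NPP.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Rounds value up to the next multiple of alignment (alignment must be > 0).
std::size_t roundUp(std::size_t value, std::size_t alignment) {
    assert(alignment > 0);
    return (value + alignment - 1) / alignment * alignment;
}

// Copies `height` rows of `widthBytes` bytes each between two buffers whose
// rows start `srcPitch` and `dstPitch` bytes apart.
void copyPitched(unsigned char* dst, std::size_t dstPitch,
                 const unsigned char* src, std::size_t srcPitch,
                 std::size_t widthBytes, std::size_t height) {
    assert(widthBytes <= dstPitch && widthBytes <= srcPitch);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(dst + row * dstPitch, src + row * srcPitch, widthBytes);
    }
}

// Round trip: tightly packed image -> padded buffer -> tightly packed image.
// The padded buffer plays the role of the pitched device allocation.
std::vector<unsigned char> roundTrip(const std::vector<unsigned char>& image,
                                     std::size_t width, std::size_t height,
                                     std::size_t alignment) {
    assert(image.size() == width * height);
    const std::size_t pitch = roundUp(width, alignment);
    std::vector<unsigned char> padded(pitch * height, 0);
    std::vector<unsigned char> out(width * height, 0);
    copyPitched(padded.data(), pitch, image.data(), width, width, height);
    copyPitched(out.data(), width, padded.data(), pitch, width, height);
    return out;
}
```

On the device side, cudaMemcpy2D from the CUDA runtime does this kind of copy in a single call: it takes a destination pitch, a source pitch, a width in bytes, and a height, so it could replace each of the loops above with one call and one status check.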
