npp nppiResize_8u_C1R gives unexpected result

Hi,

I am trying to use nppiResize_8u_C1R on a Jetson Xavier with JetPack 4.3.
I went through the manuals and wrote a simple example, but I cannot get the expected result.
The output image looks like white noise.
The function returns without an error.
Please see the code below.
Can anyone help?

Thanks,
Shmulik

cv::Mat cvImageLeft = cv::imread(strLeftFileName, cv::IMREAD_GRAYSCALE);
cv::Mat cvOut(cvImageLeft.rows/2,cvImageLeft.cols/2,cvImageLeft.type());
       
Npp8u * 	pSrc = cvImageLeft.data ;
int 	nSrcStep = cvImageLeft.cols; 
NppiSize 	oSrcSize;
oSrcSize.width = cvImageLeft.cols; 
oSrcSize.height = cvImageLeft.rows; 
NppiRect 	oSrcRectROI;
oSrcRectROI.width = cvImageLeft.cols;
oSrcRectROI.height = cvImageLeft.rows;
		
Npp8u * 	pDst = cvOut.data;
int 	nDstStep = cvOut.cols;
NppiSize  oDstSize;
oDstSize.width = cvOut.cols;  
oDstSize.height = cvOut.rows;  
NppiRect 	oDstRectROI;
oDstRectROI.width = cvOut.cols;
oDstRectROI.height = cvOut.rows;
int 	eInterpolation = 1; // my guess bilinear
NppStatus status;
		
status = nppiResize_8u_C1R( pSrc,  nSrcStep, oSrcSize,  oSrcRectROI, 
                            pDst,  nDstStep, oDstSize,  oDstRectROI,  eInterpolation);
								 
if(status == NPP_SUCCESS)						 
    cv::imwrite("resize.png", cvOut);
else
    throw std::runtime_error("NPP NOT SUCCESS");
		
return 0;
  1. Please use the code formatting tools available to you. Edit your posting above, and look at the top toolbar above the edit window. Select the text that is actually code, then press the </> button to wrap it in a code marker.

  2. There shouldn’t be any reason to guess. Use the desired/documented enum/define value, for example NPPI_INTER_LINEAR:

https://docs.nvidia.com/cuda/npp/group__image__resize.html

  3. NPP, like most CUDA libraries, expects its input data to be in GPU memory (and its output data will be placed in GPU memory also). A cv::Mat, AFAIK, is host memory. You cannot pass data pointers obtained from a cv::Mat directly to an NPP function (you probably could if it were a cv::cuda::GpuMat). Therefore you will need to copy the input data to the device before attempting the resize, and you will need to copy the resize results back to host memory.

  4. There may be any number of other problems as well, such as a broken CUDA install. You should run your code with cuda-memcheck, and make sure that it reports no errors, before assuming that you have coded things correctly. The error checking from NPP by itself is insufficient because NPP, like many CUDA libraries, may issue work asynchronously, which means that a runtime error (here, from the use of incorrect pointers) will not necessarily be evident when the function returns. The function may return control to the host thread before the operation has completed, or perhaps even started.

  5. There are CUDA sample codes that include NPP resizing; jpegNPP is one of them.
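
The point about asynchronous execution can be sketched as a small helper that checks the NPP return value and then forces outstanding device work to finish, so that a late runtime error is not missed. This is only an illustration: the name checkNppAndSync is hypothetical (it is not part of NPP or CUDA), and it treats any status other than NPP_SUCCESS, including warnings, as a failure.

```cpp
#include <cuda_runtime.h>
#include <npp.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: throws if the NPP call reported a problem, or if the
// device work it issued fails once it is forced to complete.
inline void checkNppAndSync(NppStatus status, const char *what)
{
    if (status != NPP_SUCCESS)
        throw std::runtime_error(std::string(what) + ": NPP status " +
                                 std::to_string(static_cast<int>(status)));

    // Block until all previously issued device work has finished, so that
    // asynchronous errors (for example, from a bad pointer) surface here.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        throw std::runtime_error(std::string(what) + ": " + cudaGetErrorString(err));
}
```

It would be called right after the resize, e.g. checkNppAndSync(status, "nppiResize_8u_C1R"). Running under cuda-memcheck is still worthwhile, since it reports invalid accesses in more detail than an error code does.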

Thanks for the prompt reply.

My new code now uses dynamic allocation, but when I try to copy to/from the allocated buffers I get a segmentation fault. I assume I should use some copy function other than memcpy, but I can’t find out which one.

Also, how can I synchronize the operations?

Shmulik.

cv::Mat cvImageLeft = cv::imread(strLeftFileName, cv::IMREAD_GRAYSCALE);
#ifdef MY_DYN_ALLOC
	int 	nSrcStep;
	Npp8u * pSrc = nppiMalloc_8u_C1(cvImageLeft.cols, cvImageLeft.rows, &nSrcStep);
	fprintf(stdout, "Before memcpy 1\n");
	memcpy(pSrc, cvImageLeft.data, cvImageLeft.rows*cvImageLeft.cols);
	fprintf(stdout, "After  memcpy 1\n");
#else
	int 	nSrcStep = cvImageLeft.cols;
	Npp8u * pSrc = cvImageLeft.data;
#endif

	NppiSize oSrcSize = {cvImageLeft.cols, cvImageLeft.rows};
	NppiRect oSrcRectROI = {cvImageLeft.cols, cvImageLeft.rows};

	// output file is scaled in 1/2 in x and y axis
	cv::Mat cvOut(cvImageLeft.rows/2,cvImageLeft.cols/2,cvImageLeft.type());
#ifdef MY_DYN_ALLOC
	int 	nDstStep = cvOut.cols;
	Npp8u * pDst = cvOut.data;
#else
	int 	nDstStep;
	Npp8u * pDst = nppiMalloc_8u_C1(cvImageLeft.cols/2, cvImageLeft.rows/2, &nDstStep);
#endif

	NppiSize  oDstSize = {cvOut.cols, cvOut.rows};
	NppiRect oDstRectROI = {cvOut.cols, cvOut.rows};

	int 	eInterpolation = NPPI_INTER_LINEAR;
	NppStatus status;
		
	status = nppiResize_8u_C1R(pSrc, nSrcStep, oSrcSize, oSrcRectROI, 
                                   pDst, nDstStep, oDstSize, oDstRectROI, 
                                   eInterpolation);
								 
	if(status == NPP_SUCCESS)
	{						 
#ifdef MY_DYN_ALLOC
		fprintf(stdout, "Before memcpy 2\n");
		memcpy(cvOut.data,pDst,cvImageLeft.cols/2*cvImageLeft.rows/2);
		fprintf(stdout, "After  memcpy 2\n");
		nppiFree(pDst);
		nppiFree(pSrc);
#endif
		cv::imwrite("resize.png", cvOut);
	}
	else
		throw std::runtime_error("NPP NOT SUCCESS");

	return 0;

Did you look at the sample code I suggested?
In order to use NPP and most CUDA libraries effectively, it’s necessary to have some working knowledge of CUDA. The copy operation you are looking for is cudaMemcpy.

In ordinary usage, cudaMemcpy is a synchronizing operation. The act of copying the results from device to host will force the previously issued device activity to complete, before the copy operation commences.

As an aside, also note that in a Jetson environment, host and device memory are physically unified. It’s often more efficient to skip the device memory allocations and the copy operations altogether, and just do your memory allocation using e.g. cudaHostAlloc. The pointers returned by cudaHostAlloc can be used directly by NPP. However, this isn’t going to be as useful or helpful if you are starting with an allocation created by cv::Mat.

If you were to use a destination memory allocation from cudaHostAlloc, you would indeed need to synchronize before expecting the results to be valid. In that case, cudaDeviceSynchronize() is one possible choice.
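
A minimal sketch of that cudaHostAlloc variant follows. It is not taken from any NVIDIA sample. It assumes a Jetson-style platform with unified addressing, so that the pinned host pointers are also valid on the device, and it uses a made-up 640x480 image filled with a constant value instead of a real cv::Mat.

```cpp
#include <cuda_runtime.h>
#include <npp.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main()
{
    // Hypothetical image size; a real program would take it from its input.
    const int srcW = 640, srcH = 480;
    const int dstW = srcW / 2, dstH = srcH / 2;

    Npp8u *pSrc = nullptr;
    Npp8u *pDst = nullptr;

    // Pinned, mapped host allocations. Assumption: with unified addressing
    // these pointers can be handed to NPP as they are.
    if (cudaHostAlloc(reinterpret_cast<void **>(&pSrc),
                      static_cast<std::size_t>(srcW) * srcH, cudaHostAllocMapped) != cudaSuccess ||
        cudaHostAlloc(reinterpret_cast<void **>(&pDst),
                      static_cast<std::size_t>(dstW) * dstH, cudaHostAllocMapped) != cudaSuccess)
    {
        std::fprintf(stderr, "cudaHostAlloc failed\n");
        return 1;
    }

    // Stand-in for real pixel data.
    std::memset(pSrc, 128, static_cast<std::size_t>(srcW) * srcH);

    NppiSize oSrcSize = {srcW, srcH};
    NppiRect oSrcRectROI = {0, 0, srcW, srcH};
    NppiSize oDstSize = {dstW, dstH};
    NppiRect oDstRectROI = {0, 0, dstW, dstH};

    // Rows are tightly packed here, so each step equals the width in bytes.
    NppStatus status = nppiResize_8u_C1R(pSrc, srcW, oSrcSize, oSrcRectROI,
                                         pDst, dstW, oDstSize, oDstRectROI,
                                         NPPI_INTER_LINEAR);

    // No cudaMemcpy forces completion in this variant, so synchronize
    // explicitly before reading pDst on the host.
    cudaError_t err = cudaDeviceSynchronize();

    const bool ok = (status == NPP_SUCCESS && err == cudaSuccess);
    if (ok)
        std::printf("center pixel: %d\n", pDst[(dstH / 2) * dstW + dstW / 2]);
    else
        std::fprintf(stderr, "resize failed: NPP status %d, CUDA error %s\n",
                     static_cast<int>(status), cudaGetErrorString(err));

    cudaFreeHost(pDst);
    cudaFreeHost(pSrc);
    return ok ? 0 : 1;
}
```

The trade-off is the one described above: no copies and no separate device allocations, at the cost of an explicit synchronization before the host touches the result.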

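Also note that your two usages of MY_DYN_ALLOC have the sense of if and else reversed relative to each other: the source is allocated with nppiMalloc_8u_C1 when MY_DYN_ALLOC is defined, but the destination is allocated with it when MY_DYN_ALLOC is not defined. As written, that is broken in my view.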

Finally got it working.
Code is posted below.

Thanks Robert.

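#include <cstdio>
#include <iostream>
#include <stdexcept>
#include <string>
#include <cuda_runtime.h>
#include <npp.h>
#include <opencv2/opencv.hpp>
#include <helper_cuda.h>   // from the CUDA samples: checkCudaErrors, gpuGetMaxGflopsDeviceId

inline int findCudaDevice()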
{
    cudaDeviceProp deviceProp;
    int devID = 0;

    // Pick the device with the highest Gflops/s
    devID = gpuGetMaxGflopsDeviceId();
    checkCudaErrors(cudaSetDevice(devID));
    checkCudaErrors(cudaGetDeviceProperties(&deviceProp, devID));
    printf("GPU Device %d: \"%s\" with compute capability %d.%d\n\n", devID, deviceProp.name, deviceProp.major, deviceProp.minor);

    return devID;
}    


inline int cudaDeviceInit()
{
    int deviceCount;
    checkCudaErrors(cudaGetDeviceCount(&deviceCount));

    if (deviceCount == 0)
    {
        std::cerr << "CUDA error: no devices supporting CUDA." << std::endl;
        exit(EXIT_FAILURE);
    }

    int dev = findCudaDevice();

    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, dev);
    std::cerr << "cudaSetDevice GPU" << dev << " = " << deviceProp.name << std::endl;

    checkCudaErrors(cudaSetDevice(dev));

    return dev;
}

int main(int argc, char *argv[])
{

       // Load the input images
        cv::Mat cvImageLeft = cv::imread(strLeftFileName, cv::IMREAD_GRAYSCALE);
        if (cvImageLeft.empty())
        {
            throw std::runtime_error("Can't open '" + strLeftFileName + "'");
        }


        // Initialize the CUDA device
        int devID = cudaDeviceInit();
        if ( devID != 0) 
				throw std::runtime_error("cudaDeviceInit fail ");
		
		cudaError_t cudaRet ;
		 
		int 	nSrcStep;
		
		// need to alloc cuda memory for source
		Npp8u * pSrc = nppiMalloc_8u_C1(cvImageLeft.cols, cvImageLeft.rows, &nSrcStep);
		
		printf("nSrcStep %d \n", nSrcStep);
		
		
		
		// Copy the image from host to GPU. nppiMalloc_8u_C1 pads each row to an aligned pitch (nSrcStep),
		// which may be larger than the image width, so the rows are copied one at a time.
		for(int i=0; i< cvImageLeft.rows ; i++)
		{
			cudaRet = cudaMemcpy(pSrc + i*nSrcStep, cvImageLeft.data + i*cvImageLeft.step, cvImageLeft.cols, cudaMemcpyHostToDevice);
			// Check every row, not only the last one
			if (cudaRet != cudaSuccess)
				throw std::runtime_error("cudaMemcpyHostToDevice fail ");
		}
		

		// Need to define input {width height}
		NppiSize oSrcSize = {cvImageLeft.cols, cvImageLeft.rows};
		
		// Need to define input ROI  {upper left x, upper left y, ROI width, ROI height} 
		NppiRect oSrcRectROI = {0, 0, cvImageLeft.cols, cvImageLeft.rows};

        // output file is scaled in 1/2 in x and y axis
		cv::Mat cvOut(cvImageLeft.rows/2,cvImageLeft.cols/2,cvImageLeft.type());

		int 	nDstStep;
		
		// need to alloc cuda memory for destination
		Npp8u * pDst = nppiMalloc_8u_C1(cvImageLeft.cols/2, cvImageLeft.rows/2, &nDstStep);
		
		printf("nDstStep %d \n", nDstStep);

		// Need to define output {width height}
		NppiSize oDstSize = {cvOut.cols, cvOut.rows};
		
		// Need to define output ROI  {upper left x, upper left y, ROI width, ROI height} 
		NppiRect oDstRectROI = {0, 0, cvOut.cols, cvOut.rows};

		int eInterpolation = NPPI_INTER_LINEAR;
		NppStatus status;
		
		status = nppiResize_8u_C1R(pSrc, nSrcStep, oSrcSize, oSrcRectROI, 
                                   pDst, nDstStep, oDstSize, oDstRectROI,
                                   eInterpolation);

		if(status == NPP_SUCCESS)
		{ 
			
			// Copy the image from GPU to host. The device rows are nDstStep bytes apart,
			// which may be larger than the image width, so the rows are copied one at a time.
			for(int i=0; i< cvOut.rows ; i++)
			{
				cudaRet = cudaMemcpy(cvOut.data + i*cvOut.step, pDst + i*nDstStep, cvOut.cols, cudaMemcpyDeviceToHost);
				// Check every row, not only the last one
				if (cudaRet != cudaSuccess)
					throw std::runtime_error("cudaMemcpyDeviceToHost fail ");
			}
			
			nppiFree(pDst);
			nppiFree(pSrc);
			cv::imwrite("resize.png", cvOut);
		}
		else
			throw std::runtime_error("NPP NOT SUCCESS");
		
		return 0;
}
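
For readers wondering why the copies above go row by row: the buffers returned by nppiMalloc_8u_C1 are pitched, meaning consecutive rows are nSrcStep (or nDstStep) bytes apart, and that step can be larger than the image width. The host-only sketch below imitates that layout with plain memory so the indexing can be checked without a GPU. The function name copyPitched and the pitch of 8 used in the usage note are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copy `rows` rows of `widthBytes` bytes each from a buffer whose rows are
// `srcPitch` bytes apart into a buffer whose rows are `dstPitch` bytes apart.
// The padding bytes at the end of each destination row are left untouched.
void copyPitched(unsigned char *dst, std::size_t dstPitch,
                 const unsigned char *src, std::size_t srcPitch,
                 std::size_t widthBytes, std::size_t rows)
{
    assert(dstPitch >= widthBytes && srcPitch >= widthBytes);
    for (std::size_t r = 0; r < rows; ++r)
        std::memcpy(dst + r * dstPitch, src + r * srcPitch, widthBytes);
}
```

Copying a tightly packed 2x3 image into a buffer with a pitch of 8 puts the second row at offset 8 rather than 3. On the device side, the CUDA runtime offers cudaMemcpy2D(dst, dpitch, src, spitch, width, height, kind), which performs this kind of pitched copy in a single call and could replace each of the loops above.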