Eliminate upload/download for OpenCV cuda::GpuMat using shared memory?

Is there a way to eliminate the upload/download steps to convert a cv::Mat object to a cv::cuda::GpuMat object on the Nano since the GPU and CPU can both access the same memory? I’m looking for a solution to take advantage of the OpenCV cuda functionality, but avoid the penalty of copying from Mat to GpuMat and back again.

// cpu_img has data but requires processing best done on a gpu.
cv::Mat cpu_img(imgsize, CV_8UC4);
cpu_img = somefunc();
cv::Mat new_cpu_img;
cv::cuda::GpuMat gpu_img(imgsize, CV_8UC4);

gpu_img.upload(cpu_img);
// do something interesting to gpu_img involving the gpu...
gpu_img.download(new_cpu_img);

// ... continue on

I appreciate any suggestions!

I’m using a Nano, Jetpack 4.2.2, and OpenCV 3.4.6 built with cuda support.


Yes, you can allocate unified memory for this purpose. It provides both a host address and a device address for the same buffer.
Note, however, that you may not be able to capture frames with OpenCV directly into unified memory without significant patching; for further processing, though, this can be efficient on Jetson.

I did some searching both before and after posting the original question to try to find examples. It seems like other developers may have seen the opportunity to take advantage of the Jetson product line's ability to avoid the upload/download copying. However, I did not find any examples in the Nvidia forums; I found one example here: https://www.ximea.com/support/wiki/apis/Using_OpenCV_with_CUDA_on_the_Jetson_TX2

Here is a simple example with upload/download working properly but copying from host to device and back again. The Jetson Nano (TX2, etc.) should be able to eliminate the upload/download steps:

//
//  SimpleTest.cpp
//  Loads an image, calls a GPU enabled function that uses opencv GPUMat upload/download
//

#include <opencv2/highgui.hpp>
#include <opencv2/imgproc.hpp>
#include <iostream>
#include <stdio.h>

// cuda stuff
#include <opencv2/cudaarithm.hpp>
// Nvidia cuda api
#include <cuda_runtime.h>

using namespace std;

cv::Mat testfunction(cv::Mat& h_original) {
    // receives a CPU/host based image, converts it to GPU/device based image
    // manipulates it, then converts back to CPU/host based result.
    cv::Mat h_result (h_original.size(), h_original.type());
    
    // create GPU/device images, same size and type as original host image
    cv::cuda::GpuMat d_original(h_original.size(), h_original.type());
    cv::cuda::GpuMat d_result(h_original.size(), h_original.type());
    
    // upload the original image from host to device
    d_original.upload(h_original);
    
    // perform a GPU operation of some sort.  Using threshold for simple placeholder
    cv::cuda::threshold(d_original, d_result, 128.0, 255.0, cv::THRESH_BINARY);

    // download the result image from device to host
    d_result.download(h_result);

    return h_result;
}

int main(int argc, char *argv[]) {
    
    cv::namedWindow("original image", cv::WINDOW_AUTOSIZE);
    cv::namedWindow("modified image", cv::WINDOW_AUTOSIZE );
    cv::String filename = "./lena.jpg";
    cv::Mat image, newimage;
    image = cv::imread(filename);
    if (image.empty()) {
        cout << "could not open or find the image" << endl;
        return -1;
    }

    newimage = testfunction(image);
    
    cv::imshow("original image", image);
    cv::imshow("modified image", newimage);
    cv::waitKey(0);
    cv::destroyAllWindows();
    
    return 0;
}

Below is an attempt to implement a similar example to the Ximea link. However, this code produces an error:

terminate called after throwing an instance of 'cv::Exception'
what():  OpenCV(3.4.6) [long path deleted...]/opencv-3.4.6/modules/cudev/include/opencv2/cudev/grid/detail/transform.hpp:315: error: (-217:Gpu API call) unspecified launch failure in function 'call'

Example (not working) code:

//
//  SimpleTestGPU.cpp
//  Loads an image, calls a GPU enabled function that uses opencv GPUMat
//  but eliminates copying from CPU host to GPU host using upload/download
//

#include <opencv2/highgui.hpp>
#include <opencv2/imgproc.hpp>
#include <iostream>
#include <stdio.h> 

// cuda stuff
#include <opencv2/cudaarithm.hpp>
// Nvidia cuda api
#include <cuda_runtime.h>

using namespace std;

cv::Mat testfunction(cv::Mat& h_original) {
    // receives a CPU/host based image, but avoids copying with upload/download
    // for the GPU manipulation.  The result is returned in the CPU/host based h_result.
    cv::Mat h_result (h_original.size(), h_original.type());

    // Define pointers used to create the GpuMats
    void *originalptr;
    void *resultptr;
    
    // Convert CPU/host original images to pointers
    cudaHostGetDevicePointer(&originalptr, h_original.data, 0);
    cudaHostGetDevicePointer(&resultptr, h_result.data, 0);

    // Create GpuMats from the device pointers
    cv::cuda::GpuMat d_original(h_original.size(), h_original.type(), originalptr);
    cv::cuda::GpuMat d_result(h_result.size(), h_result.type(), resultptr);

    // perform a GPU operation of some sort.  Using threshold for simple placeholder
    cv::cuda::threshold(d_original, d_result, 128.0, 255.0, cv::THRESH_BINARY);
    
    // no need to download or copy the result image from device to host - it already
    // resides in h_result.
    return h_result;
}

int main(int argc, char *argv[]) {
    
    cv::namedWindow("original image", cv::WINDOW_AUTOSIZE);
    cv::namedWindow("modified image", cv::WINDOW_AUTOSIZE );
    cv::String filename = "./lena.jpg";
    cv::Mat image, newimage;
    image = cv::imread(filename);
    if (image.empty()) {
        cout << "could not open or find the image" << endl;
        return -1;
    }
    newimage = testfunction(image);
    
    cv::imshow("original image", image);
    cv::imshow("modified image", newimage);
    cv::waitKey(0);
    cv::destroyAllWindows();
    
    return 0;
}

The method used to share image.data does not work, but I'm not sure why, or what it should be. Any insights or suggestions would be greatly appreciated.

Below is an example where frames are read from CSI camera, copied to a Mat with buffer allocated in either pinned memory or unified memory, then processed on GPU (sobel filter), then displayed (if opencv has been built with OPENGL support, it will also display from gpu mat in second window):

#include <iostream>
#include <cuda_runtime.h> 

#include <opencv2/opencv.hpp>
#include <opencv2/videoio.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/cudafilters.hpp> 

//comment this definition for using pinned memory instead of unified memory
#define USE_UNIFIED_MEM   

int main()
{
    //std::cout << cv::getBuildInformation() << std::endl;

    const char* gst = "nvarguscamerasrc ! video/x-raw(memory:NVMM), format=(string)NV12, width=(int)640, height=(int)480, framerate=(fraction)30/1 ! \
                       nvvidconv ! video/x-raw, format=(string)BGRx, framerate=(fraction)30/1 ! \
                       videoconvert ! queue ! video/x-raw, format=(string)BGR, framerate=(fraction)30/1 ! \
                       appsink";

    cv::VideoCapture cap(gst, cv::CAP_GSTREAMER);
    if (!cap.isOpened()) {
        std::cout << "Failed to open camera." << std::endl;
        return (-1);
    }
    
    unsigned int width  = cap.get(cv::CAP_PROP_FRAME_WIDTH); 
    unsigned int height = cap.get(cv::CAP_PROP_FRAME_HEIGHT); 
    unsigned int fps    = cap.get(cv::CAP_PROP_FPS);
    unsigned int pixels = width*height;
    std::cout <<"Frame size : "<<width<<" x "<<height<<", "<<pixels<<" Pixels "<<fps<<" FPS"<<std::endl;

    cv::namedWindow("frame_out", cv::WINDOW_AUTOSIZE);
    bool hasOpenGlSupport = true;
    try {
        cv::namedWindow("d_frame_out", cv::WINDOW_AUTOSIZE | cv::WINDOW_OPENGL);
    }
    catch (cv::Exception& e) {
        hasOpenGlSupport = false;
    }

    unsigned int frameByteSize = pixels * 3; 

#ifndef USE_UNIFIED_MEM
    /* Pinned memory. No cache */
    std::cout << "Using pinned memory" << std::endl;
    void *device_ptr, *host_ptr;
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **)&host_ptr, frameByteSize, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&device_ptr, (void *) host_ptr , 0);
    cv::Mat frame_out(height, width, CV_8UC3, host_ptr);
    cv::cuda::GpuMat d_frame_out(height, width, CV_8UC3, device_ptr);
#else
    /* Unified memory */
    std::cout << "Using unified memory" << std::endl;
    void *unified_ptr;
    cudaMallocManaged(&unified_ptr, frameByteSize);
    cv::Mat frame_out(height, width, CV_8UC3, unified_ptr);
    cv::cuda::GpuMat d_frame_out(height, width, CV_8UC3, unified_ptr);
#endif

    cv::Ptr< cv::cuda::Filter > filter = cv::cuda::createSobelFilter(CV_8UC3, CV_8UC3, 1, 1, 1, 1, cv::BORDER_DEFAULT);
    cv::Mat frame_in;

    while (1)
    {
        if (!cap.read(frame_in)) {
            std::cout << "Capture read error" << std::endl;
            break;
        }
        else {
            frame_in.copyTo(frame_out);
            // no need to copy to device
            filter->apply(d_frame_out, d_frame_out);
            if (hasOpenGlSupport)
                cv::imshow("d_frame_out", d_frame_out);
            // no need to copy back to host
            cv::imshow("frame_out", frame_out);
            cv::waitKey(1);
        }
    }

    cap.release();

    return 0;
}

Honey, that is a great example. Thank you!

I updated my code using your example to help sort out my logic (posted here in case it helps someone else as a second example):

//
//  SimpleTestGPU.cpp
//  Loads an image, calls a GPU enabled function that uses opencv GPUMat.
//  Eliminates copying from GPU device to CPU host using download for the result, but
//  still requires copying from a pre-existing image (not in a unified memory data
//  structure) using the opencv upload from cv::Mat to cv::cuda::GpuMat.
//

#include <opencv2/highgui.hpp>
#include <opencv2/imgproc.hpp>
#include <iostream>
#include <stdio.h> 

// cuda stuff
#include <opencv2/cudaarithm.hpp>
// Nvidia cuda api
#include <cuda_runtime.h>

using namespace std;

cv::Mat testfunction(cv::Mat& h_original) {
    // receives a CPU/host based image.  It must use upload to get image data
    // that is already in host memory into device memory.  From there, it uses
    // unified memory to avoid the download back to host memory.
    // The result is returned in the CPU/host based h_result with no explicit copy
    // required from d_result because they are in unified memory.
    
    unsigned int width  = h_original.size().width;
    unsigned int height = h_original.size().height;
    unsigned int channels = h_original.channels();
    unsigned int pixels = width*height;
    unsigned int frameByteSize = pixels * channels;
    std::cout <<"Frame size : "<<width<<" x "<<height<<", "<<pixels<<" Pixels "<<channels<<" channels"<<std::endl;
    
    // create a device original image data structure and upload (copy) the original image from host to device space
    // the upload (copy) does not look like it can be avoided for an array that already resides in host memory --
    // in order to avoid the upload/copy, whatever generated the data would need to place the array in a unified
    // memory data structure that was created in advance.  There does not appear to be a way to retroactively tag
    // data residing in host memory as a unified data structure.
    cv::cuda::GpuMat d_original(height, width, h_original.type());
    d_original.upload(h_original);

    // Define pointer used to create result GpuMat
    void *resultptr;
    // allocate unified memory space for the result image and assign it to resultptr
    // (note: nothing frees this buffer; real code would need a matching cudaFree
    // once h_result is no longer used, or it leaks on every call)
    cudaMallocManaged(&resultptr, frameByteSize);
    // create the host data structure reference to the result image
    cv::Mat h_result(height, width, h_original.type(), resultptr);
    // create the device data structure reference to the result image
    cv::cuda::GpuMat d_result(height, width, h_original.type(), resultptr);
    
    // perform a GPU operation of some sort.  Using threshold for simple placeholder
    cv::cuda::threshold(d_original, d_result, 128.0, 255.0, cv::THRESH_BINARY);
    
    // no need to download or copy the result image from device to host - it already
    // resides in h_result.
    return h_result;
}

int main(int argc, char *argv[]) {
    
    cv::namedWindow("original image", cv::WINDOW_AUTOSIZE);
    cv::namedWindow("modified image", cv::WINDOW_AUTOSIZE );
    cv::String filename = "./lena.jpg";
    cv::Mat image, newimage;
    image = cv::imread(filename);
    if (image.empty()) {
        cout << "could not open or find the image" << endl;
        return -1;
    }
    newimage = testfunction(image);
    
    cv::imshow("original image", image);
    cv::imshow("modified image", newimage);
    cv::waitKey(0);
    cv::destroyAllWindows();
    
    return 0;
}

I also did some further research and would like to confirm my assumptions:

  1. if a Mat (or some sort of data array) already exists in host memory but not in an explicit unified memory data structure (i.e. h_original in testfunction), I assume there is not a method to migrate it to unified memory without doing a copy of some sort?
  2. expanding on my assumption above: I believe this is because host memory is a superset of physical memory (swap files, etc.) and the data may not necessarily reside in a region that can be retroactively declared as unified memory?
  3. another line of thought: if a pointer to an array is passed to a function (i.e. h_original in testfunction), is there a way to determine whether the data resides in unified memory (or device memory)? My logic is that if this is possible, I might be able to create a more generic function and only do the copy when needed. (I know testfunction presently receives only cv::Mat (host) references as written today, but I'm interested in defining a more generic function that is simply passed a pointer to data.)

I appreciate the help and time to generate the example you posted! You helped me eliminate at least 50% of the performance bottleneck I was after – thank you!!

I don’t know for sure, but my understanding so far is that you would have to allocate with a special allocator for sharing memory, so it should be done in advance.
I don’t know of any way to test whether a buffer has been allocated in unified memory. There may be ways I’m not aware of, but I think the probability that your code encounters an unexpected unified-memory buffer is very low, so is it worth the effort?

I’d also suggest the following structure for your example:

//
//  SimpleTestGPU.cpp
//  Loads an image, calls a GPU enabled function that uses opencv GPUMat.
//  Eliminates copying from GPU device to CPU host using download for the result, but
//  still requires copying from a pre-existing image 
//

#include <opencv2/highgui.hpp>
#include <opencv2/imgproc.hpp>
#include <iostream>
#include <stdio.h> 

// cuda stuff
#include <opencv2/cudaarithm.hpp>
// Nvidia cuda api
#include <cuda_runtime.h>

using namespace std;

void OpencvCudaProcess(const cv::cuda::GpuMat& d_original, cv::cuda::GpuMat& d_result) {
	// perform a GPU operation of some sort.  Using threshold for simple placeholder
	cv::cuda::threshold(d_original, d_result, 128.0, 255.0, cv::THRESH_BINARY);
}

int main(int argc, char *argv[]) {

	cv::namedWindow("original image", cv::WINDOW_AUTOSIZE);
	cv::namedWindow("modified image", cv::WINDOW_AUTOSIZE );
	cv::String filename = "/usr/share/visionworks/sources/data/lena.jpg";

	// Max size buffer. Be aware of your max number of channels and format size (UC, F32, F64) when computing the max byte size
	const unsigned int maxImageByteSize = 1920*1080*3;

	void *image_in, *image_out;
	// allocate unified memory space for INPUT image
	if (cudaSuccess != cudaMallocManaged(&image_in, maxImageByteSize))
		return (-1);
	// allocate unified memory space for OUTPUT image
	if (cudaSuccess != cudaMallocManaged(&image_out, maxImageByteSize))
		return (-2);

	/* Process image; this block may be in a loop */
	{
		cv::Mat read_img = cv::imread(filename);
		if (read_img.empty()) {
		    cout << "could not open or find the image" << endl;
		    return (-3);
		}

		/* Prepare appropriate Mats (headers for size and format, specifying our unified memory buffers) */
 		cv::Mat image(read_img.rows, read_img.cols, read_img.type(), image_in);
		cv::Mat newimage(read_img.rows, read_img.cols, read_img.type(), image_out);

		/* Prepare appropriate GpuMats (headers for size and format, specifying our unified memory buffers) */
		cv::cuda::GpuMat d_image(read_img.rows, read_img.cols, read_img.type(), image_in);
		cv::cuda::GpuMat d_newimage(read_img.rows, read_img.cols, read_img.type(), image_out);

// Both options should work, you would check timings for your case
//#define COPY_TO_CPU
#ifdef COPY_TO_CPU
		read_img.copyTo(image);
		if (image.data != image_in) {
			std::cerr << "Error: image data buffer changed from initial unified memory" << std::endl;
			return (-4);
		}
#else /* COPY_TO_GPU */
		d_image.upload(read_img);
		if (d_image.data != image_in) {
			std::cerr << "Error: d_image data buffer changed from initial unified memory" << std::endl;
			return (-5);
		}	
#endif

		/* Process image on GPU */
		OpencvCudaProcess(d_image, d_newimage);

		cv::imshow("original image", image);

		/* Display result from CPU Mat without copy */
		cv::imshow("modified image", newimage);

		/* Wait for key press */
		cv::waitKey(-1);
	}

	cudaFree(image_in);
	cudaFree(image_out);
	cv::destroyAllWindows();

	return 0;
}

I might be misunderstanding the last post. The problem I’m attempting to solve is that I’m receiving a pointer to image data from a 3rd party library. The 3rd party library is a black box – I don’t have any control over where it put the image in memory.

I am in the process of making several image processing steps that might be arranged in a different sequence depending on the problem at hand. I was hoping for a creative way to save the last upload out of the process. Your initial solution cleared the biggest hurdle of how to properly setup unified memory. I can make a generic shim that goes between any custom filters and the 3rd party library that simply moves the data one time into unified memory and I can control it from there.

Thank you very much for the help. I hope this is useful for others too!

Hello,

I am also interested in removing the upload/download by using the unified memory of the Jetson Nano. In my case, I get an image stored on the GPU as part of a GStreamer pipeline, which I want to access from the CPU. In order to do that, I do the following:

void* unifiedPtrMat;
uint bytesMat = dest_height * dest_width * 4;
cudaMallocManaged(&unifiedPtrMat, bytesMat);

cv::cuda::GpuMat auxMat(dest_height, dest_width,
                        CV_8UC4, eglFrame.frame.pPitch[0]);
cv::cuda::GpuMat d_mat(dest_height, dest_width,
                       CV_8UC4, unifiedPtrMat);
cv::Mat h_mat(dest_height, dest_width,
              CV_8UC4, unifiedPtrMat);

auxMat.copyTo(d_mat);

// some Code ......

cudaFree(unifiedPtrMat);

The program works for several frames (a different number each time) until I get: Bus error (core dumped).

EDIT: I did more research into the problem, and it comes from copying h_mat to another Mat variable, using either

h_mat.copyTo(<Mat>)

or

<Mat> = h_mat.clone()

Any idea on how to solve this?

Solved: I solved the core dumped error by synchronizing the host and the device with this function before //some code …:

cudaDeviceSynchronize()

However, the code runs about 50% slower when the image is copied into the unified buffer with copyTo than when it is not copied, even though the frame accessed from the host is only used to save it. The imwrite call was commented out when making the comparison.

Thank you.

What compiler flags did you guys use to compile the cpp code? I can’t seem to compile cpp code with

#include <cuda_runtime.h>

on the Jetson Nano.

Hello Honey_Patouceul,
On a TX2 with JetPack 4.3, OpenCV 3.9.11, and CUDA 10.0, I tried to compile your code with this CMakeLists.txt:


cmake_minimum_required(VERSION 2.8)
project( umem_lena )
find_package( OpenCV REQUIRED )
find_package( CUDA REQUIRED )

include_directories( ${OpenCV_INCLUDE_DIRS} ${CUDA_INCLUDE_DIRS} )
add_executable( tst_umem_0 tst_umem_0.cpp )
target_link_libraries( tst_umem_0 ${CUDA_LIBRARIES} ${OpenCV_LIBS} )

It works …

Laurent.

This does not work for me, I still get a

fatal error: cuda_runtime.h: No such file or directory
#include <cuda_runtime.h>

Do anyone have a suggestion?

Not sure about what doesn’t work. I haven’t tested with cmake, but it should work with OpenCV 4 and the following compile script:
(1 - Adjust OPENCV_DIR to where your OpenCV 4 version is installed.
2 - Not sure all the OpenCV libs here are required, but unused ones will be dropped by the linker.)

export OPENCV_DIR=/usr/local
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$OPENCV_DIR/lib
g++ -std=c++11 -Wall -I/$OPENCV_DIR/include/opencv4 -I/usr/local/cuda/targets/aarch64-linux/include  SimpleTestGPU.cpp -L$OPENCV_DIR/lib -lopencv_core -lopencv_video -lopencv_videoio -lopencv_highgui -lopencv_imgcodecs -lopencv_cudaarithm -lopencv_cudawarping -L/usr/local/cuda/targets/aarch64-linux/lib -lcudart -o SimpleTestGPU

Thanks,

@laurent.delaon’s cmake worked fine after all, I just missed to add it to a dependent project as well.

Happy for you! (These kinds of problems are not always obvious…)