Handing off a cudaImage object to an OpenCV CUDA function? (expects cv::Mat)

I’m creating a computer vision application of sorts and for one of the features (besides object recognition which functions perfectly) I need to do motion detection. I use OpenCV for this.

I have it functioning perfectly if I use jetson.utils (python, by the way) to crop the image and then hand it off via jetson.utils.cudaToNumpy() to create the numpy array OpenCV needs.

But the OpenCV functions I use next (equalizeHist, absdiff, threshold, dilate, etc.) all run on the CPU, and I think that’s a waste of performance. These functions are all available on GPU/CUDA as well, but they require a cv::Mat object instead of a NumPy array. Without all sorts of matrix conversions and copies between GPU and CPU RAM, is there a way to hand off a pointer to the cudaImage object as a cv::Mat? Or is there some conversion available that does the work on the GPU?

I can’t find anything in the documentation about this, nor on the forum. I must be missing something, because I can’t believe I’m the first person doing this… ;)

Hi @willemvdkletersteeg, do you mean using the C++ interface of OpenCV with cv::Mat? I have only seen numpy arrays used with cv2 from Python. If there is a cv::Mat in Python, I bet there is also a way to create it from the numpy array.

If you are using cv::Mat in C++, you should be able to create it from the CUDA pointer like so:

cv::Mat cv_image(cv::Size(imgWidth, imgHeight), CV_8UC3, imgCUDA);

Since images are allocated in jetson-inference in shared CPU/GPU memory, you can use the CUDA pointer directly on the CPU.
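
The Python analogue of that one-liner is not shown in this thread, but the underlying idea (wrapping memory that already exists instead of copying it) can be sketched with the standard library alone. In this illustration a plain ctypes buffer stands in for the mapped CUDA allocation, and the tiny image size is hypothetical; it is not how jetson-utils itself is implemented.

```python
import ctypes

WIDTH, HEIGHT, CHANNELS = 4, 2, 3  # hypothetical tiny BGR image
NUM_BYTES = WIDTH * HEIGHT * CHANNELS

# Stand-in for the mapped allocation (assumption: the real memory
# would come from jetson-utils, not from ctypes).
buffer = (ctypes.c_uint8 * NUM_BYTES)()
address = ctypes.addressof(buffer)

# Wrap the raw address a second time without copying, similar in
# spirit to how cv::Mat wraps the imgCUDA pointer above.
view = (ctypes.c_uint8 * NUM_BYTES).from_address(address)

view[0] = 255      # write through the wrapper...
print(buffer[0])   # 255: ...and the original buffer sees the change
```

Because both objects refer to the same bytes, whoever owns the original allocation must keep it alive for as long as the wrapper is in use.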

If you want to investigate using cv::GpuMat with jetson-utils, you may check this topic.
Note that there is no automatic translation to CUDA processing on the GPU; you would have to rewrite your processing with the OpenCV CUDA functions.
Also note that the CUDA backend only provides a subset of the OpenCV CPU functions. Where a function is available, the API may differ, and where the API is similar, the results may still differ. But if you get it working, it may be faster.
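
As a concrete illustration of how the CUDA API can differ, here is a rough, untested sketch of the four calls named in the question. It assumes an OpenCV build with the CUDA modules enabled and inputs that are already single-channel 8-bit cv2.cuda_GpuMat objects of the same size. The function names make_dilate_filter and motion_mask_gpu are made up for this example. As far as I know, the main difference from the CPU code is that dilation goes through a filter object rather than a direct dilate() call; please check the signatures against your installed OpenCV version.

```python
try:
    import cv2  # assumption: built with the CUDA modules (cv2.cuda)
except ImportError:
    cv2 = None  # lets the file load on machines without OpenCV


def make_dilate_filter(iterations=2):
    """Build the CUDA dilation filter once, outside the frame loop."""
    # 3x3 rectangle, which is what cv2.dilate(img, None) uses by default.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    return cv2.cuda.createMorphologyFilter(
        cv2.MORPH_DILATE, cv2.CV_8UC1, kernel, (-1, -1), iterations)


def motion_mask_gpu(prev_gray, cur_gray, dilate_filter, thresh_value=128):
    """Return (equalized current frame, dilated motion mask) as GpuMats."""
    equalized = cv2.cuda.equalizeHist(cur_gray)
    delta = cv2.cuda.absdiff(prev_gray, equalized)
    # threshold returns a (retval, image) pair, like the CPU function.
    _, mask = cv2.cuda.threshold(delta, thresh_value, 255, cv2.THRESH_BINARY)
    mask = dilate_filter.apply(mask)
    return equalized, mask
```

Returning the equalized frame lets the caller keep it as the previous frame for the next iteration.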

Thank you both for your response. Much appreciated. As said, I work in Python. But I may not have been clear what I’m trying to do. Excuse me. This is the code that currently works like a charm: (as in: it produces the results I want)

subFrame = jetson.utils.cudaAllocMapped(width=area.width, height=area.height, format=frame.format)
jetson.utils.cudaCrop(frame, subFrame, area.roi)

# First convert the frame (in GPU memory) to something OpenCV can use
cvFrame = jetson.utils.cudaAllocMapped(width=frame.width, height=frame.height, format="bgr8")

# TODO: convert to grayscale with CUDA/in GPU MEM?
jetson.utils.cudaConvertColor(frame, cvFrame)

# make sure the GPU is done working before we convert to cv2
jetson.utils.cudaDeviceSynchronize()

# convert to cv2 image (cv2 images are numpy arrays)
cvFrame = jetson.utils.cudaToNumpy(cvFrame)

# Convert to grayscale - TODO: do this sooner in the process
cvFrame = cv2.cvtColor(cvFrame, cv2.COLOR_BGR2GRAY)
cvFrame = cv2.equalizeHist(cvFrame)

frameDelta = cv2.absdiff(area.previous_frame, cvFrame)
# cv2.threshold returns (retval, image); keep only the image
_, thresh = cv2.threshold(frameDelta, 128, 255, cv2.THRESH_BINARY)
thresh = cv2.dilate(thresh, None, iterations=2)

But I have to run this every single frame (albeit, the regions that are cropped to are quite small) so I would like to optimize this and run everything in/on GPU. The OpenCV functions that I use are - as far as I know - all available in CUDA. So I wanted to change it to:

subFrame = jetson.utils.cudaAllocMapped(width=area.width, height=area.height, format=frame.format)
jetson.utils.cudaCrop(frame, subFrame, area.roi)

# First convert the frame (in GPU memory) to something OpenCV can use
cvFrame = jetson.utils.cudaAllocMapped(width=frame.width, height=frame.height, format="gray8")

# TODO: convert to grayscale with CUDA/in GPU MEM?
jetson.utils.cudaConvertColor(frame, cvFrame)

# make sure the GPU is done working before we convert to cv2
jetson.utils.cudaDeviceSynchronize()

# This doesn't work:
# cvFrame = jetson.utils.cudaToNumpy(cvFrame)

cvFrame = cv2.cuda.equalizeHist(cvFrame)

# Detect areas with motion
frameDelta = cv2.cuda.absdiff(area.previous_frame, cvFrame)
thresh = cv2.cuda.threshold(frameDelta, 128, 255, cv2.THRESH_BINARY)
thresh = cv2.cuda.dilate(thresh, None, iterations=2)

But this doesn’t work, because the cudaImage object that cudaConvertColor() produces can’t be given to the cv2 functions. Converting to a NumPy array doesn’t work either. The cv2.cuda.* functions expect a GpuMat object as input. How do I go about this efficiently?

OK, gotcha. I haven’t used the Python API for OpenCV’s CUDA functions before (cv2.cuda), but first try this:

gpu_frame = cv2.cuda_GpuMat()
gpu_frame.upload(numpy_array)    # numpy_array is from cudaToNumpy()

Ideally you could use this constructor for GpuMat instead, which takes a user pointer and in theory would avoid the upload. However, I can’t find a reference to this being done from Python, since OpenCV’s Python documentation is practically non-existent.

My cudaImage object has a .ptr member with the CUDA memory address, which you could pass in if you are able to use the above constructor from Python. Then you could skip the whole numpy part.

Also, if you are running your code above in a loop (i.e. processing a video stream), you will not want to allocate the data each frame - instead allocate it beforehand, or allocate it on the first iteration of the loop.
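
To illustrate the allocate-on-first-iteration idea, here is a small sketch. The BufferCache class and the fake_alloc stand-in are hypothetical names for this example; on a Jetson you would pass jetson.utils.cudaAllocMapped as the allocator instead.

```python
class BufferCache:
    """Allocate each named buffer once and reuse it on later frames."""

    def __init__(self, allocate):
        self._allocate = allocate  # e.g. jetson.utils.cudaAllocMapped
        self._buffers = {}

    def get(self, name, **spec):
        # The spec is part of the key, so a size change allocates anew.
        key = (name, tuple(sorted(spec.items())))
        if key not in self._buffers:
            self._buffers[key] = self._allocate(**spec)
        return self._buffers[key]


# Stand-in allocator so the sketch runs without a Jetson.
calls = []

def fake_alloc(width, height, format):
    calls.append((width, height, format))
    return bytearray(width * height)


cache = BufferCache(fake_alloc)
for _ in range(3):  # three "frames"
    gray = cache.get("gray", width=64, height=48, format="gray8")

print(len(calls))  # 1: the buffer was allocated on the first frame only
```

Allocating everything in a constructor before the loop starts works just as well; the cache is only convenient when the sizes are not known until the first frame arrives.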

Thanks! That actually works. What I do now is:

self.cv_frame = cv2.cuda_GpuMat(self.height, self.width, cv2.CV_8UC1)  # (rows, cols, type); CV_8UC1 == 0

in the object’s constructor and then, in the loop, I run:

area.cv_frame.upload(jetson.utils.cudaToNumpy(area.bgr_sub_frame))

Which is probably not ideal, because it technically downloads to host memory and then re-uploads, but I’m hoping this is still quite fast because the memory is mapped (zero-copy)?

Anyhow, I can’t find any way to supply the GpuMat() constructor with a data pointer in Python; afaik only the C++ implementation accepts such a pointer…

Thank you for the extra tip regarding the allocations, you are totally right! I allocate it beforehand now, and only do:

jetson.utils.cudaCrop(frame, area.sub_frame, area.roi)
jetson.utils.cudaConvertColor(area.sub_frame, area.bgr_sub_frame)
jetson.utils.cudaDeviceSynchronize()

in the processing loop before going to OpenCV. I hope I’m implementing this the most efficient way. Anyhow: it works! Thanks!