Python: cudaImage <-> OpenCV conversions very slow

I’m working on porting a project from a different platform (Raspberry Pi + Intel OpenVINO) to a Jetson Nano. The gist of the project is that I need to read an image from a camera or video, do some pre-processing on it, and then run the image through one or more neural networks for things like face detection, pose detection, etc. The pre-processing consists of taking the original image, rotating it by several different angles, and building a composite image of all of the rotations, which is then passed to the neural network(s).

I’m new to the NVIDIA/Jetson API, so I’ve started with the dusty-nv Hello-AI samples (posenet.py, for example). The image captured from the input is in cudaImage format, and working with that is much faster than a standard OpenCV image on the Pi (obviously due to GPU acceleration), but I haven’t been able to find documentation on doing a lot of the operations I need on this format. For example, I have not found any Python example that shows how to do a rotate operation on a cudaImage. To get around this I converted the cudaImage to an OpenCV image (by converting it to a numpy array and adjusting BGR->RGB), but when profiling my code it turns out that converting a cudaImage to an OpenCV image is fairly slow (20-30 ms), and converting the OpenCV image back to a cudaImage for processing by the posenet class is even slower (40-60 ms). That conversion time ends up negating any performance benefit that the GPU acceleration of the Jetson provides.

Is there a better way to do this? Is there some document that explains how to manipulate the cudaImage class beyond just resizing and such? Should I not be using the Hello-AI classes, and is there a guide for porting my non-NVIDIA codebase?

Sorry for the likely dumb questions, but I’m completely new to the NVIDIA ecosystem and it’s a little overwhelming to just jump right in.

Hi @mjasner, cudaToNumpy() can be a persistent mapping, so you only need to call it once per cudaImage at the initialization of the program. It shares the memory with the numpy array as opposed to copying it, so any changes you make to it from numpy/OpenCV will be reflected in CUDA, etc.
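For example, something like this (just a rough sketch with placeholder sizes and a placeholder drawing call):

import cv2
from jetson.utils import cudaAllocMapped, cudaToNumpy, cudaDeviceSynchronize

# allocate the cudaImage once, at startup
cuda_img = cudaAllocMapped(width=1280, height=720, format="rgb8")

# map it to numpy once -- this is a view of the same memory, not a copy
np_img = cudaToNumpy(cuda_img)

# later, in the frame loop: edits made through numpy/OpenCV show up in cuda_img
cv2.line(np_img, (5, 5), (250, 250), (255, 0, 255), 5)
cudaDeviceSynchronize()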

Also, if you are on the master branch, I’ve implemented the numpy __array__ interface for cudaImage (in addition to the numba __cuda_array_interface__ and pycuda gpudata interface), so technically cudaToNumpy() shouldn’t be needed anymore, but YMMV. You can see a test of that in cuda-array-interface.py where it uses cudaImage directly with those libraries without intermediate conversions.

Also, since you are just doing rotations, you may be able to skip the RGB<->BGR conversions for OpenCV. If you can do the rotations in-place on a cudaImage that’s already been mapped to numpy, then you shouldn’t need to convert it back to CUDA with cudaFromNumpy() either.

I do have CUDA functions that can be used for matrix warps (and hence rotations), but there aren’t Python bindings for them yet. https://github.com/dusty-nv/jetson-utils/blob/8b6c5ca4b2b52d51f5415ce033ee52563ddf3f2c/cuda/cudaWarp.h

Thank you for the quick reply! I’m not sure I understand what you mean by a persistent mapping that only needs to be called once. If I read a new frame every second (or faster; I’m just using 1 FPS as a simple example), wouldn’t I need to call it on every frame? For example, if my program flow is something like the following pseudocode:

Set up videoSource
while (runFlag is True):
    Get next frame
    Process frame
    Do inference
    Handle results

Wouldn’t I need to call cudaToNumpy() every time I capture a new frame? How would the mapping to numpy remain from a previous frame and be able to affect the current frame?

Also, when I do the rotation I’m then compositing the rotation(s) onto a new, larger image, so that new, larger image would still need to be converted from numpy back to cuda since it didn’t previously exist as a cuda image to begin with, correct?

I guess life would be easier if I could figure out how to do a rotation of the original cudaImage so I could cut out OpenCV entirely, because moving back and forth between the two is where I’m losing any gains from using CUDA at all. It sounds like your warp functions may be what I’m looking for. Perhaps I should look into migrating from Python to C/C++. There really isn’t a reason I’m using one over the other, other than that when I started the original version on other hardware the Python samples were better documented. Are there examples of using your warp functions in C/C++ that I could look at?

Thanks again! The help is greatly appreciated

These frames captured from videoSource are in a ring buffer, so the buffers get re-used every N frames, but yes, that aside, I’d assumed you had some intermediary image that was statically allocated. Also, you could just try skipping cudaToNumpy() altogether and make use of the __array__ interface that’s been implemented.

I would recommend statically allocating the larger output image so that you aren’t re-allocating it each frame. Try to re-allocate only when necessary (i.e. only when image sizes change)
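For example (a rough sketch; the names are just placeholders):

import jetson.utils

comp = None

def get_composite(width, height):
    """Re-allocate the composite image only when the required size changes."""
    global comp
    if comp is None or comp.width != width or comp.height != height:
        comp = jetson.utils.cudaAllocMapped(width=width, height=height, format="rgb8")
    return comp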

Here is a snippet that warps an image by a 3x3 homography matrix: https://github.com/dusty-nv/jetson-inference/blob/552f7059b587d0477edd678c79e1566ab201297d/examples/experimental/featurenet/featurenet-images/featurenet-images.cpp#L231
You can see above in that code where I am testing it by manually constructing a 3x3 matrix composed of rotations, translations, etc.
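The 3x3 matrix itself is just a 2D rotation and translation embedded in a homography, e.g. (a numpy sketch, independent of any particular warp API):

import numpy as np

angle = np.radians(30.0)    # rotation angle
tx, ty = 100.0, 50.0        # translation in pixels

# rotation in the upper-left 2x2 block, translation in the last column
H = np.array([[np.cos(angle), -np.sin(angle), tx],
              [np.sin(angle),  np.cos(angle), ty],
              [0.0,            0.0,           1.0]])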

An alternative would be to just write a Python binding for the cudaWarp() function that you need, similar to what is done here: https://github.com/dusty-nv/jetson-utils/blob/8b6c5ca4b2b52d51f5415ce033ee52563ddf3f2c/python/bindings/PyCUDA.cpp#L1273

Ah, so I could just pre-allocate an image for storing the input frame and a larger image for the composite frame, call cudaToNumpy on them once, and then they’re persistently mapped and I can just overwrite them without needing to call cudaToNumpy again? That’s useful. I’ll give that a try.

Thanks again!

Sorry, one last, possibly dumb question. If the return from videoSource is from a ring buffer then what I should be doing is fetching a frame and then copying that frame into my pre-allocated/pre-mapped buffer with cudaMemcpy. Is that correct?

Yes, if you can pre-allocate the images then you can do that.

You could do that (you may need to call cudaDeviceSynchronize() after the cudaMemcpy()), or you could just utilize the __array__ interface of the cudaImage and skip cudaToNumpy() altogether.
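i.e. roughly this, per frame (a sketch; img_buf is your pre-allocated/pre-mapped cudaImage and frame is whatever Capture() returned):

import jetson.utils

jetson.utils.cudaMemcpy(img_buf, frame)    # copy the ring-buffer frame into the static buffer
jetson.utils.cudaDeviceSynchronize()       # make sure the copy is done before touching the mapped numpy array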

I’m not clear on how to use the __array__ interface. Regardless I’ve been able to write a demo of what you suggested, where I pre-allocate memory with cudaAllocMapped() and map it to numpy with cudaToNumpy(). Then I enter a loop where I read an image from gstCamera() and use cudaMemcpy() to copy the captured frame into the pre-allocated image. Then I can manipulate the mapped numpy array via OpenCV and I see the changes in the cudaImage when I display that with glDisplay. That makes sense.

What doesn’t make sense is going the other way. If I pre-allocate an output image the same way and map it to a numpy variable in the same way, when I then try to rotate the captured image (as above) and copy that into the output image’s mapped numpy array, I don’t see any changes in the output image at all.

For reference, here is the python code I’m using:

import cv2
import jetson.utils
from jetson.utils import glDisplay, gstCamera, cudaToNumpy, cudaDeviceSynchronize

# create display window
display = glDisplay()

# create camera device (opt holds command-line arguments parsed elsewhere)
camera = gstCamera(opt.width, opt.height, opt.camera)

# open the camera for streaming
camera.Open()

# pre-allocate the input image and map it to numpy once
img_b = jetson.utils.cudaAllocMapped(width=1280, height=720, format="rgb8")
cvImg = cudaToNumpy(img_b)

# pre-allocate the larger composite image and map it to numpy once
comp = jetson.utils.cudaAllocMapped(width=2000, height=720, format="rgb8")
outImg = cudaToNumpy(comp)

# capture frames until the user exits
while display.IsOpen():
    image, width, height = camera.Capture(format="rgb8")

    # copy the captured frame into the pre-allocated input image
    jetson.utils.cudaMemcpy(img_b, image)

    # draw into the mapped numpy array (cv2.line modifies outImg in place)
    outImg = cv2.line(outImg, (5, 5), (250, 250), (255, 0, 255), 5)
    cudaDeviceSynchronize()

    display.RenderOnce(comp, outImg.shape[1], outImg.shape[0])

# close the camera
camera.Close()

This works fine. But if I then try to do something more complex, like taking the original input image, rotating it 90 degrees, and copying it into outImg (the numpy array mapped from the comp cudaImage), then I just get a black screen.

Additionally, I still have to upload/download the numpy array to/from the GPU in OpenCV frequently, which seems wasteful. Do you think I should just focus instead on learning to rotate the cudaImage directly, instead of pushing it to OpenCV to do the rotation and back? This is getting more convoluted than I think it needs to be, huh?

Thanks again

The __array__ interface means that you should be able to use a cudaImage just like you would a numpy array, because it implements that interface.
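For example, np.asarray(img_b) should give you the same kind of view that cudaToNumpy(img_b) does. And one possible cause of the black screen (just a guess from your snippet): cv2.rotate() returns a new numpy array, so assigning its result to outImg re-binds the Python name instead of writing into the memory shared with comp. Writing into a slice of the mapped array keeps the mapping, e.g. (a sketch using the names from your code; note that a 90° rotation swaps width and height, so the destination region has to be sized for that):

import numpy as np

arr = np.asarray(img_b)                    # wraps the cudaImage via __array__, no cudaToNumpy() needed

rotated = cv2.rotate(arr, cv2.ROTATE_180)  # cv2.rotate allocates a new array
h, w = rotated.shape[:2]
outImg[:h, :w] = rotated                   # in-place write into the memory shared with comp
cudaDeviceSynchronize()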

Yes, I would give it a try just making a Python binding for cudaWarpPerspective() or cudaWarpAffine(). Sorry I am catching up after the holidays, but I will add it to my TODO list.

No apologies necessary at all! You’ve been massively helpful! I’m going to spend some time reading up on the cudaWarpAffine() function since I’ve seen C/C++ examples of that in action and either try and make a python binding for it or just start porting my python code to C/C++, which is something I would ultimately like to do anyway.

Again, thanks very much for the help! It is greatly appreciated!

I’m not sure about your case, but you might check whether cv::cuda::rotate can do the rotation in place at the same resolution. If not, another GpuMat may be allocated; in that case, you could try cropping/rescaling the final rotation into the original buffer.
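For example, something along these lines from Python (a sketch; it assumes your OpenCV build includes the CUDA modules, which the stock Jetson packages may not, and the shift values will likely need tuning to keep the rotated image in frame):

import cv2

gpu_src = cv2.cuda_GpuMat()
gpu_src.upload(cvImg)                          # cvImg is a host-side numpy image

h, w = cvImg.shape[:2]
gpu_dst = cv2.cuda_GpuMat(h, w, cv2.CV_8UC3)   # pre-allocate the destination once and re-use it

# rotate on the GPU into the pre-allocated destination
cv2.cuda.rotate(gpu_src, (w, h), 45, dst=gpu_dst, xShift=0, yShift=0,
                interpolation=cv2.INTER_LINEAR)

result = gpu_dst.download()                    # only download if you need it back on the CPU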
