Python: cudaImage <-> OpenCV conversions very slow

I’m working on porting a project from a different platform (Raspberry Pi + Intel OpenVINO) to a Jetson Nano platform. The gist of the project is I need to read an image from a camera or video, do some pre-processing on it, and then run the image through one or more neural networks for things like face detection, pose detection, etc. The pre-processing consists of taking the original image and rotating it some amount several times and then building a composite image of all of the rotations, which is then passed to the neural network(s).

I’m new to the Nvidia/Jetson API, so I’ve started with the dusty-nv Hello-AI samples ( for example). The image captured from the input is in cudaImage format, and working with that is much faster than a standard OpenCV image on the Pi (obviously due to GPU acceleration), but I haven’t been able to find documentation on doing a lot of the operations I need on this format. For example, I have not found any Python example that shows how to do a rotate operation on a cudaImage. To get around this I converted the cudaImage to an OpenCV image (by converting it to a numpy array and adjusting BGR->RGB), but when profiling my code it turns out that converting a cudaImage->OpenCV is fairly slow (20-30ms) and converting the OpenCV image back to cudaImage for processing by the posenet class is even slower (40-60ms). That conversion time ends up negating any performance benefits that the GPU acceleration of the Jetson provides.

Is there a better way to do this? Is there some document that explains how to manipulate cudaImage classes beyond just resizing and such? Should I not be using the Hello-AI classes as a guide to port my non-NVIDIA codebase?

Sorry for the likely dumb questions, but I’m completely new to the NVIDIA ecosystem and it’s a little overwhelming to just jump right in.

Hi @mjasner, cudaToNumpy() can be a persistent mapping so that you only need to call it once per cudaImage at the initialization of the program. It shares the memory with the numpy array as opposed to copying it. Any changes you make to it from numpy/OpenCV will be reflected in CUDA, etc.
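
To illustrate the difference between a mapping and a copy, here is a minimal sketch that runs without a Jetson. A plain bytearray stands in for the image memory and np.frombuffer() stands in for cudaToNumpy(); both stand-ins are assumptions for illustration, not how jetson.utils is implemented.

```python
import numpy as np

HEIGHT, WIDTH, CHANNELS = 4, 6, 3

# Stand-in for the memory behind a cudaImage (assumption: a bytearray, not CUDA memory).
backing = bytearray(HEIGHT * WIDTH * CHANNELS)

# Stand-in for a one-time mapping: the array is a view onto `backing`, not a copy.
mapped = np.frombuffer(backing, dtype=np.uint8).reshape(HEIGHT, WIDTH, CHANNELS)

# A real copy, for comparison.
copied = mapped.copy()

# Writing through the view changes the underlying buffer...
mapped[0, 0] = (255, 0, 255)
# ...while writing to the copy does not.
copied[1, 1] = (9, 9, 9)

first_pixel = tuple(backing[0:3])
shares = np.shares_memory(mapped, copied)
print(first_pixel, shares)
```

Because the view and the buffer are the same memory, the mapping only has to be created once for a given buffer; after that, every write made through numpy or OpenCV lands directly in it.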

Also, if you are on the master branch, I’ve implemented the numpy __array__ interface for cudaImage (in addition to the numba __cuda_array_interface__ and pycuda gpudata interface), so technically cudaToNumpy() shouldn’t be needed anymore, but YMMV. You can see a test of that in where it uses cudaImage directly with those libraries without intermediate conversions.

Also, since you are just doing rotations, you may be able to skip the RGB<->BGR conversions for OpenCV. If you can do the rotations in-place on a cudaImage that’s already been mapped to numpy, then you shouldn’t need to convert it back to CUDA with cudaFromNumpy() either.

I do have CUDA functions that can be used for matrix warps (and hence rotations), but they don’t have Python bindings for them.

Thank you for the quick reply! I’m not sure I understand what you mean by persistent mapping that only needs to be called once. If I read a new frame every second (or faster, I’m just using 1FPS as a simple example) then wouldn’t I need to call it on every frame? For example, if my program flow is something like the following pseudocode:

Set up videoSource
while (runFlag is True):
    Get next frame
    Process frame
    Do inference
    Handle results

Wouldn’t I need to call cudaToNumpy() every time I capture a new frame? How would the mapping to numpy remain from a previous frame and be able to affect the current frame?

Also, when I do the rotation I’m then compositing the rotation(s) onto a new, larger image, so that new, larger image would still need to be converted from numpy back to cuda since it didn’t previously exist as a cuda image to begin with, correct?

I guess life would be easier if I could figure out how to do a rotation of the original CUDA image so I could cut out OpenCV entirely, because moving back and forth between the two is where I’m losing any gains from using CUDA at all. It sounds like your warp functions may be what I’m looking for. Perhaps I should look into migrating from Python to C/C++. There really isn’t a reason I’m using one over the other, other than that when I started the original version on other hardware the Python samples were better documented. Are there examples of using your warp functions in C/C++ that I could look at?

Thanks again! The help is greatly appreciated

These frames captured from videoSource are in a ringbuffer, so the buffers get re-used every N frames, but yes, that aside, I’d assumed you had some intermediary image that was statically allocated. Also, you could just try skipping the cudaToNumpy() altogether and make use of the __array__ interface that’s been implemented.
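
To make the ring-buffer point concrete, here is a small sketch in plain Python. RingSource is a hypothetical stand-in, not the videoSource implementation, and the buffer count of 4 is an arbitrary assumption. It shows why a captured frame that you keep a reference to is eventually overwritten, while a copy into your own buffer is not.

```python
class RingSource:
    """Hypothetical stand-in for a video source that recycles N buffers."""

    def __init__(self, num_buffers, frame_size):
        self._buffers = [bytearray(frame_size) for _ in range(num_buffers)]
        self._count = 0

    def capture(self):
        """Fill the next buffer with the frame number and return it (no copy)."""
        buf = self._buffers[self._count % len(self._buffers)]
        buf[:] = bytes([self._count % 256]) * len(buf)
        self._count += 1
        return buf


source = RingSource(num_buffers=4, frame_size=8)

# Persistent buffer allocated once, outside the capture loop.
persistent = bytearray(8)

held = source.capture()      # frame 0, memory still owned by the ring
persistent[:] = held         # copy frame 0 into our own buffer

for _ in range(4):           # four more captures wrap around the ring
    latest = source.capture()

held_value = held[0]         # now holds frame 4: the ring reused this buffer
kept_value = persistent[0]   # still frame 0
print(held_value, kept_value)
```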

I would recommend statically allocating the larger output image so that you aren’t re-allocating it each frame. Try to re-allocate only when necessary (i.e. only when image sizes change)
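
One way to structure "re-allocate only when necessary" is a small cache around the allocator. The sketch below is illustrative only: CompositeBuffer and allocate_rgb8 are hypothetical names, and a bytearray stands in for whatever the real allocator (for example cudaAllocMapped() on a Jetson) would return.

```python
class CompositeBuffer:
    """Caches one output buffer and re-allocates only when the size changes."""

    def __init__(self, allocate):
        self._allocate = allocate   # callable taking (width, height)
        self._size = None
        self._buffer = None
        self.allocations = 0        # counter, only here to show the behaviour

    def get(self, width, height):
        if (width, height) != self._size:
            self._buffer = self._allocate(width, height)
            self._size = (width, height)
            self.allocations += 1
        return self._buffer


def allocate_rgb8(width, height):
    # Stand-in allocator (assumption): 3 bytes per pixel in host memory.
    return bytearray(width * height * 3)


composite = CompositeBuffer(allocate_rgb8)
first = composite.get(2000, 720)
second = composite.get(2000, 720)   # same size: the cached buffer is reused
third = composite.get(1280, 720)    # size changed: a new buffer is allocated
print(composite.allocations)
```

Note that any numpy mapping of the old buffer would have to be recreated whenever a new buffer is allocated, since the mapping refers to the old memory.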

Here is a snippet that warps an image by a 3x3 homography matrix:
You can see above in that code where I am testing it by manually constructing a 3x3 matrix composed of rotations, translations, etc.
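
For readers without access to that snippet, here is a minimal sketch of how a 3x3 matrix for "rotate about the image center" can be composed from translations and a rotation. It uses only the standard library; the helper names are hypothetical, and whether a particular warp function expects this forward mapping or its inverse is not stated in this thread, so check before using it.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def translation(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]


def rotation(degrees):
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_about(cx, cy, degrees):
    """Move (cx, cy) to the origin, rotate, then move it back."""
    return matmul3(translation(cx, cy),
                   matmul3(rotation(degrees), translation(-cx, -cy)))


def apply(h, x, y):
    """Map a pixel coordinate through a 3x3 homography."""
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return xh / w, yh / w


# Rotate a 1280x720 frame by 90 degrees about its center.
H = rotation_about(640.0, 360.0, 90.0)
center = apply(H, 640.0, 360.0)           # the center stays where it is
right_of_center = apply(H, 740.0, 360.0)  # 100 px right of center ends up 100 px below it
print(center, right_of_center)
```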

An alternative would be to just write a Python binding for the cudaWarp() function that you need, similar to what is done here:

Ah, so I could just pre-allocate an image for storing the input frame and a larger image for the composite frame, call cudaToNumpy on them once, and then they’re persistently mapped and I can just overwrite them without needing to call cudaToNumpy again? That’s useful. I’ll give that a try.

Thanks again!

Sorry, one last, possibly dumb question. If the return from videoSource is from a ring buffer then what I should be doing is fetching a frame and then copying that frame into my pre-allocated/pre-mapped buffer with cudaMemcpy. Is that correct?

Yes, if you can pre-allocate the images then you can do that.

You could do that (you may need to call cudaDeviceSynchronize() after the cudaMemcpy()) or you could just utilize the __array__ interface of the cudaImage and skip the cudaToNumpy() altogether.

I’m not clear on how to use the __array__ interface. Regardless I’ve been able to write a demo of what you suggested, where I pre-allocate memory with cudaAllocMapped() and map it to numpy with cudaToNumpy(). Then I enter a loop where I read an image from gstCamera() and use cudaMemcpy() to copy the captured frame into the pre-allocated image. Then I can manipulate the mapped numpy array via OpenCV and I see the changes in the cudaImage when I display that with glDisplay. That makes sense.

What doesn’t make sense is going the other way. If I pre-allocate an output image the same way and map that to a numpy variable in the same way, when I then try and rotate the captured image (as above) and copy that into the output image’s mapped numpy array I don’t see any changes in the output image at all.

For reference, here is the python code I’m using:

# create display window
display = glDisplay()

# create camera device
camera = gstCamera(opt.width, opt.height,

# open the camera for streaming

#Preallocate input image
img_b = jetson.utils.cudaAllocMapped(width=1280, height=720, format="rgb8")

#Preallocate output image and map it to numpy once
comp=jetson.utils.cudaAllocMapped(width=2000, height=720, format="rgb8")
outImg = jetson.utils.cudaToNumpy(comp)

# capture frames until user exits
while display.IsOpen():
        image, width, height = camera.Capture(format="rgb8")
        jetson.utils.cudaMemcpy(img_b, image)

        outImg=cv2.line(outImg, (5,5), (250,250), (255,0,255), 5)

        display.RenderOnce(comp, outImg.shape[1], outImg.shape[0])
# close the camera

This works fine. But if I then try and do something more complex, like taking the original input image, rotating it 90 degrees, and copying it to outImg (the numpy array mapped from the comp cudaImage) then I just get a black screen.
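
One possible cause of the black screen (an assumption; the rotation code is not shown in this thread) is that the result of the rotation is assigned to the variable name instead of being written into the mapped array. The sketch below reproduces that difference with numpy only, using bytearrays as stand-ins for the mapped cudaImages and tiny image sizes chosen for illustration.

```python
import numpy as np

# Stand-ins (assumptions): bytearrays play the role of the mapped cudaImages.
frame_mem = bytearray(4 * 4 * 3)     # 4x4 RGB input frame
comp_mem = bytearray(4 * 10 * 3)     # 4 rows x 10 columns RGB composite

frame = np.frombuffer(frame_mem, dtype=np.uint8).reshape(4, 4, 3)
out_img = np.frombuffer(comp_mem, dtype=np.uint8).reshape(4, 10, 3)
frame[:] = 200                       # pretend this is a captured frame

rotated = np.rot90(frame)            # a different array object, not out_img

# Rebinding: `alias` now names the rotated array; the composite memory is untouched.
alias = out_img
alias = rotated
after_rebind = sum(comp_mem)

# In-place slice assignment: the pixels are copied into the mapped memory.
h, w = rotated.shape[:2]
out_img[:h, :w] = rotated
after_slice = sum(comp_mem)
print(after_rebind, after_slice)
```

Also note the shapes in the code above: a 1280x720 frame rotated by 90 degrees is 720 wide and 1280 tall, which does not fit into a composite that is only 720 rows high, so it would need to be resized or cropped before the slice assignment.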

Additionally, I still have to upload/download the numpy array to the GPU in OpenCV frequently, which seems wasteful. Do you think I should just focus instead on learning to rotate the cudaImage directly instead of pushing it to OpenCV to do the rotation and then back? This is getting more convoluted than I think it needs to be, huh?

Thanks again

The __array__ interface means that you should be able to use a cudaImage just like you would a numpy array, because it implements that interface.
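
As a rough illustration of what implementing that interface means, here is a sketch with a hypothetical FakeImage class (not cudaImage itself). Any object with an __array__ method can be passed to numpy functions, and np.asarray() can return the object's own memory rather than a copy.

```python
import numpy as np


class FakeImage:
    """Hypothetical image type that exposes its pixels through __array__."""

    def __init__(self, width, height):
        self._pixels = np.zeros((height, width, 3), dtype=np.uint8)

    def __array__(self, dtype=None, copy=None):
        # Hand back the underlying array without copying when no conversion is requested.
        if dtype is None or np.dtype(dtype) == self._pixels.dtype:
            return self._pixels
        return self._pixels.astype(dtype)


img = FakeImage(width=6, height=4)

view = np.asarray(img)             # NumPy calls img.__array__() behind the scenes
view[0, 0] = (10, 20, 30)          # writes through to the object's own pixels

mean_value = float(np.mean(img))   # NumPy functions accept the object directly
shape = view.shape
print(shape, mean_value)
```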

Yes, I would give it a try just making a Python binding for cudaWarpPerspective() or cudaWarpAffine(). Sorry I am catching up after the holidays, but I will add it to my TODO list.

No apologies necessary at all! You’ve been massively helpful! I’m going to spend some time reading up on the cudaWarpAffine() function since I’ve seen C/C++ examples of that in action and either try and make a python binding for it or just start porting my python code to C/C++, which is something I would ultimately like to do anyway.

Again, thanks very much for the help! It is greatly appreciated!

Not sure about your case, but you might want to check whether cv::cuda::rotate provides in-place rotation at the same resolution. If not, another GpuMat might be allocated. In that case, you may try to crop/rescale the final rotation into the original buffer.
