Getting image bits to GPU for Inference (DetectNet)

I have a system in place where a Python process starts up, connects to the camera, and begins pulling frames for other Python processes to digest. The main process takes the output and serves it via MJPG to a client. I was primarily using this with OpenCV, and I use a ramdisk for speed to transfer the bits between processes. With OpenCV, saving directly from the camera to a numpy array and back again works super fast.

Now the problem: I just started messing with DetectNet and the jetson.inference packages. This thing wants the bits in GPU RAM, and I can't find a quick way to get them into the right form… I hacked it together by reading the numpy array from RAM, saving it to a PNG in RAM, and then reading the PNG with jetson.utils.loadImageRGBA. For whatever reason, saving to PNG is really slow: a 1280x720 image takes about 2 seconds. The PNG step uses PIL.

Is there a quicker way to get a numpy array (or similar) into GPU memory to be processed by the nets?

Hi, there is a cudaFromNumpy() function in the jetson.utils module, and a sample for it can be found in jetson-inference/utils/python/examples.

It's not super fast, but it is a lot faster than saving to disk.
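For anyone following along, the host-side preparation for that path can be sketched with plain numpy. The frame below is a hypothetical stand-in for one pulled off the ramdisk; the jetson.utils calls in the comments are the ones from the jetson-inference Python API and would only run on the Jetson itself.

```python
import numpy as np

# Hypothetical stand-in for a 1280x720 BGR frame as OpenCV delivers it.
frame = np.zeros((720, 1280, 3), dtype=np.uint8)

# cudaFromNumpy() copies the array contents into CUDA mapped memory,
# so the only host-side work is making sure the array is contiguous
# (slicing or transposing a numpy array can leave it non-contiguous).
frame = np.ascontiguousarray(frame)

# On the Jetson, the PNG round-trip is then replaced by:
#   import jetson.utils
#   cuda_img = jetson.utils.cudaFromNumpy(frame)
```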

Thanks dusty, I looked again after I posted and found those in the README; I must have missed them the last time I looked!

However, I attempted to use it and kept getting an error when passing the CUDA capsule to the Detect function. I tried hardcoding the width and height, swapping them, and everything else I could think of, but I can't seem to get it working. The error seems generic, so I'm not sure what is happening:

detectNet -- maximum bounding boxes: 100
detectNet -- loaded 91 class info entries
detectNet -- number of object classes: 91
jetson.utils -- cudaFromNumpy() ndarray dim 0 = 720
jetson.utils -- cudaFromNumpy() ndarray dim 1 = 1280
jetson.utils -- cudaFromNumpy() ndarray dim 2 = 3
Width 1280 Height 720
[TRT] engine.cpp (555) - Cuda Error in execute: 4
[TRT] engine.cpp (555) - Cuda Error in execute: 4
[TRT] detectNet::Detect() -- failed to execute TensorRT context
Traceback (most recent call last):
File "/var/www/html/robot1/opencv_scripts/", line 107, in
File "/var/www/html/robot1/opencv_scripts/", line 87, in main
File "/var/www/html/robot1/opencv_scripts/", line 47, in procFrame
detections = net.Detect(cuda_mem, 1280, 720)
Exception: jetson.inference -- detectNet.Detect() encountered an error classifying the image
jetson.utils -- freeing CUDA mapped memory
[cuda] cudaFreeHost(ptr)
[cuda] unspecified launch failure (error 4) (hex 0x04)
[cuda] /home/robot1/jetson-inference/utils/python/bindings/PyCUDA.cpp:121
jetson.utils -- failed to free CUDA mapped memory with cudaFreeHost()
[cuda] cudaFreeHost(mDetectionSets[0])
[cuda] unspecified launch failure (error 4) (hex 0x04)
[cuda] /home/robot1/jetson-inference/c/detectNet.cpp:66
[cuda] cudaFreeHost(mClassColors[0])
[cuda] unspecified launch failure (error 4) (hex 0x04)
[cuda] /home/robot1/jetson-inference/c/detectNet.cpp:74
[TRT] runtime.cpp (30) - Cuda Error in free: 4
terminate called after throwing an instance of 'nvinfer1::CudaError'
what(): std::exception

Thanks for your work on this library, btw!

Also, I added a print(cuda_mem) and it returned OK, but it still failed when passed to Detect:

<capsule object "jetson.utils.cudaAllocMapped" at 0x7f61fabab0>

Ah, it says dim 2 = 3, so it is a 3-channel image. However, DetectNet expects a 4-channel RGBA image. Try passing a 4-channel image from OpenCV to cudaFromNumpy() and see if that helps.

Is there a quick way to convert to RGBA from within jetson.inference? I grab the frame from GStreamer, and it's a Pi v2 camera, which seems to only deliver the frame in RGB. I noticed in the code there is a gstCamera::ConvertRGBA which does a cudaRGB8ToRGBA32… is that something I can access from Python? I am still getting used to the ins and outs of the Python bindings, as I normally work in C. I could probably just resize the numpy array once more and set the whole A channel to 255, or leave it 0?
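The "set the A channel to 255" idea can be done entirely in numpy before handing the array to cudaFromNumpy(). A minimal sketch, assuming a hypothetical 720p RGB frame (on the Jetson, `rgba` would then go to jetson.utils.cudaFromNumpy()):

```python
import numpy as np

def rgb_to_rgba(rgb, alpha=255):
    """Append a constant alpha channel to an HxWx3 RGB array."""
    h, w, _ = rgb.shape
    rgba = np.empty((h, w, 4), dtype=rgb.dtype)
    rgba[..., :3] = rgb       # copy the RGB channels
    rgba[..., 3] = alpha      # fill the alpha channel
    return rgba

# Hypothetical 720p RGB frame.
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
rgba = rgb_to_rgba(frame)
print(rgba.shape)  # -> (720, 1280, 4)
```

cv2.cvtColor with cv2.COLOR_RGB2RGBA does the same thing and is typically faster, but the pure-numpy version avoids a dependency if the frame is already in memory as an array.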

There aren't Python bindings implemented for the CUDA colorspace conversion utilities, although you could write one. Could you use gstCamera instead? It has Python bindings and works with the Raspberry Pi Camera Module v2 (and USB V4L2 webcams as well). See for an example of that.

This works well for me; however, it's not as fast as I'd like:

tsr_imga = cv2.cvtColor(tsr_img, cv2.COLOR_BGR2RGBA)
cuda_mem = jetson.utils.cudaFromNumpy(tsr_imga)
#print(cuda_mem)
# Classify() takes (image, width, height); numpy shape is (height, width, channels)
class_idx, confidence = net.Classify(cuda_mem, tsr_img.shape[1], tsr_img.shape[0])

My implementations of Python TensorRT demos all use numpy arrays for images. The demos include pure image classification (GoogLeNet), face detection (MTCNN), and SSD object detection. They all run pretty fast on the Jetson Nano. (All FPS numbers take into account image preprocessing, postprocessing, and rendering.)

GoogLeNet: 60 FPS
SSD (MobileNet_V1_COCO): ~26 FPS

Refer to my GitHub repository and blog posts for details.

jkjung13, thanks for sharing.
I'm having issues installing tensorflow-gpu, but once I get past that, I'll definitely try your approach.

I put all the necessary steps to build/install tensorflow-1.12.2 (with GPU support) on the Jetson Nano into 3 simple shell scripts. You can refer to my blog post for how to run the scripts:

I was following the notes from here for tensorflow-gpu, and you are doing a lot more.
It also seems I'm using an older version of TensorRT:
[TRT] TensorRT version 5.0.6
so I'll need to wait a bit before moving away from the jetson-inference context.