How to share GPU buffer between VPI, and Jetson Inference Detect in a child thread?

Hardware: Xavier NX
Deepstream 6.1
jetpack 5.0.2-b231
tensorRT/nvinfer 8.4.1
Python 3.8

Hello, we have a main-thread loop which pulls in images, then uses VPI to upload them to the GPU, perform an image warp, lock the result back to the CPU, and send it to our endpoint

(pseudo code)

while True:
    # (get image)
    with vpi.Backend.CUDA:
        frame1 = vpi.asimage(timestamp_img)
        frame1 = frame1.perspwarp(hom)
    with frame1.rlock_cpu() as data:
        out_stream.write(data.copy())

This main thread creates a child thread which uses a Jetson Inference detectNet object (pseudo code):

net = detectNet(
    "ssd-mobilenet-v2",
    threshold=0.1)

while True:
    # (get image somehow)
    cuda_mem = jetson_utils.cudaFromNumpy(img)
    detections = net.Detect(cuda_mem)

This child thread also runs in a loop performing detections, but it is too slow to put in the main thread, as we need 30 fps.

My question is: how do I share the input images from the main thread with the child thread? The images are 2 × 4K, so I am hoping for something performant. Ideally I could share/copy the VPI GPU object like a pointer and pass it with a Python queue. If I pass copied images in a queue it's very slow, and if I use shared memory it seems hacky.

Any help appreciated

The particulars of our application require this parallelism: the output to the user must be 30 fps and must not be limited by inference, which will run at a lower FPS due to the large images.

Hi @liellplane, please refer to this other recent topic and my suggestion to use a cudaImage mapping as the output of VPI, so that the data is already in the cudaImage:

The detectNet is going to downsample your images to 300x300 for inference anyway, so I might recommend downsampling them to a resolution lower than 4K on the VPI side before you even copy them to the inference thread.
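To sketch how that could fit together (pseudo code; the exact API names `rescale`, `cudaAllocMapped`, `cudaToNumpy`, and the 1024x576 size are assumptions to check against your VPI / jetson-utils versions):

    # allocate one mapped cudaImage up front and reuse it each frame
    det_img = jetson_utils.cudaAllocMapped(width=1024, height=576, format='rgb8')
    det_np = jetson_utils.cudaToNumpy(det_img)  # numpy view onto the same mapped memory

    with vpi.Backend.CUDA:
        frame1 = vpi.asimage(timestamp_img)
        frame1 = frame1.perspwarp(hom)
        small = frame1.rescale((1024, 576))     # shrink before handing to inference
    with small.rlock_cpu() as data:
        det_np[:] = data                        # lands directly in the mapped cudaImage
    frame_queue.put(det_img)                    # the child thread can call net.Detect(det_img)

The point is that the downsampled pixels land in memory the cudaImage already owns, so only the cudaImage reference needs to cross the thread boundary.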


That's a good point about the input size. I might have to process the images in segments, though, so it will probably be a big payload regardless.

Some good reading there. So this cudaImage can be passed from one thread to another in, for example, a Python queue?

Yes, I have done basic Python multithreading with it. It does get more complicated to manage if/when you start doing CUDA processing from multiple threads/streams and need to handle the synchronization. The cudaImages are typically allocated as "mapped" memory, so they can be accessed from the CPU too.
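For reference, a minimal stdlib sketch of that handoff pattern (a plain dict stands in for the cudaImage here; the bounded queue drops stale frames so the 30 fps capture loop never blocks on the slower inference thread):

```python
# Sketch: pass frame *references* between threads through a bounded queue.
# Only the reference crosses the queue, so no pixel data is copied.
import queue
import threading

frame_q = queue.Queue(maxsize=1)  # depth 1: the worker always sees the newest frame

def submit_latest(frame):
    """Called from the capture loop; never blocks, drops a stale frame if present."""
    try:
        frame_q.put_nowait(frame)
    except queue.Full:
        try:
            frame_q.get_nowait()  # discard the unconsumed older frame
        except queue.Empty:
            pass
        frame_q.put_nowait(frame)  # safe: this is the only producer

# Demonstrate the drop-stale behaviour before the consumer starts:
old, new = {"frame": 1}, {"frame": 2}
submit_latest(old)
submit_latest(new)  # `old` is discarded; the queue now holds only `new`

results = []

def inference_worker():
    while True:
        frame = frame_q.get()
        if frame is None:  # sentinel: shut down
            break
        results.append(frame)  # in the real app: run net.Detect(...) here

t = threading.Thread(target=inference_worker, daemon=True)
t.start()
frame_q.put(None)  # blocks until `new` is consumed, then signals shutdown
t.join()
# `results` now holds the exact same object that was submitted, not a copy
```

In the real app the queue entries would be the mapped cudaImages and the worker would call net.Detect() on them; the key point is that enqueueing is reference-only and the main loop never waits on inference.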

