How to share GPU buffer between VPI, and Jetson Inference Detect in a child thread?

Hardware: Xavier NX
Deepstream 6.1
jetpack 5.0.2-b231
tensorRT/nvinfer 8.4.1
Python 3.8

Hello, we have a main-thread loop which pulls in images, then uses VPI to upload them to the GPU, perform an image warp, lock the result back to the CPU, and send it to our endpoint

(pseudo code)

while True:
    # (get image)
    with vpi.Backend.CUDA:
        frame1 = vpi.asimage(timestamp_img)
        frame1 = frame1.perspwarp(hom)
    with frame1.rlock_cpu() as data:
        out_stream.write(data.copy())

This main thread creates a child thread which uses a Jetson Inference detectNet object (pseudo code):

net = detectNet(
    "ssd-mobilenet-v2",
    threshold=0.1)

while True:
    # (get image somehow)
    cuda_mem = jetson_utils.cudaFromNumpy(img)
    detections = net.Detect(cuda_mem)

This child thread also runs in a loop performing detections, but it is too slow to put in the main thread, as we need 30 fps.

My question is: how do I share the input images from the main thread with the child thread? The images are 2 × 4K, so I am hoping for something performant. Ideally I could share/copy the VPI GPU object like a pointer and pass it with a Python queue. If I pass copied images in a queue it's very slow, and if I use shared memory it seems hacky.

Any help appreciated

The particulars of our application require this parallelism: the output to the user must be 30 fps and must not be limited by inference, which will run at a lower FPS due to the large images.

Hi @liellplane, please refer to this other recent topic and my suggestion to use a cudaImage mapping as the output of VPI, so that the data is already in the cudaImage:

The detectNet is going to downsample your images to 300x300 for inference anyway, so I might recommend downsampling them to a resolution lower than 4K on the VPI side before you even copy them to the inference thread.
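To sketch how that could fit together (pseudo code; the exact API names `rescale`, `cudaAllocMapped`, `cudaToNumpy`, and the 1024x576 size are assumptions to check against your VPI / jetson-utils versions):

    # allocate one mapped cudaImage up front and reuse it each frame
    det_img = jetson_utils.cudaAllocMapped(width=1024, height=576, format='rgb8')
    det_np = jetson_utils.cudaToNumpy(det_img)  # numpy view onto the same mapped memory

    with vpi.Backend.CUDA:
        frame1 = vpi.asimage(timestamp_img)
        frame1 = frame1.perspwarp(hom)
        small = frame1.rescale((1024, 576))     # shrink before handing to inference
    with small.rlock_cpu() as data:
        det_np[:] = data                        # lands directly in the mapped cudaImage
    frame_queue.put(det_img)                    # the child thread can call net.Detect(det_img)

The point is that the downsampled pixels land in memory the cudaImage already owns, so only the cudaImage reference needs to cross the thread boundary.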


That's a good point about the input size. I might have to process the images in segments, though, so it will probably be a big payload regardless.

Some good reading there. So this cudaImage can be passed from one thread to another in, for example, a Python queue?

Yes, I have done basic Python multithreading with it. It does get more complicated to manage if/when you start doing CUDA processing from multiple threads/streams and need to handle the synchronization. The cudaImages are typically allocated as "mapped" memory, so they can be accessed from the CPU too.
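For reference, a minimal stdlib sketch of that handoff pattern (a plain dict stands in for the cudaImage here; the bounded queue drops stale frames so the 30 fps capture loop never blocks on the slower inference thread):

```python
# Sketch: pass frame *references* between threads through a bounded queue.
# Only the reference crosses the queue, so no pixel data is copied.
import queue
import threading

frame_q = queue.Queue(maxsize=1)  # depth 1: the worker always sees the newest frame

def submit_latest(frame):
    """Called from the capture loop; never blocks, drops a stale frame if present."""
    try:
        frame_q.put_nowait(frame)
    except queue.Full:
        try:
            frame_q.get_nowait()  # discard the unconsumed older frame
        except queue.Empty:
            pass
        frame_q.put_nowait(frame)  # safe: this is the only producer

# Demonstrate the drop-stale behaviour before the consumer starts:
old, new = {"frame": 1}, {"frame": 2}
submit_latest(old)
submit_latest(new)  # `old` is discarded; the queue now holds only `new`

results = []

def inference_worker():
    while True:
        frame = frame_q.get()
        if frame is None:  # sentinel: shut down
            break
        results.append(frame)  # in the real app: run net.Detect(...) here

t = threading.Thread(target=inference_worker, daemon=True)
t.start()
frame_q.put(None)  # blocks until `new` is consumed, then signals shutdown
t.join()
# `results` now holds the exact same object that was submitted, not a copy
```

In the real app the queue entries would be the mapped cudaImages and the worker would call net.Detect() on them; the key point is that enqueueing is reference-only and the main loop never waits on inference.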

