Shared memory parallel processing for jetson inference


I am working on a Jetson Nano with JetPack 4.5.1.

I am working in Python 3.6 and attempting to create a producer/consumer system to efficiently perform inference and other operations on images. I am hoping to use a producer process and consumer processes that communicate through shared memory, so that multiple cores can encode JPEG binaries, perform object detection, and more in parallel.

Basically, the goal is a camera capturing at 30 fps with multiple processes pulling from it simultaneously. I'm not worried about each process fetching every frame; rather, each process should be able to pull the most recent frame whenever it is ready.

EX: Camera pulling frames at 30fps
→ Object detection at 20fps
→ filtering the image and encoding a jpeg binary to send elsewhere at 28fps
→ saving images locally at 5fps

I am using numpy and the Python multiprocessing module, but am running into very strange issues that I suspect come from touching CUDA zero-copy memory across processes. Could anyone point me to better tools for distributing a video source to multiple processes for performance?
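For reference, the "latest frame wins" pattern I'm describing could be sketched on the CPU side with `multiprocessing.Array` as the shared buffer plus a frame counter (this is only a minimal sketch; the frame size and the synthetic "capture" are placeholders, not my real camera code):

```python
import ctypes
import multiprocessing as mp

import numpy as np

# Frame geometry is an assumption -- adjust to the camera's actual output.
H, W, C = 720, 1280, 3


def producer(buf, counter, n_frames):
    """Write frames into shared memory; consumers grab the latest one."""
    frame = np.frombuffer(buf.get_obj(), dtype=np.uint8).reshape(H, W, C)
    for i in range(n_frames):
        with buf.get_lock():
            frame[:] = i % 256          # stand-in for a real camera capture
            counter.value += 1          # bump the frame counter under the lock


def consumer(buf, counter, results):
    """Copy out the most recent frame whenever this process is ready."""
    frame = np.frombuffer(buf.get_obj(), dtype=np.uint8).reshape(H, W, C)
    grabbed = 0
    while grabbed < 1:                  # a single grab is enough for the demo
        with buf.get_lock():
            if counter.value > 0:
                snapshot = frame.copy()  # work on the copy outside the lock
                grabbed = counter.value
    results.put((grabbed, int(snapshot[0, 0, 0])))


def demo():
    """Run one producer and one consumer against the shared frame buffer."""
    buf = mp.Array(ctypes.c_uint8, H * W * C, lock=True)
    counter = mp.Value(ctypes.c_uint64, 0)
    results = mp.Queue()
    p = mp.Process(target=producer, args=(buf, counter, 5))
    c = mp.Process(target=consumer, args=(buf, counter, results))
    p.start(); c.start()
    p.join(); c.join()
    return results.get()


if __name__ == "__main__":
    seen, first_pixel = demo()
    print("grabbed frame", seen, "pixel value", first_pixel)
```

This works for CPU-resident numpy arrays; my problems start when the frames live in CUDA zero-copy memory.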


Hi @daniel181, CUDA memory isn’t shared across processes, so the threads would need to be intra-process.

The most efficient model would ideally be to perform your image processing operations on the GPU and just queue them in a pipeline with the inferencing. There's typically no need for CPU multithreading with that approach.
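As a rough sketch of that single-process GPU pipeline using the jetson-inference Python bindings (the model name, camera URI, and display sink below are assumptions for illustration; this only runs on the Jetson itself):

```python
import jetson.inference
import jetson.utils

# Load a detection network onto the GPU.
# The model name here is an assumption; use whichever network you need.
net = jetson.inference.detectNet("ssd-mobilenet-v2", threshold=0.5)

# Capture frames from the CSI camera and render to the display.
# Frames stay in CUDA/zero-copy memory, so there are no CPU copies
# between capture, inference, and rendering.
camera = jetson.utils.videoSource("csi://0")
output = jetson.utils.videoOutput("display://0")

while output.IsStreaming():
    img = camera.Capture()           # CUDA image (zero-copy)
    detections = net.Detect(img)     # inference queued on the GPU
    output.Render(img)               # renders the frame with overlays
```

Because everything stays on the GPU in one process, the capture/inference/render stages already overlap without any CPU-side multiprocessing.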

Sounds good. Do you mind pointing me in the direction of where to learn how to queue a GPU pipeline?

CUDA kernels are launched asynchronously: the host thread doesn't block until you synchronize on a CUDA stream or call cudaDeviceSynchronize().

If you are using numpy for your operations today, you may want to look into cupy, which is like the CUDA version of numpy.