I am working on optimizing a stereo disparity pipeline using VPI on a Jetson Orin with 64GB of shared (CPU/GPU unified) memory. My goal is to saturate the available memory to maximize throughput.
Current Approach:
I have wrapped the VPI stereo disparity example in a C++ function and am invoking it from Python using pybind11. To parallelize the workload, I am using Python’s multiprocessing library to spawn 12 processes, one for each of the Orin’s CPU cores.
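For reference, the wrapper is structured roughly like the sketch below. This is simplified rather than my exact code: the module name vpi_stereo, the compute_disparity signature, the image formats, and the CUDA backend choice are placeholders, VPI status checking is omitted, and the actual input/output plumbing between Python and the images is left out.

```cpp
// wrapper.cpp -- built as a Python extension module with pybind11 and linked against VPI.
#include <pybind11/pybind11.h>

#include <vpi/Image.h>
#include <vpi/Stream.h>
#include <vpi/algo/StereoDisparity.h>

namespace py = pybind11;

// One disparity estimation per call: create stream/payload/images, submit,
// sync, tear down. In the real code the rectified pair comes in from Python
// and the disparity map goes back out; that plumbing is omitted here.
void compute_disparity(int width, int height)
{
    VPIStream  stream    = nullptr;
    VPIImage   left      = nullptr, right = nullptr, disparity = nullptr;
    VPIPayload payload   = nullptr;

    VPIStereoDisparityEstimatorCreationParams createParams;
    vpiInitStereoDisparityEstimatorCreationParams(&createParams);

    VPIStereoDisparityEstimatorParams submitParams;
    vpiInitStereoDisparityEstimatorParams(&submitParams);

    vpiStreamCreate(0, &stream);
    vpiImageCreate(width, height, VPI_IMAGE_FORMAT_Y16_ER, 0, &left);
    vpiImageCreate(width, height, VPI_IMAGE_FORMAT_Y16_ER, 0, &right);
    vpiImageCreate(width, height, VPI_IMAGE_FORMAT_S16,    0, &disparity);

    vpiCreateStereoDisparityEstimator(VPI_BACKEND_CUDA, width, height,
                                      VPI_IMAGE_FORMAT_Y16_ER, &createParams, &payload);

    // Submit one stereo pair and block until the result is ready.
    vpiSubmitStereoDisparityEstimator(stream, VPI_BACKEND_CUDA, payload,
                                      left, right, disparity, nullptr, &submitParams);
    vpiStreamSync(stream);

    vpiPayloadDestroy(payload);
    vpiImageDestroy(disparity);
    vpiImageDestroy(right);
    vpiImageDestroy(left);
    vpiStreamDestroy(stream);
}

PYBIND11_MODULE(vpi_stereo, m)
{
    m.def("compute_disparity", &compute_disparity,
          py::arg("width"), py::arg("height"));
}
```

Each of the 12 multiprocessing workers imports this module and calls compute_disparity in a loop.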
Current Performance & Problem:
Each process consumes approximately 3GB of the shared memory to generate a single depth map. With 12 processes running in parallel, the total memory consumption is around 36GB.
This leaves about 28GB of shared memory unused. My objective is to leverage this remaining memory to further increase the number of concurrent stereo disparity estimations.
What I’ve Tried:
I attempted to use C++ threads (std::thread) within my C++ wrapper to submit more tasks to the VPI pipeline. However, the overhead of joining the threads after task submission made this slower than the multiprocessing approach.
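In outline, the threaded variant looked something like this (again simplified; compute_disparity stands for the per-call create/submit/sync/teardown path from the sketch above, and run_batch is a hypothetical batch entry point exposed to Python):

```cpp
#include <thread>
#include <vector>

// The per-call create/submit/sync/teardown from the wrapper sketch above.
void compute_disparity(int width, int height);

// Fan a batch of submissions out across C++ threads (each call creates its
// own VPIStream internally), then join them all before returning to Python.
// The per-batch join overhead after submission is what made this slower than
// the multiprocessing approach for me.
void run_batch(int batchSize, int width, int height)
{
    std::vector<std::thread> workers;
    workers.reserve(batchSize);

    for (int i = 0; i < batchSize; ++i)
        workers.emplace_back(compute_disparity, width, height);

    for (std::thread &t : workers)
        t.join();
}
```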
Questions:
- Is there a more effective way to batch or parallelize the VPI stereo disparity pipeline to utilize the full 64GB of memory?
- Are there any best practices or recommended design patterns for scaling VPI workloads to this extent on the Jetson Orin?
I would appreciate any insights or suggestions from the community on how to better approach this optimization problem.