Optimizing VPI Stereo Disparity on Jetson Orin 64GB - Seeking to Maximize Memory Usage

I am working on optimizing a stereo disparity pipeline using VPI on a Jetson Orin with 64GB of shared memory. My goal is to fully saturate the available memory to maximize throughput.

Current Approach:

I have wrapped the VPI stereo disparity example in a C++ function and am invoking it from Python using pybind11. To parallelize the workload, I am using Python’s multiprocessing library to spawn 12 processes, one for each of the Orin’s CPU cores.
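Roughly, the binding layer looks like the following. This is a simplified sketch of the setup described above, not my actual code; the module and function names are placeholders:

// Simplified pybind11 binding sketch (module/function names are placeholders).
#include <pybind11/pybind11.h>
#include <string>

namespace py = pybind11;

// Thin C++ wrapper around the VPI stereo disparity example.
int run_stereo_disparity(const std::string &leftPath,
                         const std::string &rightPath,
                         const std::string &outPath)
{
    // Placeholder: the real implementation creates the VPI stream/payload,
    // runs the stereo disparity estimator, and writes the depth map to outPath.
    return 0;
}

PYBIND11_MODULE(vpi_stereo, m)
{
    // Release the GIL while the C++/VPI work runs so the Python side
    // spends its time inside VPI rather than in the interpreter.
    m.def("run_stereo_disparity", &run_stereo_disparity,
          py::call_guard<py::gil_scoped_release>(),
          py::arg("left_path"), py::arg("right_path"), py::arg("out_path"));
}

Each of the 12 worker processes imports this module and calls run_stereo_disparity on its own image pair.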

Current Performance & Problem:

Each process consumes approximately 3GB of the shared memory to generate a single depth map. With 12 processes running in parallel, the total memory consumption is around 36GB.

This leaves about 28GB of shared memory unused. My objective is to leverage this remaining memory to further increase the number of concurrent stereo disparity estimations.

What I’ve Tried:

I attempted to use C++ threads (std::thread) within my C++ wrapper to submit more tasks to the VPI pipeline. However, I found that the overhead of spawning and joining the threads after task submission degraded performance compared to the multiprocessing approach.
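Roughly, the threaded attempt looked like this (a simplified sketch, not my exact code; the helper names are placeholders):

// Simplified sketch of the threaded attempt (names are placeholders).
#include <thread>
#include <vector>

void submit_one_disparity(int taskIndex)
{
    // Placeholder for one VPI stereo disparity submission + sync.
}

void run_batch_with_threads(int numTasks)
{
    std::vector<std::thread> workers;
    workers.reserve(numTasks);
    for (int i = 0; i < numTasks; ++i)
        workers.emplace_back(submit_one_disparity, i);

    // Joining blocks until every thread finishes; with short-lived tasks the
    // thread create/join overhead ends up dominating the useful work.
    for (auto &t : workers)
        t.join();
}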

Questions:

  1. Is there a more effective way to batch or parallelize the VPI stereo disparity pipeline to utilize the full 64GB of memory?
  2. Are there any best practices or recommended design patterns for scaling VPI workloads to this extent on the Jetson Orin?

I would appreciate any insights or suggestions from the community on how to better approach this optimization problem.

Hi,

You can find some tips below:

Run the algorithm in batches and measure its average running time within each batch. The number of calls per batch scales with the algorithm's approximate running time (the faster the algorithm, the larger the batch, up to a maximum of 100 calls). This keeps the time spent performing the measurement itself out of the reported algorithm runtime.

For example:
In our 05_benchmark sample, we submit a batch of VPI tasks and then synchronize only once at the end, which increases performance.

// Record stream queue when we start processing
CHECK_STATUS(vpiEventRecord(evStart, stream));
  
// Get the average running time within this batch.
for (int i = 0; i < AVERAGING_COUNT; ++i)
{
    // Call the algorithm to be measured.
    CHECK_STATUS(vpiSubmitGaussianFilter(stream, backend, image, blurred, 5, 5, 1, 1, VPI_BORDER_ZERO));
}
  
// Record stream queue just after blurring
CHECK_STATUS(vpiEventRecord(evStop, stream));
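Applying the same pattern to your workload would mean submitting many stereo disparity estimations to one stream and synchronizing only once per batch. Below is a minimal sketch in the same setting as the snippet above; it assumes a stereo payload, parameters, and the input/output images (stereoPayload, stereoParams, left, right, disparity) were created beforehand, and those identifiers are illustrative rather than taken from the sample:

// Sketch only: stream, evStart/evStop, stereoPayload, stereoParams,
// left, right and disparity are assumed to exist already.
CHECK_STATUS(vpiEventRecord(evStart, stream));

// Queue a whole batch of stereo disparity estimations; submissions are
// asynchronous, so the CPU is not blocked between calls.
for (int i = 0; i < AVERAGING_COUNT; ++i)
{
    CHECK_STATUS(vpiSubmitStereoDisparityEstimator(stream, VPI_BACKEND_CUDA, stereoPayload,
                                                   left, right, disparity, NULL, &stereoParams));
}

// Synchronize once for the whole batch and average the elapsed time.
CHECK_STATUS(vpiEventRecord(evStop, stream));
CHECK_STATUS(vpiEventSync(evStop));

float elapsedMS = 0.0f;
CHECK_STATUS(vpiEventElapsedTimeMillis(evStart, evStop, &elapsedMS));
printf("Average time per call: %f ms\n", elapsedMS / AVERAGING_COUNT);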

Thanks.
