Optimizing VPI Stereo Disparity on Jetson Orin 64GB - Seeking to Maximize Memory Usage

I am working on optimizing a stereo disparity pipeline using VPI on a Jetson Orin with 64GB of shared memory. My goal is to fully saturate the available memory to maximize throughput.

Current Approach:

I have wrapped the VPI stereo disparity example in a C++ function and am invoking it from Python using pybind11. To parallelize the workload, I am using Python’s multiprocessing library to spawn 12 processes, one for each of the Orin’s CPU cores.
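Roughly, the binding layer looks like the following. This is a simplified sketch of the setup described above, not my actual code; the module and function names are placeholders:

// Simplified pybind11 binding sketch (module/function names are placeholders).
#include <pybind11/pybind11.h>
#include <string>

namespace py = pybind11;

// Thin C++ wrapper around the VPI stereo disparity example.
int run_stereo_disparity(const std::string &leftPath,
                         const std::string &rightPath,
                         const std::string &outPath)
{
    // Placeholder: the real implementation creates the VPI stream/payload,
    // runs the stereo disparity estimator, and writes the depth map to outPath.
    return 0;
}

PYBIND11_MODULE(vpi_stereo, m)
{
    // Release the GIL while the C++/VPI work runs so the Python side
    // spends its time inside VPI rather than in the interpreter.
    m.def("run_stereo_disparity", &run_stereo_disparity,
          py::call_guard<py::gil_scoped_release>(),
          py::arg("left_path"), py::arg("right_path"), py::arg("out_path"));
}

Each of the 12 worker processes imports this module and calls run_stereo_disparity on its own image pair.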

Current Performance & Problem:

Each process consumes approximately 3GB of the shared memory to generate a single depth map. With 12 processes running in parallel, the total memory consumption is around 36GB.

This leaves about 28GB of shared memory unused. My objective is to leverage this remaining memory to further increase the number of concurrent stereo disparity estimations.

What I’ve Tried:

I attempted to use C++ threads (std::thread) within my C++ wrapper to submit more tasks to the VPI pipeline. However, I found that the overhead of spawning and joining the threads after task submission degraded performance compared to the multiprocessing approach.
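Roughly, the threaded attempt looked like this (a simplified sketch, not my exact code; the helper names are placeholders):

// Simplified sketch of the threaded attempt (names are placeholders).
#include <thread>
#include <vector>

void submit_one_disparity(int taskIndex)
{
    // Placeholder for one VPI stereo disparity submission + sync.
}

void run_batch_with_threads(int numTasks)
{
    std::vector<std::thread> workers;
    workers.reserve(numTasks);
    for (int i = 0; i < numTasks; ++i)
        workers.emplace_back(submit_one_disparity, i);

    // Joining blocks until every thread finishes; with short-lived tasks the
    // thread create/join overhead ends up dominating the useful work.
    for (auto &t : workers)
        t.join();
}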

Questions:

  1. Is there a more effective way to batch or parallelize the VPI stereo disparity pipeline to utilize the full 64GB of memory?
  2. Are there any best practices or recommended design patterns for scaling VPI workloads to this extent on the Jetson Orin?

I would appreciate any insights or suggestions from the community on how to better approach this optimization problem.

Hi,

You can find some tips below:

Run the algorithm in batches and measure its average running time within each batch. The number of calls per batch scales with the algorithm's approximate running time (the faster the algorithm, the larger the batch, up to a maximum of 100 calls). This keeps the time spent performing the measurement itself out of the reported algorithm runtime.

For example:
In our 05_benchmark sample, we submit a batch of VPI tasks and then synchronize only once at the end, which increases performance.

// Record stream queue when we start processing
CHECK_STATUS(vpiEventRecord(evStart, stream));
  
// Get the average running time within this batch.
for (int i = 0; i < AVERAGING_COUNT; ++i)
{
    // Call the algorithm to be measured.
    CHECK_STATUS(vpiSubmitGaussianFilter(stream, backend, image, blurred, 5, 5, 1, 1, VPI_BORDER_ZERO));
}
  
// Record stream queue just after blurring
CHECK_STATUS(vpiEventRecord(evStop, stream));
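Applying the same pattern to your workload would mean submitting many stereo disparity estimations to one stream and synchronizing only once per batch. Below is a minimal sketch in the same setting as the snippet above; it assumes a stereo payload, parameters, and the input/output images (stereoPayload, stereoParams, left, right, disparity) were created beforehand, and those identifiers are illustrative rather than taken from the sample:

// Sketch only: stream, evStart/evStop, stereoPayload, stereoParams,
// left, right and disparity are assumed to exist already.
CHECK_STATUS(vpiEventRecord(evStart, stream));

// Queue a whole batch of stereo disparity estimations; submissions are
// asynchronous, so the CPU is not blocked between calls.
for (int i = 0; i < AVERAGING_COUNT; ++i)
{
    CHECK_STATUS(vpiSubmitStereoDisparityEstimator(stream, VPI_BACKEND_CUDA, stereoPayload,
                                                   left, right, disparity, NULL, &stereoParams));
}

// Synchronize once for the whole batch and average the elapsed time.
CHECK_STATUS(vpiEventRecord(evStop, stream));
CHECK_STATUS(vpiEventSync(evStop));

float elapsedMS = 0.0f;
CHECK_STATUS(vpiEventElapsedTimeMillis(evStart, evStop, &elapsedMS));
printf("Average time per call: %f ms\n", elapsedMS / AVERAGING_COUNT);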

Thanks.
