VIC performance

user19458 · January 17, 2024, 8:05am

Hi,
I’m using a Jetson Orin 64GB with VPI 2.0.
I’m trying to understand whether remapping with VPI on the VIC backend will be efficient in our case, but I got a bit lost reading the performance table in the Remap section.

In the current test I use one stream per camera and run them one after the other (2 cameras/images).
The first camera outputs 3848x1606 and the second outputs 1920x1280.
Each image is BGR, but I convert it to BGRA (due to VIC limitations), then use the VPI stream to convert it to NV12_ER and run the Remap (I will try to omit the conversions later on).

Questions:

1. According to the Remap performance table (Orin, NV12_ER, VIC, 1 stream), I should see ~3.8 ms for the remap operation at 1920x1280. Is that correct?
2. I didn’t understand the Interval column in the table. What does it stand for?
3. I read in VPI’s performance section that you measured the performance after using clocks.sh to max out the VIC clock. Should I use it as well? What is the cost of running the script on every startup?
4. Can I assume that the VIC performance shown in the table scales proportionally with the pixel count for other image sizes?

Thanks in advance,
Barak

AastaLLL · January 18, 2024, 2:34am

Hi,

1. Based on the VPI 2.0 documentation, a 1920x1080 remap on VIC takes around 3.691~3.71 ms.

2. Interval indicates the “control point interval (density)”.

3. Yes, the clock setting can boost VIC performance.
It’s a valid setting, so it does no harm to the device.

4. I suppose so.

Thanks.
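
As a rough illustration of answer 4, the reference figure can be scaled by the ratio of pixel counts. This is only a sketch: `estimate_ms` is a hypothetical helper (not part of VPI), and it assumes the cost is strictly linear in pixel count, which the reply above only tentatively confirms.

```cpp
#include <cassert>

// Hypothetical helper (not a VPI function): scale a reference timing by the
// ratio of pixel counts. Assumes run time is proportional to the number of
// pixels, which is an assumption rather than a documented guarantee.
double estimate_ms(double ref_ms, int ref_w, int ref_h, int w, int h)
{
    assert(ref_w > 0 && ref_h > 0 && w > 0 && h > 0);
    const double ref_pixels = static_cast<double>(ref_w) * ref_h;
    const double pixels     = static_cast<double>(w) * h;
    return ref_ms * pixels / ref_pixels;
}
```

For example, `estimate_ms(3.7, 1920, 1080, 1920, 1280)` scales the roughly 3.7 ms figure quoted for 1920x1080 to the 1920x1280 camera, and `estimate_ms(3.7, 1920, 1080, 3848, 1606)` does the same for the larger one. Treat the results as ballpark numbers to compare measurements against, not as expected values.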

user19458 · January 18, 2024, 9:24am

Thank you for your response.
I switched to the CUDA backend to try to reach the performance in the table, but that also fell short.
For a 1920x1280 RGB image I got 2.78 ms, which is far from the values in the table.
I also ran clocks.sh --max, as described in the profiling section.
Any idea what could explain the difference?

Thanks,
Barak

AastaLLL · January 24, 2024, 7:04am

Hi,

Please check our document below:
https://docs.nvidia.com/vpi/2.0/algo_performance.html#benchmark

The values are measured over batches of calls, after some warm-up.
Did you profile it with a similar approach?

An example can be found in /opt/nvidia/vpi2/samples/05-benchmark/.

Thanks.
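
A minimal sketch of the batch-plus-warm-up idea mentioned above, using only the C++ standard library. `benchmark_ms` is a hypothetical helper, not the code from the VPI sample, and taking the median of the batch averages is an assumption made here to reduce the effect of outliers.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical helper (not part of VPI): run `work` a few times untimed,
// then time it in batches and return the median per-call time in ms.
double benchmark_ms(const std::function<void()> &work, int warmup, int batches, int batch_size)
{
    assert(warmup >= 0 && batches > 0 && batch_size > 0);
    using clock = std::chrono::steady_clock;

    // Warm-up calls are executed but not timed.
    for (int i = 0; i < warmup; ++i)
    {
        work();
    }

    std::vector<double> per_call_ms;
    per_call_ms.reserve(batches);
    for (int b = 0; b < batches; ++b)
    {
        const auto start = clock::now();
        for (int i = 0; i < batch_size; ++i)
        {
            work();
        }
        const auto stop = clock::now();
        const std::chrono::duration<double, std::milli> elapsed = stop - start;
        // Average time of one call within this batch.
        per_call_ms.push_back(elapsed.count() / batch_size);
    }

    // Median of the batch averages (upper middle element for even counts).
    const auto mid = per_call_ms.begin() + per_call_ms.size() / 2;
    std::nth_element(per_call_ms.begin(), mid, per_call_ms.end());
    return *mid;
}
```

Because this measures host wall-clock time, the callable passed as `work` has to block until the operation has actually finished. With an asynchronous API such as a VPI stream, that would mean waiting for the stream inside `work`, or timing with stream events instead, as the sample does. Any per-iteration setup, such as wrapping buffers, should stay outside `work` if only the algorithm itself is being measured.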

user19458 · January 24, 2024, 7:47am

I’m running 100 iterations; each iteration wraps a cv::Mat in a VPIImage object using vpiImageSetWrappedOpenCVMat. The VPI_EXCLUSIVE_STREAM_ACCESS flag doesn’t seem to work in this case.
I ran it several times.
The best duration I achieved was 2.25 ms for a 1920x1280 RGB image.

AastaLLL · January 25, 2024, 6:32am

Hi,

Please benchmark the algorithm directly.
The OpenCV wrapper adds some extra I/O latency.

For example:

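// Mark the start of the timed region on the stream.
CHECK_STATUS(vpiEventRecord(evStart, stream));

// Submit only the algorithm itself, AVERAGING_COUNT times, with no wrapping inside the loop.
for (int i = 0; i < AVERAGING_COUNT; ++i)
{
    CHECK_STATUS(vpiSubmitGaussianFilter(stream, backend, image, blurred, 5, 5, 1, 1,
                                         VPI_BORDER_ZERO));
}

// Mark the end of the timed region on the stream.
CHECK_STATUS(vpiEventRecord(evStop, stream));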

Thanks.

system · February 14, 2024, 7:00am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.