Very slow performance of blur using VPI

Hello,
We want to use VPI package for utilizing the PVA on our Jetson Xavier AGX for some algorithms that are currently using OpenCV.
We followed your VPI blur tutorial and adjusted the syntax to support our use case and our older VPI version (0.4).
Unfortunately, the blur with VPI is extremely slow: between 15 ms and 50 ms, and sometimes even 150 ms. These timings were measured with the same input image used over and over again. OpenCV's blur was much faster on the same image size, so there must be some issue with the way we use VPI here.
I attach here the code for the function that calls the blur of VPI:
lowPassVpi.cc (4.4 KB)
it’s very similar to the code in your tutorial.
Our input image is already a 1-channel image (a channel from LAB colorspace).
Note: the slow performance also occurs when we use other backends such as CUDA or CPU, but we prefer to use the PVA backend for the VPI algorithms.

What is causing the slow performance of the VPI blur, and how can it be solved?

GPU Type : Xavier
Nvidia Driver Version : Package:nvidia-jetpack, Version: 4.4.1-b50
VPI version : 0.4
CUDA Version : 10.2.89
CUDNN Version : 8.0.0
Operating System + Version : Ubuntu 18.04

Hi,

You can find the expected boxfilter performance result below:

https://docs.nvidia.com/vpi/algo_box_filter.html#algo_box_filter_perf

For a u8 1920x1080 image with a 5x5 kernel, the blur should take <1 ms (CPU or GPU) or ~1.3 ms (PVA).
This is much faster than the numbers you shared.

A possible reason is that the device clocks are not maximized; VPI has its own device boosting script.
Could you apply the script and see if it helps?
https://docs.nvidia.com/vpi/algo_performance.html#maxout_clocks

Thanks.

Hi,
I ran the script you shared before running the code that uses VPI (the function I shared). It helped reduce the time a bit, but it's still not close to your benchmark:
PVA: 25-50 ms
CPU/CUDA: 5-10 ms
My image size is 224x1408 with 1 channel (the "A" channel from the LAB colorspace), so the image I use actually has fewer pixels than the size you quoted. I also used a 5x5 kernel for the blur.
Is there anything else that needs to be adjusted to get faster VPI runtimes?
Also, an important note: we run VPI inside a container on the Jetson. Is there any additional optimization needed when code that uses VPI runs in a container on the Jetson?

Hi,

What kind of VPI image type do you use?
Do you map it from a host buffer?

Thanks.

Hi,
We are running the nvcr.io/nvidia/l4t-base:r32.4.3 image, mounting /opt/nvidia as a volume inside the container.

Hi,

How do you create VPIImage?
Have you used vpiImageCreateHostMemWrapper, vpiImageCreateCUDAMemWrapper or other wrappers?

Thanks.

Hi,
We use vpiImageCreateHostMemWrapper and vpiImageCreate, in the same way the original VPI blur tutorial does.
The exact calls can be seen on lines 43 and 48 of the code I attached to the first post (see lowPassVpi.cc above).
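Roughly, the setup looks like this (a simplified sketch of what the attached file does; the field and enum names below follow the current public VPI docs, so the 0.4 headers may name them slightly differently, e.g. VPI_IMAGE_TYPE_U8 instead of VPI_IMAGE_FORMAT_U8):

```cpp
// Simplified sketch of the image setup in lowPassVpi.cc.
// Note: names follow the current VPI docs; VPI 0.4 uses older
// VPI_IMAGE_TYPE_U8-style enums, so adapt to the installed headers.
#include <vpi/Image.h>
#include <opencv2/core.hpp>
#include <cstring>

void createVpiImages(const cv::Mat &channelA, VPIImage *input, VPIImage *output)
{
    // Describe the existing 1-channel host buffer (the "A" channel of LAB).
    VPIImageData data;
    std::memset(&data, 0, sizeof(data));
    data.format               = VPI_IMAGE_FORMAT_U8;
    data.numPlanes            = 1;
    data.planes[0].width      = channelA.cols;
    data.planes[0].height     = channelA.rows;
    data.planes[0].pitchBytes = static_cast<int32_t>(channelA.step[0]);
    data.planes[0].data       = channelA.data;

    // Wrap the host buffer so VPI can access it without copying.
    vpiImageCreateHostMemWrapper(&data, 0, input);

    // Let VPI allocate the output image itself.
    vpiImageCreate(channelA.cols, channelA.rows, VPI_IMAGE_FORMAT_U8, 0, output);
}
```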

I'm also attaching the code of the VPI blur tutorial for version 0.4, since the link in my previous comment shows the syntax for VPI 1.0, and we need to use version 0.4 because that's what we have installed:
tutorial_vpi_blur_main.cpp (6.2 KB)

Hi,

It seems that your source doesn't follow our benchmark guidance.
You can find the information in our document below:
https://docs.nvidia.com/vpi/algo_performance.html#benchmark

For example, you will need to add some warm-up time and measure the average performance over many runs rather than timing a single call.
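As a rough illustration only (not the exact benchmark code; the vpiSubmitBoxFilter arguments below follow the current VPI docs and may differ slightly in VPI 0.4, and the stream and images are assumed to be created beforehand), the measurement would look something like this:

```cpp
// Rough sketch of the measurement described above: warm up first,
// then average over a batch of submissions instead of timing one call.
#include <chrono>
#include <cstdio>
#include <vpi/Stream.h>
#include <vpi/Image.h>
#include <vpi/algo/BoxFilter.h>

void benchmarkBoxFilter(VPIStream stream, VPIImage input, VPIImage output)
{
    // Warm-up: run the algorithm for ~1 second before measuring.
    auto warmupEnd = std::chrono::steady_clock::now() + std::chrono::seconds(1);
    while (std::chrono::steady_clock::now() < warmupEnd)
    {
        vpiSubmitBoxFilter(stream, VPI_BACKEND_PVA, input, output, 5, 5, VPI_BORDER_ZERO);
        vpiStreamSync(stream);
    }

    // Measure the average over a batch of calls.
    const int batch = 100;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < batch; ++i)
    {
        vpiSubmitBoxFilter(stream, VPI_BACKEND_PVA, input, output, 5, 5, VPI_BORDER_ZERO);
    }
    vpiStreamSync(stream); // wait for the whole batch to finish
    auto t1 = std::chrono::steady_clock::now();

    double avgMs = std::chrono::duration<double, std::milli>(t1 - t0).count() / batch;
    std::printf("average box filter time: %.3f ms\n", avgMs);
}
```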
Thanks.

Sorry, but it is ridiculous.
We are using the blur function in a while loop thousands of times, so it is obviously not a warm-up issue.
Anyhow, we are a real-time company; the average time is of little use to us if a single blur can take more than 20 milliseconds!
Can you share your worst-case measurements with us?
Thanks,
Livne.

Hi,
Did you follow the VPI - Vision Programming Interface: Performance Benchmark guidelines:
"

  1. All payloads, inputs and output memory buffers are created beforehand.
  2. One second warm-up time running the algorithm in a loop.
  3. Run the algorithm in batches and measure its average running time within each batch. The number of calls in a batch varies with the approximate running time (faster algorithms, larger batch, max 100 calls). This is done to exclude the time spent performing the measurement itself from the algorithm runtime.
  4. Perform item 3 for at least 5s, making sure that we do it at least 10 times.
  5. From all average running times of each batch, we exclude the 5% lowest and highest values.
  6. From the result set, we take the median. This is the value used as the final run time for the algorithm.", as well as the clocks and power mode sections that follow?

Thanks

Hi,
As we said, we run the VPI algorithm in a loop on a set of images, so that counts as a warm-up. The timings we reported stay in the same order of magnitude (25-50 ms for PVA) even after a while.
The rest of the benchmark guidelines are followed as well, as can be seen in the sample code we shared (input/output allocations done beforehand, etc.). The median time is still ~25 ms.
We also maximize the power settings before running the code by running the script the guidelines suggest.
We still suspect this is related to VPI being run inside a container; the container image we use is described in one of Livne's replies above.

Hi,

Sorry for the late update.

Have you tested it outside of the container?
Could you test it and see if there is any difference in performance?

By the way, we do expect the blurring task to be submitted in a loop.
However, in the example shared on Apr 6, the sample runs vpiSubmitBoxFilter only once.

Thanks.

Hi
Our use case is to run it inside the container, so trying it outside the container is not relevant for us. We would like to get the same performance when running inside the container.

Regarding the example I shared: we don't share the whole code of our system; I only shared the relevant part that uses VPI. The "outer code" that calls VPI runs in an infinite loop.

From the example you shared, it seems you are allocating new VPI images on every LowPassVPI call. Creating the wrapper once with vpiImageCreateHostMemWrapper and then reusing it with vpiImageSetWrappedHostMem should be more efficient. Also, don't set flags that enable all backends if you only intend to use the PVA, as that will likely add some overhead for the GPU, etc.
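A minimal sketch of that structure, assuming the current VPI naming (VPI_BACKEND_PVA, VPI_IMAGE_FORMAT_U8, VPI_BORDER_ZERO), which you would adapt to the 0.4 API you have installed:

```cpp
// Sketch of the suggested structure: create the stream, the output image and
// the host-memory wrapper once, then only swap the wrapped pointer per frame.
// Names follow the current VPI docs and may differ slightly in VPI 0.4.
#include <vpi/Image.h>
#include <vpi/Stream.h>
#include <vpi/algo/BoxFilter.h>

struct BlurContext
{
    VPIStream stream = nullptr;
    VPIImage  input  = nullptr;  // wrapper around the caller's host buffer
    VPIImage  output = nullptr;
};

// One-time setup: restrict everything to the PVA backend instead of
// enabling all backends, and create the images only once.
void initBlur(BlurContext &ctx, const VPIImageData &firstFrame, int width, int height)
{
    vpiStreamCreate(VPI_BACKEND_PVA, &ctx.stream);
    vpiImageCreateHostMemWrapper(&firstFrame, VPI_BACKEND_PVA, &ctx.input);
    vpiImageCreate(width, height, VPI_IMAGE_FORMAT_U8, VPI_BACKEND_PVA, &ctx.output);
}

// Per-frame call: re-point the existing wrapper at the new host buffer
// instead of creating and destroying VPIImages on every call.
void blurFrame(BlurContext &ctx, const VPIImageData &frame)
{
    vpiImageSetWrappedHostMem(ctx.input, &frame);
    vpiSubmitBoxFilter(ctx.stream, VPI_BACKEND_PVA, ctx.input, ctx.output, 5, 5, VPI_BORDER_ZERO);
    vpiStreamSync(ctx.stream);
}
```

Creating and destroying VPIImage objects, or allocating them with all backends enabled, forces extra setup work on every call, which can easily dominate an algorithm that should only take about a millisecond.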