VPI timings

Hi,

I tried to measure the execution time of the Harris corner detector, and my numbers do not match those listed in the performance table mentioned here. Any suggestions on what could be going on?
The following is my output:

Image dimensions:768x512
backendType:2
NVMEDIA_ARRAY: 53, Version 2.1
NVMEDIA_VPI : 156, Version 2.3

Harris time: 1477.57
185 keypoints found

I have also uploaded the modified main.cpp file for this sample project.
main.cpp (9.3 KB)

Another note: I did run the clocks.sh script mentioned in the VPI documentation before performing the timing tests, and I used the kodim08.png image as input.

./vpi_sample_03_harris_corners cuda …/assets/kodim08.png

Hi,

Please maximize the device performance first.
VPI has its own boost script, which can be found here:
https://docs.nvidia.com/vpi/algo_performance.html#maxout_clocks

Thanks.

Hi @AastaLLL the results that I posted were after I ran the clocks.sh script that you mentioned.

Hi @AastaLLL, any update on this? I made sure the clocks were maximized before taking the measurements in my first post.

Hi,

Could you test it with vpiImageCreate rather than vpiImageCreateHostMemWrapper first?
When wrapping the image from a CPU buffer, the memory access time depends on the buffer location.
The performance listed in our document is generated with a GPU buffer input.

Thanks.

Do you mean use GpuMat, then create a blank image with vpiImageCreate, and then use vpiImageCreateCudaMemWrapper?

I tried just creating a blank image and running without any wrapper. This means the algorithm is running on a blank image. The time readouts are as follows:

Image size is 768x512 (kodim08.png).

1275.87
1463.39
1015.26
1122.69
950.752
1008.1
1095.71

Then I used GpuMat and created an image using vpiImageCreate. Then I used vpiImageCreateCudaMemWrapper. So now it is running on the actual image, and the times are as follows:

1414.05
1463.62
1137.22
1444.1
1644.93

Then I removed vpiImageCreate and just used vpiImageCreateCudaMemWrapper on the GpuMat. The times are as follows:

2066.05
1423.94
1283.52
1222.66
1275.58
1559.78
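
To compare the three configurations, it may help to reduce each list to a mean and a spread. Below is a minimal sketch that does this for the numbers posted above. The helper names `summarize` and `printSummary` are made up for illustration and are not part of VPI or the sample.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>

// Mean and sample standard deviation of a set of timings.
struct Stats {
    double mean;
    double stddev;
};

// Hypothetical helper: reduces one run of timings to mean and spread.
Stats summarize(const std::vector<double>& samples) {
    assert(samples.size() >= 2);  // sample std-dev needs at least two values
    double sum = 0.0;
    for (double s : samples) sum += s;
    const double mean = sum / samples.size();

    double squared = 0.0;
    for (double s : samples) squared += (s - mean) * (s - mean);
    const double stddev = std::sqrt(squared / (samples.size() - 1));
    return {mean, stddev};
}

// Timings copied verbatim from the three runs above.
const std::vector<double> kBlankImage = {1275.87, 1463.39, 1015.26, 1122.69,
                                         950.752, 1008.1,  1095.71};
const std::vector<double> kCreatePlusWrapper = {1414.05, 1463.62, 1137.22,
                                                1444.1,  1644.93};
const std::vector<double> kWrapperOnly = {2066.05, 1423.94, 1283.52,
                                          1222.66, 1275.58, 1559.78};

// Hypothetical helper: prints one summary line for a labelled run.
void printSummary(const char* label, const std::vector<double>& samples) {
    const Stats s = summarize(samples);
    std::printf("%-22s mean=%.2f stddev=%.2f (n=%zu)\n", label, s.mean,
                s.stddev, samples.size());
}
```

Calling `printSummary("blank image", kBlankImage)` and likewise for the other two lists gives means of roughly 1133, 1421 and 1472. The gap between the two wrapped configurations appears small compared to the spread within each run, so more iterations would be needed before reading much into it.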

Hi,

Is your time in microseconds (µs)?
If yes, it seems similar to the benchmark result:

0.8433±0.0008 ms

It’s expected that there is some overhead when wrapping a buffer from existing data.
The benchmark result only takes the algorithm performance into account.
It doesn’t include the latency from different memory types.

Thanks.

Yes, my time is in µs. However, my image size is 768x512, which is the kodim08 image in the assets folder. According to the benchmark, the image size was 1080p, so there is still a difference between the benchmark and what I am seeing.
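
To make the size difference concrete, here is a rough sketch that normalizes both numbers to time per pixel. It assumes the cost scales roughly linearly with pixel count, which may not hold for small images where fixed overhead dominates. The function names `nsPerPixel` and `printComparison` are made up for illustration.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: nanoseconds spent per pixel, given a time in
// microseconds and the image dimensions.
double nsPerPixel(double timeUs, int width, int height) {
    assert(width > 0 && height > 0);
    return timeUs * 1000.0 / (static_cast<double>(width) * height);
}

// Compares my measurement with the documented benchmark figure.
void printComparison() {
    // My measurement: 1477.57 us on the 768x512 image.
    const double measured = nsPerPixel(1477.57, 768, 512);
    // Documented benchmark: 0.8433 ms (843.3 us) on a 1920x1080 image.
    const double benchmark = nsPerPixel(843.3, 1920, 1080);
    std::printf("measured:  %.3f ns/pixel\n", measured);
    std::printf("benchmark: %.3f ns/pixel\n", benchmark);
}
```

A 1080p frame has about 5.3 times as many pixels as 768x512, so under this assumption my run is considerably slower per pixel than the benchmark, not just slower in absolute terms.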

Hi,

Would you mind generating a 1920x1080 image and giving it a try?

int width = 1920, height = 1080;
VPIImageFormat imgFormat = VPI_IMAGE_FORMAT_U8;
VPIImage image = NULL;

// Create an image with zero content, using default flags (0)
CHECK_STATUS(vpiImageCreate(width, height, imgFormat, 0, &image));

Thanks

Hello,

The following is the output when I created a 1920x1080 image as you suggested above.

NVMEDIA_ARRAY: 53, Version 2.1
NVMEDIA_VPI : 156, Version 2.3
Harris time: 1381.8
0 keypoints found

Note: This is the average of 30 iterations.
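
For reference, an average like this can be taken with a harness along the following lines. This is a minimal sketch, not the code from my main.cpp: the name `averageMicroseconds` and the warm-up count are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` a few times untimed (warm-up), then
// returns the average wall-clock time per call in microseconds.
template <typename Work>
double averageMicroseconds(Work&& work, int warmup, int iterations) {
    assert(warmup >= 0 && iterations > 0);

    // Warm-up calls are excluded so one-time setup cost does not skew the mean.
    for (int i = 0; i < warmup; ++i) work();

    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) work();
    const auto stop = Clock::now();

    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

It would be called as `averageMicroseconds(submitAndSync, 5, 30)`, where `submitAndSync` is a placeholder for a callable that submits the Harris algorithm and then waits for it to finish. With an asynchronous backend such as CUDA, the callable presumably has to include the synchronization step (for VPI, a call like vpiStreamSync); otherwise only the submission cost is measured.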

Here is the output of jetson_clocks --show:

SOC family:tegra194 Machine:Jetson-AGX
Online CPUs: 0-7
CPU Cluster Switching: Disabled
cpu0: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu1: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu2: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu3: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu4: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu5: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu6: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu7: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
GPU MinFreq=1377000000 MaxFreq=1377000000 CurrentFreq=1377000000
EMC MinFreq=204000000 MaxFreq=2133000000 CurrentFreq=2133000000 FreqOverride=1
Fan: speed=255
NV Power Mode: MAXN

Thanks.

Hi @AastaLLL, any updates on this?

Hi,

Sorry for the late update.
Confirmed that we can reproduce this issue on VPI v1.0 (JetPack4.5).
( 1.309437 ms for CUDA u8 format with size = 5 )

We are now checking this with our internal team.
Will share more information with you later.

Thanks.

Thank you, @AastaLLL. If you could let me know when you expect it to be resolved, that would be great. Appreciate it.

Hi,

Sorry that it took some time to discuss this issue internally.
Please find attached an example that reproduces the benchmark score.

We are also working on updating the sample in our documentation.

main.cpp (6.7 KB) Makefile (4.8 KB) clock.sh (1.8 KB)

$ sudo ./clock.sh --max
$ make
$ ./vpi_sample_05_timing cuda

We get ~0.850090 ms with a 1920 x 1080 input on the Xavier.

Thanks.