VPI very slow compared to OpenCV CPU

Hi,
I use the new VPI 1.1 with Python 3.6 on a Jetson Nano B01 with JetPack 4.6 (L4T 32.6.1).

My problem is that everything I run through VPI is much slower than OpenCV on the CPU (the preinstalled version).

At the moment I use the rescale algorithm 3x in my program, downscaling a 4K grayscale image.
With OpenCV on the CPU I get 0.7-2.1 ms per call, with VPI CUDA 16-23 ms, and with VPI CPU 8-14 ms.
So the current OpenCV CPU version is much faster than VPI, and even the VPI CPU backend is faster than VPI CUDA.
The overall framerate of my system confirms these per-algorithm speed differences.

Here part of my code:

    start_tick_vpi = time.perf_counter()
    if constants.USE_VPI:
        tmp_vpi = vpi.asimage(image_corrected).rescale((new_size[1], new_size[0]), backend=vpi.Backend.CUDA, interp=vpi.Interp.NEAREST)
        img_tmp = tmp_vpi.cpu()
    else:
        img_tmp = cv2.resize(image_corrected, (new_size[1], new_size[0]), interpolation=cv2.INTER_NEAREST)
    utility.print_timing(start_tick_vpi, "VPI rescale", console_print=True)
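
(For context, `utility.print_timing` is a small helper of mine; the exact implementation may differ, but roughly it does the following:)

```python
import time

def print_timing(start_tick, label, console_print=False):
    """Print and return the time elapsed since start_tick, in milliseconds."""
    elapsed_ms = (time.perf_counter() - start_tick) * 1000.0
    if console_print:
        print("{}: {:.2f} ms".format(label, elapsed_ms))
    return elapsed_ms

start = time.perf_counter()
time.sleep(0.01)  # stand-in for the rescale call being measured
elapsed = print_timing(start, "VPI rescale", console_print=True)
```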

I call vpi.clear_cache() once per frame in my main program; it makes no difference if I call it after each VPI use instead.
I tried your clocks.sh and jetson_clocks, with no difference in speed (I use an external 6 A PSU with full-speed mode).

When I use VPI I can see a drop in CPU usage, but the VPI commands are way too slow.

BR Erich

Hi,

In the VPI part, the measured time includes both buffer wrapping and scaling.
In the OpenCV case only the scaling function is measured (since the input is already a cv::Mat).

Would you mind rewriting the function as below:

    if constants.USE_VPI:
        tmp_vpi = vpi.asimage(image_corrected)

        start_tick_vpi = time.perf_counter()
        tmp_vpi = tmp_vpi.rescale((new_size[1], new_size[0]), backend=vpi.Backend.CUDA, interp=vpi.Interp.NEAREST)
        utility.print_timing(start_tick_vpi, "VPI rescale", console_print=True)

        img_tmp = tmp_vpi.cpu()
    else:
        start_tick_vpi = time.perf_counter()
        img_tmp = cv2.resize(image_corrected, (new_size[1], new_size[0]), interpolation=cv2.INTER_NEAREST)
        utility.print_timing(start_tick_vpi, "OpenCV resize", console_print=True)

This will give a fair comparison of the scaling performance.
It's expected that copying the buffer from CPU to GPU and moving it back adds some overhead.
But usually you can still get a performance gain via GPU acceleration.

The trade-off depends on the computational complexity of the work applied to the buffer.
You can find more information in the document below:
https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#porting-considerations
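
As a rough illustration (made-up numbers, not measurements): the CPU<->GPU copy is roughly a fixed cost per frame, so GPU offload only pays off once the kernel speedup outweighs it.

```python
# Illustrative break-even model with made-up timings (not measured):
# the GPU wins when transfer overhead + GPU kernel time < CPU time.
TRANSFER_MS = 6.0    # CPU<->GPU copy, roughly fixed per frame
GPU_KERNEL_MS = 1.0  # the operation itself is fast on the GPU

def gpu_pays_off(cpu_ms, transfer_ms=TRANSFER_MS, gpu_kernel_ms=GPU_KERNEL_MS):
    """Return True if offloading to the GPU is faster than staying on the CPU."""
    return transfer_ms + gpu_kernel_ms < cpu_ms

cheap_op = gpu_pays_off(2.0)   # e.g. nearest-neighbor resize: transfer dominates
heavy_op = gpu_pays_off(30.0)  # e.g. large-kernel filtering: copy is amortized
print(cheap_op, heavy_op)
```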

Thanks.

Hi @AastaLLL

I tried this and the pure computation time is much faster, thanks for the hint.
With CUDA it is now in the 1-5 ms range.
But why is the type conversion so slow? This is something you always have to do, especially when you mix a lot of image operations between NumPy, OpenCV, and VPI.
Is there a trick to do this faster?

BR Erich

Hi,

If you choose the GPU backend, the transfer performs a CPU-to-GPU memory copy.
This is bounded by the hardware bandwidth limit.

There are lots of VPI Image wrappers. You can check them for details:
https://docs.nvidia.com/vpi/group__VPI__CUDAInterop.html

vpiImageCreateXXXXWrapper
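
In Python, the practical equivalent is to wrap the NumPy buffer once, chain all GPU operations on the VPI image, and call `.cpu()` only once at the end, so the CPU<->GPU copy is paid once instead of per operation. A sketch, assuming the VPI 1.x Python bindings (`vpi.asimage`, `Image.rescale`, the `vpi.Backend.CUDA` context manager); it skips gracefully when VPI or a CUDA device is unavailable:

```python
# Sketch: chain VPI operations on the GPU, download once at the end.
try:
    import numpy as np
    import vpi

    frame = np.zeros((2160, 3840), dtype=np.uint8)  # dummy 4K grayscale frame

    with vpi.Backend.CUDA:
        img = vpi.asimage(frame)  # wrap the NumPy buffer (uploaded on first GPU use)
        # Two chained rescales: the intermediate result stays on the GPU.
        img = img.rescale((1920, 1080), interp=vpi.Interp.NEAREST)
        img = img.rescale((960, 540), interp=vpi.Interp.NEAREST)

    result = img.cpu()  # single GPU->CPU download at the very end
    status = "ok"
except Exception as exc:  # vpi not installed, or no CUDA device on this machine
    status = "skipped ({})".format(exc)
print(status)
```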

Thanks.

Sorry, I'm not a hardcore programmer, so this does not help me.
What is this information for, and how do I use it in Python to speed things up?

BR Erich

Hi,

Please test the following example:

test.py (582 Bytes)

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
$ python3 test.py

VPI takes ~7 ms, which is acceptable since it also transfers data to the GPU.

Thanks.