VPI very slow compared to OpenCV CPU

Hi,
I use the new VPI 1.1 with Python 3.6 on a Jetson Nano B01 with JetPack 4.6 (L4T 32.6.1).

My problem is that everything I run through VPI is much slower than OpenCV on the CPU (the preinstalled version).

At the moment I use the rescale algorithm 3x in my program, downscaling a 4K grayscale image.
With OpenCV on the CPU I get 0.7-2.1 ms per call, with VPI CUDA 16-23 ms, and with VPI CPU 8-14 ms.
So the current OpenCV CPU version is much faster than VPI, and even the VPI CPU backend is faster than VPI CUDA.
The overall framerate of my system confirms these per-algorithm speed differences.

Here part of my code:

    start_tick_vpi = time.perf_counter()
    if constants.USE_VPI:
        tmp_vpi = vpi.asimage(image_corrected).rescale((new_size[1], new_size[0]), backend=vpi.Backend.CUDA, interp=vpi.Interp.NEAREST)
        img_tmp = tmp_vpi.cpu()
    else:
        img_tmp = cv2.resize(image_corrected, (new_size[1], new_size[0]), interpolation=cv2.INTER_NEAREST)
    utility.print_timing(start_tick_vpi, "VPI rescale", console_print=True)
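
(For context, `utility.print_timing` is a small helper of mine; the exact implementation may differ, but roughly it does the following:)

```python
import time

def print_timing(start_tick, label, console_print=False):
    """Print and return the time elapsed since start_tick, in milliseconds."""
    elapsed_ms = (time.perf_counter() - start_tick) * 1000.0
    if console_print:
        print("{}: {:.2f} ms".format(label, elapsed_ms))
    return elapsed_ms

start = time.perf_counter()
time.sleep(0.01)  # stand-in for the rescale call being measured
elapsed = print_timing(start, "VPI rescale", console_print=True)
```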

I call vpi.clear_cache() once per frame in my main program; it makes no difference if I call it after each VPI use instead.
I tried your clocks.sh and jetson_clocks, with no difference in speed (I use an external 6 A PSU with full-speed mode).

When I use VPI I can see a drop in CPU usage, but the VPI commands are way too slow.

BR Erich

Hi,

In the VPI part, the measured time includes both buffer wrapping and scaling.
In the OpenCV case only the scaling function is measured (since the input is already a cv::Mat).

Would you mind rewriting the function as below:

    if constants.USE_VPI:
        tmp_vpi = vpi.asimage(image_corrected)

        start_tick_vpi = time.perf_counter()
        tmp_vpi = tmp_vpi.rescale((new_size[1], new_size[0]), backend=vpi.Backend.CUDA, interp=vpi.Interp.NEAREST)
        utility.print_timing(start_tick_vpi, "VPI rescale", console_print=True)

        img_tmp = tmp_vpi.cpu()
    else:
        start_tick_vpi = time.perf_counter()
        img_tmp = cv2.resize(image_corrected, (new_size[1], new_size[0]), interpolation=cv2.INTER_NEAREST)
        utility.print_timing(start_tick_vpi, "OpenCV resize", console_print=True)

This will give a fair comparison of the scaling performance.
It's expected that copying the buffer from CPU to GPU and moving it back adds some overhead.
But usually you can still get a performance gain via GPU acceleration.

The trade-off depends on the computational complexity of the work applied to the buffer.
You can find more information in the document below:
https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#porting-considerations
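
As a rough illustration (made-up numbers, not measurements): the CPU<->GPU copy is roughly a fixed cost per frame, so GPU offload only pays off once the kernel speedup outweighs it.

```python
# Illustrative break-even model with made-up timings (not measured):
# the GPU wins when transfer overhead + GPU kernel time < CPU time.
TRANSFER_MS = 6.0    # CPU<->GPU copy, roughly fixed per frame
GPU_KERNEL_MS = 1.0  # the operation itself is fast on the GPU

def gpu_pays_off(cpu_ms, transfer_ms=TRANSFER_MS, gpu_kernel_ms=GPU_KERNEL_MS):
    """Return True if offloading to the GPU is faster than staying on the CPU."""
    return transfer_ms + gpu_kernel_ms < cpu_ms

cheap_op = gpu_pays_off(2.0)   # e.g. nearest-neighbor resize: transfer dominates
heavy_op = gpu_pays_off(30.0)  # e.g. large-kernel filtering: copy is amortized
print(cheap_op, heavy_op)
```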

Thanks.

Hi @AastaLLL

I tried this and the pure computation time is much faster, thanks for the hint.
With CUDA it is now in the 1-5 ms range.
But why is the type conversion so slow? This is something you always have to do, especially when you mix a lot of image operations between NumPy, OpenCV, and VPI.
Is there a trick to do this faster?

BR Erich

Hi,

If you choose the GPU backend, the transfer performs a CPU-to-GPU memory copy.
This is bounded by the hardware bandwidth limit.

There are lots of VPI Image wrappers. You can check them for details:
https://docs.nvidia.com/vpi/group__VPI__CUDAInterop.html

vpiImageCreateXXXXWrapper
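
In Python, the practical equivalent is to wrap the NumPy buffer once, chain all GPU operations on the VPI image, and call `.cpu()` only once at the end, so the CPU<->GPU copy is paid once instead of per operation. A sketch, assuming the VPI 1.x Python bindings (`vpi.asimage`, `Image.rescale`, the `vpi.Backend.CUDA` context manager); it skips gracefully when VPI or a CUDA device is unavailable:

```python
# Sketch: chain VPI operations on the GPU, download once at the end.
try:
    import numpy as np
    import vpi

    frame = np.zeros((2160, 3840), dtype=np.uint8)  # dummy 4K grayscale frame

    with vpi.Backend.CUDA:
        img = vpi.asimage(frame)  # wrap the NumPy buffer (uploaded on first GPU use)
        # Two chained rescales: the intermediate result stays on the GPU.
        img = img.rescale((1920, 1080), interp=vpi.Interp.NEAREST)
        img = img.rescale((960, 540), interp=vpi.Interp.NEAREST)

    result = img.cpu()  # single GPU->CPU download at the very end
    status = "ok"
except Exception as exc:  # vpi not installed, or no CUDA device on this machine
    status = "skipped ({})".format(exc)
print(status)
```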

Thanks.

Sorry, I'm not a hardcore programmer, so this does not help me.
What is this information for, and how do I use it in Python to speed things up?

BR Erich

Hi,

Please test the following example:

test.py (582 Bytes)

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
$ python3 test.py

VPI takes ~7 ms, which is acceptable since it also transfers data to the GPU.

Thanks.