VPI 0.3.7 vpiImageWrapHostMem slow compared to algo

Hello everyone,

I’m currently developing part of an application that scales an RGB 1920x1080 image down to 608x608 using VPI 0.3.7 with the CUDA backend.
The app runs inside a Docker container on a Jetson Xavier 32GB with Max-N settings. For this, I modified the resampling sample to work with video streams (OpenCV cv::Mat). vpiImageCreate and the memset for the image data are called only once, in an init function.
As the VPI resampler can only work with single-channel data, I split the input frame using cv::split and process the channels in separate VPIImage objects. I added VPI events to measure the execution time.
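
For illustration, this is roughly what the channel split does for an 8-bit, 3-channel interleaved frame, written in plain C++ without OpenCV. `splitInterleaved3` is a hypothetical helper, and the sketch assumes tightly packed output planes; it is not meant to describe how cv::split is implemented internally.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: copies an interleaved 3-channel, 8-bit image
// (e.g. BGR as stored by OpenCV) into three tightly packed planes.
// rowStride is the distance in bytes between the starts of two source rows.
std::array<std::vector<std::uint8_t>, 3> splitInterleaved3(
    const std::uint8_t* src, int width, int height, std::size_t rowStride)
{
    std::array<std::vector<std::uint8_t>, 3> planes;
    for (auto& p : planes)
        p.resize(static_cast<std::size_t>(width) * height);

    for (int y = 0; y < height; ++y) {
        const std::uint8_t* row = src + y * rowStride;
        for (int x = 0; x < width; ++x) {
            const std::size_t dst = static_cast<std::size_t>(y) * width + x;
            planes[0][dst] = row[3 * x + 0];  // first channel (B for BGR)
            planes[1][dst] = row[3 * x + 1];  // second channel (G)
            planes[2][dst] = row[3 * x + 2];  // third channel (R)
        }
    }
    return planes;
}
```

Each resulting plane has a row stride equal to its width, which is the value that would go into the rowStride field when describing the plane to VPI.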

“Uploading” the three channels takes 15-20 ms using

        std::vector<cv::Mat> channels(3);
        cv::split(frame, channels);

        CHECK_STATUS(vpiEventRecord(evStart, stream_b));
        imgData_b.type = VPI_IMAGE_TYPE_U8;
        imgData_b.numPlanes = 1;
        imgData_b.planes[0].width = channels[0].cols;
        imgData_b.planes[0].height = channels[0].rows;
        imgData_b.planes[0].rowStride = channels[0].step[0];
        imgData_b.planes[0].data = channels[0].data;

        CHECK_STATUS(vpiImageWrapHostMem(& imgData_b, 0, &image_b));

resampling ~1.5ms

        CHECK_STATUS(vpiSubmitImageResampler(stream_b, image_g, scaled_g, VPI_INTERP_NEAREST, VPI_BOUNDARY_COND_ZERO));
        CHECK_STATUS(vpiSubmitImageResampler(stream_b, image_b, scaled_b, VPI_INTERP_NEAREST, VPI_BOUNDARY_COND_ZERO));
        CHECK_STATUS(vpiSubmitImageResampler(stream_b, image_r, scaled_r, VPI_INTERP_NEAREST, VPI_BOUNDARY_COND_ZERO));

and 0.2-0.5 ms for “downloading” using

            CHECK_STATUS(vpiImageLock(scaled_b, VPI_LOCK_READ, &outData_b));
            cv::Mat cvOut(outData_b.planes[0].height, outData_b.planes[0].width, CV_8U, outData_b.planes[0].data,

The algorithm runs about as fast as stated in the documentation, and the time for “downloading” seems reasonable.

But why is vpiImageWrapHostMem so slow? Isn’t it supposed to be called every loop iteration? Is there a more efficient way to process a continuous stream of images?

Or are there any examples that describe how to process video streams efficiently?

Thanks and Regards


In case you don’t know, have you maximized the device clock first?

$ sudo jetson_clocks

We would like to reproduce this issue in our environment.
Would you mind sharing a simple, complete sample with us first?


Hi AastaLLL,

I created a timing test from the resample and timing samples.



Like the resample demo, it reads a grayscale image from disk. It then runs the “upload” (vpiImageWrapHostMem), the resample (VPI_INTERP_NEAREST) and the “download” (vpiImageLock) in a loop for 1000 iterations. The code has a memory leak that I have not been able to locate yet, so running it for 10k iterations causes an out-of-memory error.
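
The per-stage averages reported below could be collected with a small accumulator like the following. This is only a sketch: it uses std::chrono wall-clock timing instead of the VPI events used in the actual test, and `StageTimer` is a hypothetical name. For an asynchronous backend, the measured function would have to include a synchronization point, otherwise only the submission cost is timed.

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: accumulates wall-clock time per named stage and
// reports the average in milliseconds over all recorded samples.
class StageTimer {
public:
    using Clock = std::chrono::steady_clock;

    // Runs fn() once and adds its duration to the given stage.
    template <typename Fn>
    void measure(const std::string& stage, Fn&& fn) {
        const auto start = Clock::now();
        fn();
        const auto stop = Clock::now();
        add(stage,
            std::chrono::duration<double, std::milli>(stop - start).count());
    }

    // Adds an already measured duration (in ms) to the given stage.
    void add(const std::string& stage, double ms) {
        Entry& e = entries_[stage];
        e.totalMs += ms;
        ++e.samples;
    }

    // Average duration in ms, or 0.0 if the stage has no samples.
    double averageMs(const std::string& stage) const {
        const auto it = entries_.find(stage);
        if (it == entries_.end() || it->second.samples == 0)
            return 0.0;
        return it->second.totalMs / static_cast<double>(it->second.samples);
    }

private:
    struct Entry {
        double totalMs = 0.0;
        std::size_t samples = 0;
    };
    std::map<std::string, Entry> entries_;
};
```

In the loop, each stage would be wrapped in a call such as timer.measure("upload", ...) with a lambda containing the stage's work, and the averages printed once after the last iteration.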

[JP 4.4, native without docker]

I then executed:

sudo jetson_clocks

For the CPU backend I get:

./warp_timing cpu /data/test.png

NVMEDIA_ARRAY: 53, Version 2.1
NVMEDIA_VPI : 156, Version 2.3
Num samples = 1000
avg. Upload -> : 3.834107 ms
avg. Resample -> : 0.496770 ms
avg. Download-> : 0.102679 ms
avg. Total -> : 4.433556 ms

For the CUDA backend I get: [Edit: corrected CUDA timing]

./warp_timing cuda /data/test.png

NVMEDIA_ARRAY: 53, Version 2.1
NVMEDIA_VPI : 156, Version 2.3
Num samples = 1000
avg. Upload -> : 3.841655 ms
avg. Resample -> : 2.843834 ms
avg. Download-> : 0.388764 ms
avg. Total -> : 7.074254 ms

Without jetson_clocks, it is around 20% slower.

I also took a look at the 09-tnr sample, which works with video data. In this example, vpiImageWrapHostMem is also called every iteration. But since vpiImageLock has a write mode, I wondered whether the buffer can be accessed for writing in the same way as for reading, that is, calling vpiImageWrapHostMem only during initialization and then updating the buffer directly via vpiImageLock.


Thanks for the sample.
We will let you know once we have made progress.



We can reproduce the performance problem of vpiImageWrapHostMem and have reported it to our internal team.
The memory leak is still under investigation.

If your memory buffer doesn’t change, vpiImageWrapHostMem can be called just once at the beginning.
In general, however, the buffer pointer will differ between frames when using the OpenCV interface.
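
One way to keep the buffer address fixed, so that the wrap could be done only once, might be to allocate a plane up front and copy every new channel into it. The sketch below shows just that part in plain C++; `PersistentPlane` is a hypothetical helper, the VPI calls are left out, and it is an assumption (not confirmed in this thread) that VPI accepts in-place updates of a wrapped buffer, presumably only while the image is locked for writing. Whether the extra copy is cheaper than re-wrapping would have to be measured.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: a single-channel, 8-bit plane that is allocated once
// so that its data pointer stays the same for every frame.
class PersistentPlane {
public:
    PersistentPlane(int width, int height)
        : width_(width), height_(height),
          data_(static_cast<std::size_t>(width) * height) {}

    // Copies a frame into the plane row by row. srcStride is the distance in
    // bytes between two source rows and may be larger than the width.
    void update(const std::uint8_t* src, std::size_t srcStride) {
        for (int y = 0; y < height_; ++y) {
            std::memcpy(data_.data() + static_cast<std::size_t>(y) * width_,
                        src + y * srcStride,
                        static_cast<std::size_t>(width_));
        }
    }

    std::uint8_t* data() { return data_.data(); }
    int width() const { return width_; }
    int height() const { return height_; }
    // Rows are tightly packed, so the stride equals the width in bytes.
    std::size_t rowStride() const { return static_cast<std::size_t>(width_); }

private:
    int width_;
    int height_;
    std::vector<std::uint8_t> data_;
};
```

The values returned by data(), width(), height() and rowStride() would be the ones passed to the wrap call during initialization; in the loop, only update() is called with the current channel's data pointer and step.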



The memory leak probably comes from the cvOut that is created in each iteration.
Would you mind checking that first?


Hi AastaLLL,

I already tested whether the memory leak comes from cvOut by simply commenting it out.
But that didn’t solve the problem. I also experimented by commenting out most of the other code, and the leak seems to come from vpiImageWrapHostMem.



We got a reply from our internal team.

There is an optimization issue in our vpiImageWrapHostMem.
This will be fixed in our next release.

For now, you can set the flag to VPI_BACKEND_ONLY_CPU as a workaround.

diff --git a/main.cpp b/main.cpp
index 596cc04..554968b 100644
--- a/main.cpp
+++ b/main.cpp
@@ -143,7 +143,7 @@ int main(int argc, char *argv[])
                 // Wrap it into a VPIImage. VPI won't make a copy of it, so the original
                 // image must be in scope at all times.
-                CHECK_STATUS(vpiImageWrapHostMem(&imgData, 0, &image));
+                CHECK_STATUS(vpiImageWrapHostMem(&imgData, VPI_BACKEND_ONLY_CPU, &image));
             CHECK_STATUS(vpiEventRecord(evUpload, stream));

$ ./warp_timing cpu test.png
NVMEDIA_ARRAY: 53, Version 2.1
NVMEDIA_VPI : 156, Version 2.3
Num samples = 1000
avg. Upload -> : 0.216582 ms
avg. Resample -> : 0.173239 ms
avg. Download-> : 0.070388 ms
avg. Total -> : 0.460208 ms

Although the performance is not as good as it should be, it is much better than before: the average upload time in the CPU run drops from about 3.8 ms to about 0.2 ms.

For the memory leak, let us check further and we will update you with more information.