VPI: Unable to write to VPI_ARRAY_TYPE_U8 on jetson device

Hello,

I think I may have found a bug. I was lightly modifiying the pyramidal LK optical flow example to track points between current and previous frames following this tutorial:

https://docs.nvidia.com/vpi/2.1/sample_optflow_lk.html

When I execute this on an x86 machine I get no problems using VPI 2.0 or VPI 2.2 however on a jetson NX ( 5.0.2 using an ngc container version 35.1) in a l4t-jetpack container using VPI 2.1.6 I cannot write to the status array at the end of my function ready for the next loop.

This is problematic as the status array is a VPI_ARRAY_TYPE_U8 (8-bit array) indicating 0 or 1 if the point is tracked or not. For the next frame all values need to be set back to 0 so all new incoming points are tracked.

However, I cannot write 0 to the status array for the new incoming points. I try like this:

    VPIArrayData statusBuff; 
    CHECK_VPI(vpiArrayLockData(featStatus, VPI_LOCK_WRITE, VPI_ARRAY_BUFFER_HOST_AOS, &statusBuff));

    uint8_t* featStatusArray = (uint8_t*)statusBuff.buffer.aos.data;

    for (int i=0; i<numPoints; i++)
    {
        featStatusArray[i] = 0; 
    }

    CHECK_VPI(vpiArrayUnlock(featStatus));

And also memsetting the pointer like this:


    memset(featStatusArray, 0, maxCorners);

On an x86 machine the array is written to, on a jetson device it is not so remains 1, etc. which means the coordinate at that index position is no longer tracked. A memset function in the VPI API would be very useful.

I have no problems writing or reading to arrays of type VPI_ARRAY_TYPE_F32 but need a VPI_ARRAY_TYPE_U8 for the status as shown here: VPI - Vision Programming Interface: Pyramidal LK Optical Flow

Please could you describe to me whether this is a bug or I am doing something wrong. Equally, if someone could give me more information if I need different backend settings, etc due to the different memory layout (shared CPU and GPU memory) compared to an ordinary PC.

At the moment it seems like the memory is copied into a VPIArrayData object for me to access upon locking and copied back (hopefully) when it is unlocked, however this can’t be the case for U8.

Also, if someone could give me direction if it makes and difference what type I cast the pointer to. I have tried unsigned char* and uint8_t as per the example. I also tried memsetting it as a void* pointer as I am trying to set all values (and bytes) to zero.

The only way around this I have found is to destroy the status buff and reallocate it (vpiArrayCreate) every frame (to get a clean array of zeros), however this is not an optimal solution and I eventually get a VPI OOP error despite destroying it at every time afterwards.

Thanks, Tom.

I have also read this similar answer but have found no inspiration: Editing data on a VPIArray

Hi,

This sounds like an issue but we need to check it further to confirm.
Could you share a simple complete reproducible source so we can test it in our environment?

Thanks.

Hi AastaLLL,

Thanks for getting back to me. Apologies for the delay. I have managed to produce a simple (ish!) example for you using some frames from the dashcam video supplied in the vpi examples and combining the LK optical tracker and Harris corner examples together to match a similar implementation to my own.

I have made a repository here with 20 test frames and some results:

I hope I have included a thorough-enough explanation of the code and results in the README.

Essentially, I am using the Harris corners API of the VPI library to generate some keypoints on each image frame and track them to the next frame using the LK optical flow implementation.

I can tell which have been tracked in the featStatus array. If the featStatus array is zeroed for the next iteration (all points to be tracked) then this should give tracked results for the next frame with a new set of points independent of whatever was in the previous frame.

The alternative is to create and destroy the featStatus array for each frame which is also demonstrated. If the buffer is zeroed as hoped then it should provide identical results to creating and destroying the buffer each time. This is the case when using an x86-64 machine but not on a jetson NX (I have tested 35.1 but also jetpack 5.1 (is that 35.2) both in and outside a container using VPI 2.1.6.

I believe this is a bug either in the Jetson VPI library implementation or some kind of memory management bug when the featStatus U8 buffer is unlocked. If you could give assistance of if this could be fixed or a work around that would be most useful.

Creating and destroying the buffer everytime is not a viable option as after so long I get a VPI_OUT_OF_MEMORY error.

Please ask if you have any questions.

To elaborate slightly on my first post. While making the example I have found that when I write 0 to each element in the featStatus array after use, if the array is locked again and read from it will show that each element = 0. I have a little function called debugCheckStatusHasBeenSetToZero() to show this.

However, if this was the case then the results would be identical between createDestroyStatusArray() and reuseStatusArray(), however they are not on a jetson device which means the values in featStatus must not have been written to previously.

Hi,

Thanks for the reproducible sample.

We are going to reproduce this issue internally.
Will share more information with you later.

Thanks.

1 Like

Hi AastaLLL,

I guess you guys are still working things out. Any idea when you could give me an update?

Thanks,

Tom.

Hi,

Thanks for your patience.

It seem that there only contains the sample for VPI_ARRAY_TYPE_U8.
Could you also add the example for VPI_ARRAY_TYPE_F32 so we can compare the results.

Thanks.

Hi,

We tried to reproduce this issue with our 12-optflow_lk sample.
But the reset function works as expected in our environment.

Could you check if there is any difference between your implementation and our sample?

1.

Apply below diff to /opt/nvidia/vpi2/samples/12-optflow_lk

diff --git a/main.cpp b/main.cpp
index aa009dc..a9bb75b 100644
--- a/main.cpp
+++ b/main.cpp
@@ -168,7 +168,7 @@ static int UpdateMask(cv::Mat &cvMask, const std::vector<cv::Scalar> &trackColor
     const VPIArrayBufferAOS &aosStatus      = statusData.buffer.aos;
 
     const VPIKeypointF32 *pCurFeatures = (VPIKeypointF32 *)aosCurFeatures.data;
-    const uint8_t *pStatus             = (uint8_t *)aosStatus.data;
+    uint8_t *pStatus             = (uint8_t *)aosStatus.data;
 
     const VPIKeypointF32 *pPrevFeatures;
     if (prevFeatures)
@@ -204,6 +204,9 @@ static int UpdateMask(cv::Mat &cvMask, const std::vector<cv::Scalar> &trackColor
         }
     }
 
+    for (int i=0; i<aosStatus.capacity; i++)
+        pStatus[i] = 0;
+
     // We're finished working with the arrays.
     if (prevFeatures)
     {
@@ -212,6 +215,17 @@ static int UpdateMask(cv::Mat &cvMask, const std::vector<cv::Scalar> &trackColor
     CHECK_STATUS(vpiArrayUnlock(curFeatures));
     CHECK_STATUS(vpiArrayUnlock(status));
 
+
+    // check status reset
+    CHECK_STATUS(vpiArrayLockData(status, VPI_LOCK_READ, VPI_ARRAY_BUFFER_HOST_AOS, &statusData));
+    const VPIArrayBufferAOS &aosStatus2 = statusData.buffer.aos;
+    const uint8_t *pStatus2 = (uint8_t *)aosStatus.data;
+
+    for (int i=0; i<aosStatus.capacity; i++) {
+        if( pStatus2[i]!=0 ) std::cout << "Reset fail!" << std::endl;
+    }
+    CHECK_STATUS(vpiArrayUnlock(status));
+
     return numTrackedKeypoints;
 }

2. Build

$ cmake .
$ make
$ ./vpi_sample_12_optflow_lk cuda ../assets/dashcam.mp4 5 frame.png

There is no “Reset fail!” showing up which means the status array is all zeros.
Thanks.

Hi AastaLLL,

Thanks for getting back to me. Apologies for the slow reply.

I can see what you’ve tried, similar to me checking that elements in the status array = 0 once you have iterated through them to set them to 0 in the updateMask function.

I have a function to make a similar check called debugCheckStatusHasBeenSetToZero in my example.

The difference with my implementation (and what my example shows) is that I use the status buffer again to record tracked status for a new image with a newly-generated set of keypoints (the keypoints found in the current frame become the prevFeatures compared to in the next frame).

I have found when I do this, even though I have read all elements in the status array = 0, I get a different number (fewer) tracked points than I was expecting when I I create and destroy the status array to guarantee 0 values for the new image.

You can see this behaviour comparing the number of tracked points printed out for each image when reusing the buffer compared to creating/ destroying the buffer in my example.

Let me know what you think.

Hi,

Do you indicate that the buffer needs to be used again with vpi.OpticalFlowPyrLK to reproduce the issue?
We have re-wrapped the buffer, and it looks correct.

Thanks.

Hi AastaLLL, yes, please can you try and reuse the buffer with vpi.OpticalFlowPyrLK to reproduce. Please test the number of tracked points after each application of OpticalFlowPyrLK and compare to when the buffer is created and destroyed rather than reused. Then the issue will be apparent.

I have also tested myself reading and writing to the buffer without reusing it via OpticalFlowPyrLK and have been unable to reproduce the issue. Only when OpticalFlowPyrLK is used and the number of tracked points compared is this obvious.

Thanks, Tom.

Hi,

Sorry for the late reply.
We will give it a check and update here soon.

Thanks a lot for your patience.

Hi Aasta,

Thanks for getting back to me. It would be great for you to try it. Whoever ends up testing it, I would encourage them to have a go at using my example above as it is simple to compile and will show easily what the problem is (and there are some results to compare to).

Kind regards,

Tom.

Hi,

Sorry, it takes some time to debug this issue.

This is related to the synchronization mechanism in the VPI array buffer.

There are two backend buffers created for the VPI array, CPU and GPU buffer.
In your use case, the overwritten is applied to the CPU buffer.
But it is the GPU buffer that has been used for the next tracking job (the backend is GPU).

For now, please set zero on the GPU buffer if the GPU backend is chosen.
Or use the CPU backend if you want to set zero with the CPU.

Two possible WAR:

1. Using CPU buffer → change tracking backend to CPU.

diff --git a/VPI_Example.cpp b/VPI_Example.cpp
index 6ea7137..2e2447d 100644
--- a/VPI_Example.cpp
+++ b/VPI_Example.cpp
@@ -333,7 +333,7 @@ private:
     VPIImage inputImage; 
     VPIImage prevImage;
 
-    VPIBackend backend = VPI_BACKEND_CUDA; 
+    VPIBackend backend = VPI_BACKEND_CPU;
     VPIImageFormat format = VPI_IMAGE_FORMAT_U8;
 
     VPIPyramid pyrPrevFrame=NULL;

Output

$ ./VPI_Example 
_____REUSING STATUS ARRAY.  ZEROING STATUS BUFFER: 1_____
1: Point in prev array: 65 number of points tracked: 0
2: Point in prev array: 57 number of points tracked: 63
3: Point in prev array: 62 number of points tracked: 55
4: Point in prev array: 56 number of points tracked: 61
5: Point in prev array: 57 number of points tracked: 54
6: Point in prev array: 43 number of points tracked: 56
7: Point in prev array: 44 number of points tracked: 43
8: Point in prev array: 36 number of points tracked: 44
9: Point in prev array: 28 number of points tracked: 36
10: Point in prev array: 25 number of points tracked: 26
11: Point in prev array: 21 number of points tracked: 22
12: Point in prev array: 12 number of points tracked: 19
13: Point in prev array: 23 number of points tracked: 9
14: Point in prev array: 26 number of points tracked: 17
15: Point in prev array: 24 number of points tracked: 18
16: Point in prev array: 22 number of points tracked: 18
17: Point in prev array: 15 number of points tracked: 18
18: Point in prev array: 11 number of points tracked: 8
19: Point in prev array: 11 number of points tracked: 9
____CREATING/ DESTROYING STATUS ARRAY EVERY O.F. CYCLE.  ZEROING STATUS BUFFER: 1_____
1: Point in prev array: 65 number of points tracked: 0
2: Point in prev array: 57 number of points tracked: 63
3: Point in prev array: 62 number of points tracked: 55
4: Point in prev array: 56 number of points tracked: 61
5: Point in prev array: 57 number of points tracked: 54
6: Point in prev array: 43 number of points tracked: 56
7: Point in prev array: 44 number of points tracked: 43
8: Point in prev array: 36 number of points tracked: 44
9: Point in prev array: 28 number of points tracked: 36
10: Point in prev array: 25 number of points tracked: 26
11: Point in prev array: 21 number of points tracked: 22
12: Point in prev array: 12 number of points tracked: 19
13: Point in prev array: 23 number of points tracked: 9
14: Point in prev array: 26 number of points tracked: 17
15: Point in prev array: 24 number of points tracked: 18
16: Point in prev array: 22 number of points tracked: 18
17: Point in prev array: 15 number of points tracked: 18
18: Point in prev array: 11 number of points tracked: 8
19: Point in prev array: 11 number of points tracked: 9
...

2. Using GPU buffer → overwrite status with CUDA.

Since vpiArrayLockData(.) only supports HOST buffer wrapping.
Please try if you can wrap a GPU buffer (so you will have the pointer) with vpiArrayCreateWrapper(),

https://docs.nvidia.com/vpi/group__VPI__Array.html#ga4d59031e43ef10046675f38e54423374

Thanks.

Hi Aasta,

No worries, thanks for getting back to me.
That sounds promising indeed.

With the GPU backend, I am guessing you are suggesting zeroing the pointer I have used for wrapping the array using cudaMemset or writing elements to zero in a kernel ?

I shall give it a try and confirm. Thanks to you and your colleagues for all your hard work.

Kind regards,

Tom.

Hi Aasta,

I have had a try but sadly no luck.

Please could you give me an example on how to use the vpiArrayCreateWrapper()?

I have added a new branch with my attempt but I consistently get:

_____REUSING STATUS ARRAY.  ZEROING STATUS BUFFER: 1_____
terminate called after throwing an instance of 'std::runtime_error'
  what():  VPI_ERROR_INVALID_ARGUMENT: (cudaErrorInvalidValue) file: /Garford/VPI_Example/VPI_Example.cu:120

Aborted (core dumped)

My attempt creating the wrapped array is here:

    void createStatusArray()
    {
        // CHECK_VPI(vpiArrayCreate(maxCorners, VPI_ARRAY_TYPE_U8, 0, &featStatus));
        
        featStatusData.bufferType = VPI_ARRAY_BUFFER_CUDA_AOS;
        featStatusData.buffer.aos.sizePointer = &numStatusPoints; 
        featStatusData.buffer.aos.capacity = maxCorners; 
        featStatusData.buffer.aos.strideBytes = maxCorners; 

        gpuErrchk(cudaMalloc(&featStatusData.buffer.aos.data, maxCorners)); 

        featStatusData.buffer.aos.type = VPI_ARRAY_TYPE_U8;

        CHECK_VPI(vpiArrayCreateWrapper(&featStatusData, VPI_BACKEND_CUDA, &featStatus)); 
        
        std::cout << "Created array wrapper" << std::endl;;
    }

I am specifically struggling with adding the sizePointer parameter. I am currently setting it as a pointer to an int32_t value initialised as 0 which I update when I update numFeatures2Track in computePoints2Track().

I momentarily had it working on my x86_64 machine but now get VPI_ERROR_INVALID_ARGUMENT on both my laptop and NX board.

Kind regards,

Tom.

Hi,

It looks like there are some issues when wrapping a CUDA buffer to a VPI array.

We are checking this issue with our internal team.
Will let you know the following.

Thanks.

Hi Aasta,

I guess you guys are still working on it but let me know when you have an update.

Kind regards,

Tom.

Hi,

We are still checking this issue internally.
Will let you know once there is any progress.

Thanks.