Copying camera captures from NvBuffer to CPU DRAM is very slow using the Argus library

Hello,
I am using a TX2 + the Argus library + 2 Leopard IMX577 cameras to capture images.
I get the camera captures into an NvBuffer and have verified that the image is correct by displaying it on screen using OpenCV.
My problem is that transferring the data from the NvBuffer to CPU memory is really slow. My image size is 4056x3040 and I use the NvBufferColorFormat_ABGR32 format for the NvBuffer (pitch layout, pitch=16384). The buffer contains about 47.5 MB of data for a single capture, and it takes about 100 ms to copy those 47.5 MB to CPU system memory. I have tried memcpy, cv::Mat::copyTo and NvBuffer2Raw; they are all really slow. Our target is about 10-20 ms for the copy.

The purpose of this copy is that I need to convert the raw data to a cv::Mat afterwards for further processing when necessary, for example saving it to disk as a JPEG when the user requests that.

Do you have any solution for this?

Thanks,

Jon

For more details:
I am using EGLStream to get an EGLStream::Frame first, then get the native-buffer interface by calling

EGLStream::NV::IImageNativeBuffer *iNativeBuf = interface_cast<EGLStream::NV::IImageNativeBuffer>(eglImage);

Then I create the NvBuffer and copy it to CPU DRAM as follows:

    if(m_nvBufFd1 == -1)
    {
        /* m_imgResolution1 is 4056x3040 */
        m_nvBufFd1 = iNativeBuf->createNvBuffer(m_imgResolution1, NvBufferColorFormat_ABGR32, NvBufferLayout_Pitch, m_config.m_rotation, &status);
        if(m_nvBufFd1 == -1 || status != STATUS_OK)
        {
            std::cerr << "Cannot create nv buffer" << std::endl;
        }
    }
    else
    {
        status = iNativeBuf->copyToNvBuffer(m_nvBufFd1, m_config.m_rotation);
        if(status != STATUS_OK)
        {
            std::cerr << "Cannot copy to nv buffer left" << std::endl;
        }
    }

    void *pdata = nullptr;
    NvBufferMemMap(m_nvBufFd1, 0, NvBufferMem_Read, &pdata);
    NvBufferMemSyncForCpu(m_nvBufFd1, 0, &pdata);

    unsigned char *p = (unsigned char*)malloc(m_imgResolution1.height() * m_nvBufParam.pitch[0]);
    /* !!! HERE IT IS REALLY SLOW !!!!, it takes about 100ms. I have tried memcpy, cv::Mat::copyTo and NvBuffer2Raw*/
    int cpRet = NvBuffer2Raw(m_nvBufFd1, 0, m_imgResolution1.width(), m_imgResolution1.height(), p);
    if(cpRet == -1)
    {
        std::cerr << "NvBuffer2Raw error" << std::endl;
    }
    free(p);

Hi,
Please try this method to map the NvBuffer to a cv::Mat directly:
NVBuffer (FD) to opencv Mat - #6 by DaneLLL
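
For reference, here is a minimal sketch of that approach (assuming the pitch-linear ABGR32 NvBuffer created earlier in this thread; wrapNvBufferAsMat is an illustrative name and the exact cvtColor conversion code may need adjusting for the byte order on your setup):

#include <opencv2/opencv.hpp>
#include "nvbuf_utils.h"

/* Sketch: view the mapped NvBuffer plane as a cv::Mat without copying it. */
cv::Mat wrapNvBufferAsMat(int fd)
{
    NvBufferParams params;
    NvBufferGetParams(fd, &params);

    void *pdata = nullptr;
    NvBufferMemMap(fd, 0, NvBufferMem_Read, &pdata);
    NvBufferMemSyncForCpu(fd, 0, &pdata);

    /* No copy here: the Mat just points at the mapped plane, honoring the pitch. */
    cv::Mat wrapped(params.height[0], params.width[0], CV_8UC4, pdata, params.pitch[0]);

    /* Convert/copy only when a deep copy is actually needed (e.g. before imwrite). */
    cv::Mat bgr;
    cv::cvtColor(wrapped, bgr, cv::COLOR_RGBA2BGR);

    NvBufferMemUnMap(fd, 0, &pdata);
    return bgr;
}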

Hi @DaneLLL ,
I tried exactly the method mentioned in your post initially. Mapping the NvBuffer to a cv::Mat directly and then calling cvtColor was actually the first method I tried, and it is still slow. In detail, I have 2 cameras in the same capture session. The capture() command captures from both cameras at the same time. So I obtain the cv::Mat from cvtColor for the left camera first, then immediately do the same thing for the right camera. I found that for the left camera cvtColor is really slow, about 70-80 ms, whereas for the right camera it is pretty fast, around 15-18 ms.

Let me know if you need my code for further debugging; if so, I will write up a simple version that I can share with you.

Jon

Hi,
This is the optimal solution; there may not be much room for improvement. Please execute sudo tegrastats and check whether the CPU cores are at full load, capping the performance.

We have VPI functions for image processing that should bring better performance. Please take a look and see if you can find suitable functions for your use case:
VPI - Vision Programming Interface: Main Page

Hi @DaneLLL
I have tried to use VPI. If I could pass the VPIImage to the application code that uses my library, that would be great. However, I found a problem there.
My library is simple: one thread keeps capturing an image into a VPIImage every 100 ms.
The main thread, which is the application, reads the VPIImage and accesses its data.

However, when the application (main thread) accesses the VPIImage, it gives me either a bus error or a segmentation fault.

I wrote a sample based on the format_convert sample code that reproduces this issue. In the sample I have two threads: threadTask1 creates the VPIImage t1_imageGray from the wrapped OpenCV VPIImage. The code in the main thread tries to access t1_imageGray and save it to disk using cv::imwrite, and I get a segmentation fault there.
threadTask2, on the other hand, does everything, including saving the image to disk with cv::imwrite, in the same thread, and it has no problem at all.

My environment: TX2 + Argus library (0.97.3) + 2 Leopard IMX577 cameras, JetPack 4.5, VPI version:

libnvvpi1/stable,now 1.0.15 arm64 [installed]
  NVIDIA Vision Programming Interface library

I also got these logs when creating the VPIImage:

NVMEDIA_ARRAY:   53,  Version 2.1
NVMEDIA_VPI :  172,  Version 2.4

Can you help me point out what I am doing wrong and how to fix the issue?

Thanks

Jon
format_convert_thread.zip (1.9 KB)

Hello @DaneLLL ,
Did you have time to look at this issue yet? We are currently on a quite tight schedule for product development, so this is high priority for us.

Thanks,

Jon

Hi,
Which VPI functions do you need in your use case? VPI is meant to replace OpenCV. You can get the NvBuffer from Argus and call these functions to get a VPIImage:
https://docs.nvidia.com/vpi/group__VPI__NvBufferInterop.html
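
As a rough illustration of what that pipeline can look like (a minimal sketch: it assumes nvBufImg is a VPIImage already obtained from the NvBuffer wrapper described in the link above, that the CUDA backend is available, and that BGR8 is the format you want; error checking omitted):

#include <opencv2/opencv.hpp>
#include <vpi/Image.h>
#include <vpi/Stream.h>
#include <vpi/algo/ConvertImageFormat.h>

/* Sketch: convert a wrapped NvBuffer image with VPI and view the result as a cv::Mat. */
void convertAndInspect(VPIStream stream, VPIImage nvBufImg, int width, int height)
{
    VPIImage bgrImg = nullptr;
    vpiImageCreate(width, height, VPI_IMAGE_FORMAT_BGR8, 0, &bgrImg);

    /* Enqueue the conversion on the CUDA backend and wait for it to finish. */
    vpiSubmitConvertImageFormat(stream, VPI_BACKEND_CUDA, nvBufImg, bgrImg, NULL);
    vpiStreamSync(stream);

    /* Lock for CPU access and wrap the plane in a cv::Mat (no copy while locked). */
    VPIImageData data;
    vpiImageLock(bgrImg, VPI_LOCK_READ, &data);
    cv::Mat view(data.planes[0].height, data.planes[0].width, CV_8UC3,
                 data.planes[0].data, data.planes[0].pitchBytes);
    /* ... use 'view' here (e.g. cv::imwrite) while the image stays locked ... */
    vpiImageUnlock(bgrImg);

    vpiImageDestroy(bgrImg);
}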

Hi @DaneLLL ,
could you please read through the code and answer my question there? I wrote it in this post on Jan 31.

It looks like you have not read all my posts yet. Regarding the NvBuffer wrapping API, I have no problem using it. My current issue is how to use VPI in a multi-threaded program. Please do read through my example code (in the post of Jan 31) and let me know how to fix it.

Hi,
We have checked the sample but don't quite understand the use case. Please share more detail. The frame data from Argus is in NV12, and it seems you would like to convert it to a gray format? After you get the gray data, which VPI functions are called?

Please check and confirm whether your use case is like this:

Get NvBuffer from Argus -> vpiImageCreateNvBufferWrapper() -> vpiSubmitConvertImageFormat() -> which VPI function is called here?

Hi @DaneLLL ,
sorry for the confusion caused by the sample. My issue right now is that I cannot use a VPIImage in any thread other than the one that allocates the memory for it. If you run my program, you will see that threadTask1 always hits a segmentation fault.

The sample only demonstrates the issue I am currently facing; it has nothing to do with my use case. I can make my application work with VPI in a single thread. However, the camera capturing and VPIImage processing thread is not the main thread. My issue is that I want to use, in the main thread, the VPIImage generated by the camera capturing thread, but when the main thread tries to access that VPIImage, it gives me either a bus error or a segmentation fault.

I guess I will need to do some context management or something? Please help.

Jon

Hi @DaneLLL ,
Responding to your question,

Here is what the application does:

// In camera capturing thread
Get NvBuffer from Argus -> vpiImageCreateNvBufferWrapper() /* create VPIImages nvBufImgLeft and nvBufImgRight */ -> vpiImageCreate() /* create two VPIImages rgbImgLeft and rgbImgRight */ -> vpiSubmitConvertImageFormat() /* convert the color format from BGRA to RGB */ -> vpiStreamSync() /* sync rgbImgLeft and rgbImgRight */ -> save rgbImgLeft and rgbImgRight in a frame queue.

// In program main thread
Read rgbImgLeft and rgbImgRight from the frame queue -> get a cv::Mat from the VPIImage with the following function:
cv::Mat getMatFromVpiImage(VPIImage& rgbImg)
{
    VPIImageData outData;
    vpiImageLock(rgbImg, VPI_LOCK_READ, &outData);
    cv::Mat cvOut(outData.planes[0].height, outData.planes[0].width, CV_8UC3, outData.planes[0].data,
                  outData.planes[0].pitchBytes);

    cv::Mat retMat = cvOut.clone(); /* This is where I have segmentation fault or bus error */

    vpiImageUnlock(rgbImg);
    return retMat;
}

I think my issue is how to access the VPIImage memory from another thread. I hope this explanation helps.

Thanks

Jon

Hi,

There are two issues in your sample.

1. First, since the stream is shared between threads, please use a synchronization call rather than destroying it.
(Actually, this looks like a typo to me.)

void threadTask2(std::string fileName, VPIStream stream, VPIImage img, VPIImage imgGray)
{
    ...
    vpiStreamSync(stream);
    vpiImageDestroy(img);
    ...
}

2. Second, since you want to access the data prepared in the thread, you should pass t1_imageGray by pointer (VPIImage*) rather than by value.

int main(int argc, char *argv[])
{
    ...
    std::thread vpiImageWork_t1(threadTask1, argv[1], stream, t1_image, &t1_imageGray);
    ...
}

Below is the complete change for running your sample correctly:

diff --git a/main.cpp b/main.cpp
index 51e31b4..0c09a12 100644
--- a/main.cpp
+++ b/main.cpp
@@ -10,7 +10,7 @@
 #include <vpi/algo/ConvertImageFormat.h>
 using namespace std::chrono;
 
-void threadTask1(std::string fileName, VPIStream stream, VPIImage img, VPIImage imgGray)
+void threadTask1(std::string fileName, VPIStream stream, VPIImage img, VPIImage* imgGray)
 {
     cv::Mat cvImage = cv::imread(fileName);
     if (cvImage.data == NULL)
@@ -20,9 +20,9 @@ void threadTask1(std::string fileName, VPIStream stream, VPIImage img, VPIImage
     }
 
     vpiImageCreateOpenCVMatWrapper(cvImage, 0, &img);
-    vpiImageCreate(cvImage.cols, cvImage.rows, VPI_IMAGE_FORMAT_U8, 0, &imgGray);
+    vpiImageCreate(cvImage.cols, cvImage.rows, VPI_IMAGE_FORMAT_U8, 0, imgGray);
 
-    VPIStatus st = vpiSubmitConvertImageFormat(stream, VPI_BACKEND_CUDA, img, imgGray, NULL);
+    VPIStatus st = vpiSubmitConvertImageFormat(stream, VPI_BACKEND_CUDA, img, *imgGray, NULL);
     if(st != VPI_SUCCESS)
     {
         std::cout << "vpiSubmitConvertImageFormat error=" << st << std::endl;
@@ -61,7 +61,7 @@ void threadTask2(std::string fileName, VPIStream stream, VPIImage img, VPIImage
 
     vpiImageUnlock(imgGray);
 
-    vpiStreamDestroy(stream);
+    vpiStreamSync(stream);
     vpiImageDestroy(img);
     vpiImageDestroy(imgGray);
 }
@@ -82,7 +82,7 @@ int main(int argc, char *argv[])
     VPIImage t1_image, t1_imageGray;
     VPIImage t2_image, t2_imageGray;
 
-    std::thread vpiImageWork_t1(threadTask1, argv[1], stream, t1_image, t1_imageGray);
+    std::thread vpiImageWork_t1(threadTask1, argv[1], stream, t1_image, &t1_imageGray);
     std::thread vpiImageWork_t2(threadTask2, argv[1], stream, t2_image, t2_imageGray);
 
     vpiImageWork_t1.join();

Thanks.

Hi @AastaLLL
I have used your method and it works on my sample code, but I have not tried it on my actual application yet. I also used std::shared_ptr to allocate the VPIImage and pass the shared pointer between the different threads; that works as well.

One thing I do not really understand is that a VPIImage is essentially just a pointer to struct VPIImageImpl.

// This is in /opt/nvidia/vpi1/include/vpi/Types.h
/**
 * A handle to an image.
 * @ingroup VPI_Image
 */
typedef struct VPIImageImpl *VPIImage;

My original way of passing the VPIImage was to simply copy the VPIImage to the other thread (as an argument, of course). The value of the VPIImage is the same in both threads, and they should point to the same VPIImageImpl instance created by vpiImageCreate.

My point is that the underlying data (the VPIImageImpl instance) pointed to by the VPIImage is the same in both threads. Why does it not work when passing a VPIImage? Why do we have to pass a VPIImage* to make it work across threads?

Jon

Hi @AastaLLL or @DaneLLL
I have not applied your suggestion to our production code yet; I need to fully understand it before applying the changes. Could you please answer the question in my previous post so that I can proceed? Many thanks!

Jon

Hi,
A VPIImage is stored in a structure, and passing a pointer to a structure to a thread is common in C programming. Not sure why there is further concern about this. We would suggest you check the patch and follow it.

Hi @DaneLLL
why do I need to pass a VPIImage* instead of a VPIImage between threads to make it work?
Here is what I do not understand:
VPIImage is a pointer to a struct, so passing a VPIImage to another thread is just passing a pointer, which I do not think should be an issue. In reality, however, I need to pass a VPIImage*; why is that?

Thanks,

Jon

Hi,
t1_imageGray is accessed in the main thread:

vpiImageLock(t1_imageGray, VPI_LOCK_READ, &outData);

So if you create it in threadTask1(), the main thread cannot read the VPIImage correctly.

t2_imageGray is only accessed in threadTask2(), so it does not have this issue.

Another thing is that you access t1_imageGray in the main thread but have already destroyed it at the end of threadTask1():

vpiImageDestroy(imgGray);

We suggest moving all destroy calls to the end of the main thread.
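
To illustrate the pointer-versus-value point in plain C++ (a minimal sketch, not VPI-specific): when the handle is only created inside the thread function, a by-value parameter updates the thread's local copy, so the variable in main() never receives the new handle.

#include <thread>

struct Impl {};            /* stands in for VPIImageImpl */
typedef Impl *Handle;      /* stands in for VPIImage (a pointer type) */

void createByValue(Handle h)
{
    h = new Impl();        /* only this thread's copy of the argument is updated (and leaked) */
}

void createByPointer(Handle *h)
{
    *h = new Impl();       /* writes through to the caller's variable */
}

int main()
{
    Handle a = nullptr, b = nullptr;

    std::thread(createByValue, a).join();
    /* a is still nullptr here, just like t1_imageGray stayed invalid in main(). */

    std::thread(createByPointer, &b).join();
    /* b now refers to the object created inside the thread. */

    delete b;
    return 0;
}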

Hi @DaneLLL
I have corrected my function and now it points to the correct VPIImage in the other thread. However, there is still another issue: when I try to save the image to a jpg file using OpenCV imwrite, it gives me a bus error. Most of the time the image is only partially saved; sometimes you cannot read the image at all. Please see the attachment below for what I mean by partially saved.

The workflow of the code is

In camera capturing thread:
Get NvBuffer from Argus -> vpiImageCreateNvBufferWrapper() -> vpiImageCreate() -> vpiSubmitConvertImageFormat() -> vpiStreamSync() -> assign the VPIImage to a frame buffer queue that is accessed by both the capturing thread and the main thread.

In main thread
Get the VPIImage from the frame buffer queue -> vpiImageLock -> create cvOut -> imwrite -> vpiImageUnlock
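
For context, the hand-off between the two threads is roughly the following (a simplified sketch; frameQueue and queueMutex are made-up names, and the real code has more bookkeeping, e.g. separate left/right images):

#include <deque>
#include <mutex>
#include <vpi/Types.h>

/* Shared between the capturing thread and the main thread. */
static std::deque<VPIImage> frameQueue;
static std::mutex queueMutex;

/* Capturing thread: after vpiStreamSync(), publish the converted image. */
void publishFrame(VPIImage rgbImg)
{
    std::lock_guard<std::mutex> lock(queueMutex);
    frameQueue.push_back(rgbImg);
}

/* Main thread: take the oldest frame, then vpiImageLock/imwrite/vpiImageUnlock it. */
bool takeFrame(VPIImage *out)
{
    std::lock_guard<std::mutex> lock(queueMutex);
    if (frameQueue.empty())
        return false;
    *out = frameQueue.front();
    frameQueue.pop_front();
    return true;
}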

In the main thread, imwrite crashes with a bus error. If I call imwrite in the capturing thread, it works. I have verified that the VPIImage values are the same in both threads.

Please help. If you need to see the code, let me know and I will copy the necessary parts for you.

Jon

There has been no update from you for a period, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

Hi,
Please share a patch against the reference samples in:

/opt/nvidia/vpi1/samples

so that we can apply the patch and reproduce the issue.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.