How to avoid a double copy of the image from the EGL stream when creating a batched NvBufSurface

• Hardware Platform (Jetson) - Jetson Xavier AGX and NX
• DeepStream Version - DeepStream 6.2
• JetPack Version (valid for Jetson only) - L4T R35.3.1
• TensorRT Version - N/A
• Issue Type - Question

So I have my own GStreamer plugin that processes frames from multiple sensors, batches them into one package, and pushes them downstream to nvinfer. This plugin is based on the gstnvarguscamerasrc plugin, and the algorithm it uses is similar to the one in gstnvarguscamerasrc, except for the batching part:

  • 4 "sensor threads" to retrieve the image and copy it (copy #1) into a dmabuf:
      // Get the IImageNativeBuffer extension interface and create the fd.
      NV::IImageNativeBuffer *iNativeBuffer =
        interface_cast<NV::IImageNativeBuffer>(iFrame->getImage());
      if (!iNativeBuffer)
        ORIGINATE_ERROR("IImageNativeBuffer not supported by Image.");

      if (src->frameInfo->fd < 0)
      {
        // Create dmabuf descriptor for one-batched surface
        src->frameInfo->fd = iNativeBuffer->createNvBuffer(streamSize,
                NVBUF_COLOR_FORMAT_YUV420,
                NVBUF_LAYOUT_BLOCK_LINEAR);
        if (!src->silent)
          CONSUMER_PRINT("Acquired Frame. %d\n", src->frameInfo->fd);
      }
      // Copy #1: copy the acquired image into the existing dmabuf
      else if (iNativeBuffer->copyToNvBuffer(src->frameInfo->fd) != STATUS_OK)
      {
        ORIGINATE_ERROR("Failed to copy frame to NvBuffer.");
      }

The main thread processes the frame and pushes it into the pipeline. It copies the dmabuf retrieved from every sensor into the 4-batched surface:

    // Create a 4-batched surface and fill it with framebufs from the 4 sensors (copy #2):
    ret = gst_buffer_pool_acquire_buffer (src->pool, &buffer, NULL); // batch is 4 frames here
    .....
    // dmabuf descriptors are for the 4-batched surface
    NvBufSurface* surf = (NvBufSurface *)outmap.data;
    NvDsBatchMeta* batch_meta = createBatchMetaData(src);
    NvBufferTransformParams transform_params = {};

    for (int i = 0; i < src->sensors.size(); i++) {
        // Copy #2: blit the per-sensor dmabuf into batch slot i
        int retn = NvBufferTransform(consumerFrameInfo[i]->fd, (gint) surf->surfaceList[i].bufferDesc, &transform_params);
        ....  // Add some metadata to the batch
    }
    ....  // Release the used buffer to allow the sensor thread to grab a new image

    surf->numFilled = src->sensors.size();
    surf->batchSize = src->sensors.size();

So this algorithm works fine and I'm getting what I need. But it involves a double copy, and I have started to notice that on JP5-based devices my FPS is significantly worse than on JP4-based devices (L4T R32.4.4, DeepStream 5.0).
Some quick profiling showed that the combined frame copying (both copies together) takes almost the entire frame time (the target is 30 FPS). That is, JP5 devices process frames in about 35 ms on average, while JP4 devices stay within 33 ms under the same setup.
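
For reference, the profiling was nothing fancy; roughly the sketch below, wrapped around both copy #1 and copy #2 in the real code (illustrative only, the variables come from the loop above):

    // needs <glib.h> (g_get_monotonic_time) and "nvbuf_utils.h" (NvBufferTransform)
    gint64 t0 = g_get_monotonic_time();   // microseconds
    int retn = NvBufferTransform(consumerFrameInfo[i]->fd,
                                 (gint) surf->surfaceList[i].bufferDesc,
                                 &transform_params);
    gint64 elapsed_us = g_get_monotonic_time() - t0;
    g_print("copy #2 for sensor %d: %" G_GINT64_FORMAT " us (ret=%d)\n",
            i, elapsed_us, retn);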

So my main idea for reducing resource consumption is to avoid the double copy while processing the frame. With my new approach I do the following.
In the main thread:

  • Acquire a gstbuffer from the pool (this is the 4-batched NvBufSurface).
  • Pass the dmabuf fd of each batch slot of this surface (bufferDesc) to the corresponding sensor thread (1:1 mapping between the surface batch list and the sensor threads).
  • Don't copy the dmabuf anymore; expect the image to be copied directly into surf->surfaceList[sensor_id].bufferDesc (for every batch slot). See the sketch right after this list.
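
Roughly, the main-thread side of this handoff looks like the sketch below (simplified; dmabuf[] is just a plain int array shared with the sensor threads, and the name is only illustrative):

    // Acquire a 4-batched buffer from the pool and expose its per-slot dmabuf fds
    ret = gst_buffer_pool_acquire_buffer (src->pool, &buffer, NULL);
    .....  // map the buffer into outmap
    NvBufSurface *surf = (NvBufSurface *) outmap.data;
    for (guint i = 0; i < surf->batchSize; i++) {
      // hand each slot's fd to the matching sensor thread (1:1 mapping)
      dmabuf[i] = (int) surf->surfaceList[i].bufferDesc;
    }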

In each sensor thread (we have 4 such threads) we now do the following:

    src->frameInfo->fd = dmabuf[sensor_idx]; // dmabuf[sensor_idx] comes from the array of descriptors collected from the surf->surfaceList array (bufferDesc field)
    iNativeBuffer->copyToNvBuffer(src->frameInfo->fd); // Note that we don't create a dmabuf anymore; we use the preallocated fd from the gst_buffer.

So the idea is that we copy the image straight into the dmabuf of the NvBufSurface that is mapped directly to our gst buffer. This algorithm works fine on JP4 devices, but not on JP5. What I get at the pipeline output on a JP5 device is 3 empty buffers (green screen) for the last 3 batch slots. The tile for the first batch slot shows frames from all 4 sensors interchangeably. So it feels like iNativeBuffer->copyToNvBuffer(src->frameInfo->fd) is copying all frames straight into the first batch slot (and whichever thread executed last gets its buffer stored and displayed).

Our further debugging showed that iNativeBuffer->copyToNvBuffer(src->frameInfo->fd) internally uses NvBufSurfaceFromFd, and NvBufSurfaceFromFd always returns the same NvBufSurface for all 4 descriptors taken from the allocated gst buffer. For some reason the copy then ignores the provided dmabuf fd and just copies the image into the first surfaceList entry of that surface. Please note that on JP4 copyToNvBuffer doesn't use NvBufSurfaceFromFd, and there I get exactly what I need: all 4 images are shown correctly.
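
The check was essentially the sketch below (simplified; the helper name is only for illustration). On JP5 it prints the same parent surface for all four fds, which matches the behaviour described above:

    #include "nvbufsurface.h"
    #include <cstdio>

    // Map each per-slot dmabuf fd back to the NvBufSurface it belongs to.
    static void check_fd_mapping (NvBufSurface *batched_surf)
    {
      for (uint32_t i = 0; i < batched_surf->batchSize; i++) {
        int fd = (int) batched_surf->surfaceList[i].bufferDesc;
        void *parent = NULL;
        if (NvBufSurfaceFromFd (fd, &parent) != 0) {
          fprintf (stderr, "NvBufSurfaceFromFd failed for fd %d\n", fd);
          continue;
        }
        // If 'parent' is identical for every slot, the fd alone does not
        // identify the batch slot, so the copy always lands in surfaceList[0].
        printf ("slot %u: fd=%d -> surface %p\n", i, fd, parent);
      }
    }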

Looking at the examples from nvidia-l4t-jetson-multimedia-api_35.1.0-20220825113828_arm64.deb, I noticed that surf->surfaceList is always referenced either as a pointer to the surface or as surfaceList[0]. In other words, everywhere in that package only the first batch slot is taken into account; any other entries in surfaceList are ignored.

So with this story I have a few questions:

  • By creating a batch of frames from multiple sensors in our custom arguscamerasrc-based plugin, are we taking the right approach? Or is there another way to create a batched frame (surface) for nvinfer processing?
  • If this is the right way to create a batched frame, how can I avoid the double copy in my plugin, given that I retrieve the image from an EGL stream and package it into a 4-batched surface? I looked at different APIs (NativeBuffer, the Image interface, NvBuffer, NvBufSurface), but none of them seems to help me avoid the double copy.
  • Can you confirm that I'm using NvBufSurface correctly in this scenario? Also, is it expected that providing a 4-batched surface to iNativeBuffer->copyToNvBuffer will always write to slot 0 and never to the other batch slots, no matter which dmabuf descriptor I provide?

Any help or suggestions would be very much appreciated.


Have you tried with DeepStream? See "Welcome to the DeepStream Documentation" in the DeepStream 6.4 documentation.

Thanks Fiona for your reply, but can you be more specific about what you suggest?
For us the preferred option would be to prepare the batch programmatically in one plugin, instead of using a pipeline with nvvideoconvert and nvstreammux. Could we get some advice on achieving this in a single GStreamer plugin?

Can you tell us why?

What do you mean by double copy? When you get data from the camera, the data is already in NV12 format in an NvBufSurface. It can be batched directly by nvstreammux without any conversion. I don't know why you need to combine the batch yourself. Where, how, and why did you copy the buffers?

There are a few reasons for that. But mainly we want to synchronize lighting settings across all sensors (a detail of our Tegra board implementation).

Also, are you sure that in the scenario you describe there is only one copy? We certainly copy the image to extract it from the sensors and push it further down the pipeline (like gstnvarguscamerasrc does). But I suspect (still to be verified) that nvstreammux does an internal copy as well, causing the same "double copy".

So for us it felt natural to use our own plugin to gain more control over frame processing and settings synchronization. But we just can't figure out how to create a 4-batched frame from four 1-batched frames without copying the dmabuf from the "sensor" frame into the batched package. From what I can tell looking at the NvBufSurface interface, it is not possible without this additional copy.

We are sure. There is no copy inside nvstreammux if there is no scaling configuration.

The so-called "copy" is a CUDA memory copy; it is very efficient and you can ignore it. We have already implemented batch generation in DeepStream. We recommend our customers use our solution whenever nvstreammux can meet the requirement.

This is great! Could you refer me to this CUDA memory copy API for preparing a 4-batched frame? We are using NvBufferTransform to prepare this batch. I'm wondering whether that transform is a CUDA memcpy, or whether there is a different API to use.

There is no need to copy when combining the batch data.

Why did you have to use NvBufferTransform?

Please use DeepStream pipeline to avoid the double “copy”.
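
For illustration, a minimal sketch of the kind of pipeline being suggested, with one nvarguscamerasrc per sensor feeding nvstreammux, which forms the batch for nvinfer without an extra copy in the application code (resolutions, properties and the nvinfer config path are placeholders, not a tested configuration):

    #include <gst/gst.h>

    int main (int argc, char *argv[])
    {
      gst_init (&argc, &argv);
      GError *err = NULL;
      // Four camera sources feed the "mux" element, which assembles the 4-batch.
      GstElement *pipeline = gst_parse_launch (
          "nvstreammux name=mux batch-size=4 width=1920 height=1080 live-source=1 ! "
          "nvinfer config-file-path=config_infer.txt ! "
          "nvmultistreamtiler rows=2 columns=2 width=1920 height=1080 ! fakesink "
          "nvarguscamerasrc sensor-id=0 ! mux.sink_0 "
          "nvarguscamerasrc sensor-id=1 ! mux.sink_1 "
          "nvarguscamerasrc sensor-id=2 ! mux.sink_2 "
          "nvarguscamerasrc sensor-id=3 ! mux.sink_3", &err);
      if (!pipeline) {
        g_printerr ("Failed to build pipeline: %s\n", err ? err->message : "unknown");
        return -1;
      }
      gst_element_set_state (pipeline, GST_STATE_PLAYING);
      // Block until an error or EOS, then tear down.
      GstBus *bus = gst_element_get_bus (pipeline);
      GstMessage *msg = gst_bus_timed_pop_filtered (bus, GST_CLOCK_TIME_NONE,
          (GstMessageType) (GST_MESSAGE_ERROR | GST_MESSAGE_EOS));
      if (msg)
        gst_message_unref (msg);
      gst_object_unref (bus);
      gst_element_set_state (pipeline, GST_STATE_NULL);
      gst_object_unref (pipeline);
      return 0;
    }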

I understand your suggestion, but I'm not convinced it is the solution to my problem.
I will simplify the scenario described in the summary, hoping you can answer a simple question: whether such a thing is possible at all.
So I have this plugin that extracts images from 4 sensors and sends them down the pipeline as a batch. It's a libargus-based plugin:

// thread initialization:
.....
// We have an allocated pool of buffers that will hold the batched frames (batch-size is 4).
  src->pool = gst_nvds_buffer_pool_new();

  GstStructure *config = gst_buffer_pool_get_config (src->pool);
  gst_buffer_pool_config_set_params (config, src->outcaps, sizeof (NvBufSurface), MIN_BUFFERS, MAX_BUFFERS);
  gst_structure_set(config, 
                    "memtype", G_TYPE_INT, NVBUF_MEM_DEFAULT,
                    "gpu-id", G_TYPE_UINT, 0, 
                    "batch-size", G_TYPE_UINT, 4, NULL);
  gst_buffer_pool_set_config (src->pool, config);

// And here we set up the array of consumers (I have them as separate threads, but it shouldn't matter for my case).
IFrameConsumer* consumers[4] = {nullptr};
// Init code here......


//Now, in the main loop
.....
    int fds[4] = {0};
    ret = gst_buffer_pool_acquire_buffer (src->pool, &buffer, NULL);

    GstMapInfo outmap = GST_MAP_INFO_INIT;

    if (!mapBuffer(outmap, buffer)){
      GST_ERROR_OBJECT(src, "no memory block");
    }

    NvBufSurface* surf = (NvBufSurface *)outmap.data;

    for (int i = 0; i < src->sensors.size(); i++) {
        fds[i] = surf->surfaceList[i].bufferDesc;
    }
// Effectively what we did here was:
// - allocate a gst buffer and retrieve the NvBufSurface from the acquired buffer
// - extract the dmabuf fd of every batch slot into the fds array for further use
.....

// Acquire a frame from every sensor and copy it straight into its batch slot
        for (int i = 0; i < 4; i++) {
            UniqueObj<Frame> frame(
                consumers[i]->acquireFrame(consumer_wait_time_us * 1000));
            IFrame* iFrame = getIFrameInterface(frame);
            NV::IImageNativeBuffer* iNativeBuffer = interface_cast<NV::IImageNativeBuffer>(iFrame->getImage());
            iNativeBuffer->copyToNvBuffer(fds[i]);   // no createNvBuffer here; write into the preallocated fd
        }

Please note this is pseudocode and I don't have it all in one thread, but for the purpose of illustrating my setup it should work.
So, I'm using NV::IImageNativeBuffer to extract the EGLImage from the consumer, and I want to store it in my own dmabuf (that is, I'm not calling iNativeBuffer->createNvBuffer, because I want to use the dmabuf that is already pre-allocated in the gst_buffer_pool).
So, given the task of creating a 4-batched GST_BUFFER from 4 EGL streams, does this program make sense?
For me this code works perfectly fine on JP4 devices (DeepStream 5.0 and L4T 32.4.4), but it doesn't work on a JP5 device (DeepStream 6.0 and L4T 35.3.1).
[gif attachment: output]
You can find the result of the execution in the gif attached. To generate this image I used the following pipeline on the JP5 device:

gst-launch-1.0 mycustomarguscamerasrc num-buffers=900 num-sensors=4 ! queue ! nvmultistreamtiler width=1920 height=1080 rows=2 columns=2 ! nvvideoconvert ! "video/x-raw(memory:NVMM),width=1920,height=1080,format=I420" ! nvv4l2h265enc ! h265parse ! matroskamux ! filesink location="test.mkv"

Can you help me understand what exactly I'm doing wrong here and how I can prepare a 4-batched frame for the GST_BUFFER?

What is your purpose in combining the batch? To adapt the customized nvarguscamerasrc to the DeepStream pipeline? If you only want to tile the multiple cameras for viewing, nvcompositor is enough; there is no need to combine the batch.

We have done all the work in nvstreammux to reduce extra memory copies. Please use the DeepStream nvstreammux to construct your DeepStream application pipeline.

The Jetson multimedia GStreamer plugins have been switched from NvBuffer to NvBufSurface since JP5.0. The Jetson GStreamer plugin nvarguscamerasrc can work seamlessly with DeepStream plugins. You don’t need to do batching by yourself any more. Please follow DeepStream usage only.

So the main reason we are using one Argus source for all cameras is to have full control over frame synchronization, including lighting settings. We use our own algorithm to define how sensor settings should be applied. This won't be possible with the nvarguscamerasrc-like approach, as we would lose the ability to control our sensors in a unified way.

To be fair, it's becoming a bit frustrating that a seemingly trivial operation that used to work on JP4 devices does not work on JP5, and there is no decent solution for it except going back to the 1-batched approach.

Understood. But the camera settings have nothing to do with the video buffers. It is not necessary to batch the videos inside nvarguscamerasrc.

The problem is more than just camera settings; it's the whole camerasrc architecture that has been built around it for years.
And keep in mind that we use 4 sensors and nvinfer in the same pipeline, and on top of that we generate our own videos via a GStreamer pipeline with nvvidconv involved. That means the VIC in our system is quite heavily loaded. Having only one copy inside camerasrc would really make a difference for us (under such a VIC load, a frame copy is not free at all).
