Please provide complete information as applicable to your setup.
• Hardware Platform (Jetson / GPU)
Jetson
• DeepStream Version
6.2
• JetPack Version (valid for Jetson only)
5.1.1
• TensorRT Version
8.5.2.2-1+cuda11.4
• NVIDIA GPU Driver Version (valid for GPU only)
• Issue Type (questions, new requirements, bugs)
Questions
• How to reproduce the issue? (This is for bugs. Include which sample app is used, the configuration file contents, the command line used, and other details for reproducing.)
• Requirement details (This is for a new requirement. Include the module name (for which plugin or which sample application) and the function description.)
Is there a better (faster) way to get the surface's block linear frames into a pitch linear CUDA buffer? Currently, for each batch I map the surface with NvBufSurfaceMapEglImage, then for each stream I call cudaGraphicsEGLRegisterImage and cudaGraphicsResourceGetMappedEglFrame, copy the data with cudaMemcpy2DFromArrayAsync, and call cudaGraphicsUnregisterResource, and at the end of it all I call NvBufSurfaceUnMapEglImage. From my testing this can be very slow if we run inference at the same time: around 20 ms on average for 6 streams, with a minimum of 5 ms and a maximum of 37 ms. Without inference the min/max stay about the same, but the average drops close to the 5 ms minimum. This severely limits the frame rate we can achieve, because we need to cache the frame data and access it later. 62% of the total time is spent in cudaGraphicsUnregisterResource, and 16% in cudaGraphicsEGLRegisterImage.
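For reference, here is a trimmed sketch of that per-batch path (error handling removed; the destination buffers, pitch and stream are my own pre-allocated state, and I'm assuming NV12 NVBUF_MEM_SURFACE_ARRAY memory exposed as cudaEglFrameTypeArray):

#include <cuda_egl_interop.h>
#include <cuda_runtime.h>
#include <nvbufsurface.h>

// Sketch of the current per-batch copy from block linear surfaces into
// pre-allocated pitch linear device buffers (dstY/dstUV, one pair per stream).
void copyBatchToPitchLinear(NvBufSurface *surf, unsigned char **dstY,
                            unsigned char **dstUV, size_t dstPitch,
                            cudaStream_t stream) {
  NvBufSurfaceMapEglImage(surf, -1);  // map every buffer in the batch
  for (unsigned int i = 0; i < surf->numFilled; ++i) {
    NvBufSurfaceParams &p = surf->surfaceList[i];
    cudaGraphicsResource_t res = nullptr;
    cudaGraphicsEGLRegisterImage(&res, p.mappedAddr.eglImage,
                                 cudaGraphicsRegisterFlagsReadOnly);  // ~16% of the time
    cudaEglFrame frame;
    cudaGraphicsResourceGetMappedEglFrame(&frame, res, 0, 0);
    // Block linear planes are exposed as cudaArray objects (Y + interleaved UV).
    cudaMemcpy2DFromArrayAsync(dstY[i], dstPitch, frame.frame.pArray[0], 0, 0,
                               p.planeParams.width[0] * p.planeParams.bytesPerPix[0],
                               p.planeParams.height[0], cudaMemcpyDeviceToDevice, stream);
    cudaMemcpy2DFromArrayAsync(dstUV[i], dstPitch, frame.frame.pArray[1], 0, 0,
                               p.planeParams.width[1] * p.planeParams.bytesPerPix[1],
                               p.planeParams.height[1], cudaMemcpyDeviceToDevice, stream);
    cudaStreamSynchronize(stream);
    cudaGraphicsUnregisterResource(res);  // ~62% of the total time is spent here
  }
  NvBufSurfaceUnMapEglImage(surf, -1);
}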
Hi @junshengy, my aim is to be able to convert it to an OpenCV Mat on demand. I don’t technically need to convert it to pitch linear at the time of buffering the frames. I only need to do it when I eventually want to convert it to an OpenCV Mat. But I don’t need to do that for every frame so in theory I could buffer block linear memory and only convert it when required.
Basically what we're doing is buffering all the frames, and when we detect an event we go back in history and request an OpenCV Mat for those time points.
Would the most efficient approach be to allocate a surface with NvBufSurfaceCreate and copy the current batch into it with NvBufSurfaceCopy, and then, at the time the OpenCV Mat is requested, convert block linear to pitch linear and copy it into an OpenCV buffer?
If you only want to convert it to an OpenCV Mat on demand, here is some sample code to do it.
#include <opencv2/imgproc.hpp>

#include "nvbufsurface.h"

int videoBlur(int fd) {
  NvBufSurface *surface = nullptr;
  if (NvBufSurfaceFromFd(fd, (void **)(&surface)) < 0) {
    ERROR_MSG("NvBufSurfaceFromFd failed");
    return -1;
  }
  cv::Mat in_mat;
  cv::Rect crop_rect;
  /* Map the buffer so that it can be accessed by the CPU */
  if (surface->surfaceList[0].mappedAddr.addr[0] == NULL) {
    if (NvBufSurfaceMap(surface, 0, 0, NVBUF_MAP_READ_WRITE) != 0) {
      ERROR_MSG("Map for CPU access failed");
      return -1;
    }
  }
  /* Invalidate the cache before the CPU accesses the buffer */
  if (surface->memType == NVBUF_MEM_SURFACE_ARRAY) {
    NvBufSurfaceSyncForCpu(surface, 0, 0);
  }
  /* Map to cv::Mat with the CPU address/width/height/stride of the video frame.
   * Map to CV_8UC4 as the video frame format is RGBA.
   */
  in_mat = cv::Mat(surface->surfaceList[0].planeParams.height[0],
                   surface->surfaceList[0].planeParams.width[0], CV_8UC4,
                   surface->surfaceList[0].mappedAddr.addr[0],
                   surface->surfaceList[0].planeParams.pitch[0]);
  /* Apply blur on the top-left quadrant of the image */
  crop_rect = cv::Rect(0, 0, surface->surfaceList[0].planeParams.width[0] / 2,
                       surface->surfaceList[0].planeParams.height[0] / 2);
  /* Apply gaussian blur */
  cv::GaussianBlur(in_mat(crop_rect), in_mat(crop_rect), cv::Size(15, 15), 4);
  /* Flush the cache after the CPU has accessed the buffer so that hardware
   * devices can access the buffer
   */
  if (surface->memType == NVBUF_MEM_SURFACE_ARRAY) {
    NvBufSurfaceSyncForDevice(surface, 0, 0);
  }
  /* UnMap the buffer */
  if (NvBufSurfaceUnMap(surface, 0, 0)) {
    ERROR_MSG("UnMap failed");
    return -1;
  }
  return 0;
}
You can get the fd from NvBufSurface->surfaceList[batch_index].bufferDesc.
The above example handles the conversion from block linear to pitch linear.
Remapping the hardware buffer should work for your needs.
Hi @junshengy, what I meant by "on-demand" is that I want to be able to get a cached frame from X seconds ago (not the current batch). The cache doesn't have to be a cv::Mat; it can be any data type that can eventually be converted to a cv::Mat, because I cache far more frames than I convert.
I used the code you suggested above a long time ago, but it’s much slower than my current approach.
No, we already use smart recording, but it’s not what we want in this scenario. I’m probably not explaining myself very well.
We need to be able to get an image at the start of the event. But we don’t know that the event has occurred until later (say for example at the end of a track). In order to be able to do this we need to cache all frames. We can then go back in time and select one or more frames from the event. This is very fast on dGPU but it’s slow on Jetson.
We keep a circular buffer of pre-allocated cv::GpuMat memory. On dGPU all we need to do is call nppiNV12ToBGR_8u_P2C3R_Ctx to convert the NV12 surface to cv::GpuMat BGR memory. Because Jetson is using block linear layout we have to do a lot of transformations. On Jetson we have to map the NV12 surface to an EGLImage and then copy this data into CUDA memory which can then be used to call nppiNV12ToBGR_8u_P2C3R_Ctx. These extra steps have a large overhead.
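To illustrate, this is roughly all the dGPU path has to do per frame (a sketch, assuming a pitch linear NV12 NvBufSurfaceParams with a device-accessible dataPtr and a pre-allocated CV_8UC3 GpuMat from our circular buffer; "ctx" wraps our CUDA stream):

#include <nppi_color_conversion.h>
#include <opencv2/core/cuda.hpp>

#include "nvbufsurface.h"

// Sketch of the dGPU conversion: one NPP call from pitch linear NV12 straight
// into the cached cv::cuda::GpuMat (BGR).
void nv12SurfaceToGpuMat(const NvBufSurfaceParams &p, cv::cuda::GpuMat &bgr,
                         NppStreamContext &ctx) {
  const Npp8u *src[2] = {
      (const Npp8u *)p.dataPtr,                            // Y plane
      (const Npp8u *)p.dataPtr + p.planeParams.offset[1]   // interleaved UV plane
  };
  NppiSize roi = {(int)p.width, (int)p.height};
  nppiNV12ToBGR_8u_P2C3R_Ctx(src, (int)p.planeParams.pitch[0], bgr.ptr<Npp8u>(),
                             (int)bgr.step, roi, ctx);
}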
We need to always cache all frames because we never know when things will happen, but most of these frames will just be discarded because nothing happened. So an acceptable workaround would be to cache a more Jetson-friendly surface format and only convert frames when they're needed. When I find some time I'll try to create a copy of the NvBufSurface itself and cache that.
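Something like the following is what I have in mind for the cache slots (a sketch only; I haven't verified these create parameters, and the helper name is mine):

#include "nvbufsurface.h"

// Sketch: allocate a cache slot that matches the incoming batch so the block
// linear frames can be copied as-is and converted to pitch linear only later.
NvBufSurface *allocCacheSlot(const NvBufSurface *src) {
  NvBufSurfaceCreateParams params = {};
  params.gpuId = src->gpuId;
  params.width = src->surfaceList[0].width;
  params.height = src->surfaceList[0].height;
  params.colorFormat = src->surfaceList[0].colorFormat;
  params.layout = src->surfaceList[0].layout;   // keep block linear on Jetson
  params.memType = src->memType;                // NVBUF_MEM_SURFACE_ARRAY here
  NvBufSurface *dst = nullptr;
  if (NvBufSurfaceCreate(&dst, src->batchSize, &params) != 0)
    return nullptr;
  return dst;
}

// Per batch: NvBufSurfaceCopy(batchSurface, cacheSlot); convert on demand later.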
There is no update from you for a period, assuming this is not an issue anymore. Hence we are closing this topic. If need further support, please open a new one. Thanks
Using a CUDA kernel is the fastest way to convert the NV12 surface to cv::GpuMat BGR memory on Jetson.
However, we don't currently have a sample that does this.
You may need to refer to the CUDA documentation.
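For illustration only (this is not an official sample), an NV12 to packed BGR kernel might look like the sketch below. It assumes the Y and interleaved UV planes are already reachable as pitch linear device pointers, for example after the EGL mapping step described earlier, and uses BT.601 full-range coefficients; adjust the color math to match your source.

#include <cuda_runtime.h>

// Illustrative NV12 (pitch linear Y + interleaved UV) -> packed BGR kernel.
__global__ void nv12ToBgrKernel(const unsigned char *srcY, size_t yPitch,
                                const unsigned char *srcUV, size_t uvPitch,
                                unsigned char *dstBgr, size_t bgrPitch,
                                int width, int height) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= width || y >= height) return;

  float Y = srcY[y * yPitch + x];
  const unsigned char *uv = srcUV + (y / 2) * uvPitch + (x / 2) * 2;
  float U = uv[0] - 128.f;
  float V = uv[1] - 128.f;

  // BT.601 full-range YUV -> RGB (swap in your coefficients if needed).
  float B = Y + 1.772f * U;
  float G = Y - 0.344136f * U - 0.714136f * V;
  float R = Y + 1.402f * V;

  unsigned char *out = dstBgr + y * bgrPitch + x * 3;
  out[0] = (unsigned char)fminf(fmaxf(B, 0.f), 255.f);
  out[1] = (unsigned char)fminf(fmaxf(G, 0.f), 255.f);
  out[2] = (unsigned char)fminf(fmaxf(R, 0.f), 255.f);
}

// Launched over the frame, writing directly into the GpuMat's device buffer:
//   dim3 block(32, 8);
//   dim3 grid((width + 31) / 32, (height + 7) / 8);
//   nv12ToBgrKernel<<<grid, block, 0, stream>>>(yPtr, yPitch, uvPtr, uvPitch,
//                                               mat.ptr(), mat.step, width, height);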