Fast copy of DMA buffers via NvBufferTransform

What is the preferred approach to copying one DMA buffer to another?
I am having performance issues when I try to copy one 1920x1080 block linear buffer to another via NvBufferTransform. It takes about 5ms, which is unacceptable for my use case. I could live with 1ms or less. I expected the copy to be very fast, comparable to a CUDA device-to-device copy, but I am not seeing that kind of performance.

I am seeing this on an AGX Xavier in 30W mode, running L4T R32 (release), REVISION: 4.3. However, I am interested in this performance question across all Jetson devices.

A minimal example is below. I removed all the error checking to make it shorter. The copy itself succeeds and the destination buffer gets the right data, but it is too slow (5ms). Using sessions for NvBufferTransform didn’t help.

Do I need some other combination of parameters for NvBufferTransform? Or is there something else I should use to get a fast DMA copy?

#include <chrono>
#include <iostream>
#include <cstring>
#include "nvbuf_utils.h"

// Copy src_fd into dst_fd using a transform that flips nothing.
int dma_copy(int dst_fd, int src_fd)
{
    NvBufferTransformParams transform_params;
    std::memset(&transform_params, 0, sizeof(transform_params));  // zero the unused fields
    transform_params.transform_flag = NVBUFFER_TRANSFORM_FLIP;
    transform_params.transform_flip = NvBufferTransform_None;
    return NvBufferTransform(src_fd, dst_fd, &transform_params);
}

// Allocate a block linear NV12 surface and return its fd through *fd.
int dma_create(int* fd, int width, int height)
{
    NvBufferCreateParams params;
    std::memset(&params, 0, sizeof(params));  // zero the unused fields
    params.width = width;
    params.height = height;
    params.payloadType = NvBufferPayload_SurfArray;
    params.memsize = 0;
    params.layout = NvBufferLayout_BlockLinear;
    params.colorFormat = NvBufferColorFormat_NV12;
    params.nvbuf_tag = NvBufferTag_NONE;
    return NvBufferCreateEx(fd, &params);
}

int dma_destroy(int* fd)
{
    return NvBufferDestroy(*fd);
}

int main()
{
    const int width = 1920;
    const int height = 1080;

    int dma_fd1;
    int dma_fd2;

    dma_create(&dma_fd1, width, height);
    dma_create(&dma_fd2, width, height);

    using milli = std::chrono::milliseconds;
    auto start = std::chrono::high_resolution_clock::now();

    dma_copy(dma_fd2, dma_fd1);

    auto finish = std::chrono::high_resolution_clock::now();

    std::cerr << "nvmpi_copy() took "
              << std::chrono::duration_cast<milli>(finish - start).count() << " ms " << std::endl;

    // Release both buffers before exiting.
    dma_destroy(&dma_fd1);
    dma_destroy(&dma_fd2);
    return 0;
}
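
A note on the measurement itself: the example times a single call and casts the result to whole milliseconds, which truncates toward zero, and the very first call may also include one-time setup cost. A rough sketch of a steadier measurement is below. `mean_time_us` is a hypothetical helper, not part of nvbuf_utils, and the warm-up and iteration counts are arbitrary assumptions.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper (not part of nvbuf_utils): runs `op` `warmup` times
// untimed, then `iterations` times timed, and returns the mean duration
// of one timed call in microseconds.
double mean_time_us(const std::function<void()>& op, int warmup, int iterations)
{
    // Untimed runs, so any one-time setup does not skew the result.
    for (int i = 0; i < warmup; ++i)
        op();

    // steady_clock is monotonic, so it is not affected by wall-clock changes.
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i)
        op();
    const auto finish = clock::now();

    // Floating-point microseconds keep the sub-millisecond part.
    const std::chrono::duration<double, std::micro> total = finish - start;
    return total.count() / iterations;
}
```

With the functions from the example, this could be called as `mean_time_us([&] { dma_copy(dma_fd2, dma_fd1); }, 5, 100)`, which would show whether the 5ms figure is a steady-state cost or mostly first-call overhead.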


Please refer to the topic:

to run VIC at max clock and check again.

If you have multiple threads calling NvBufferTransform() in a single process, please create a session in each thread.