cudaMemcpy is slow on Orin NX

I am using cudaMemcpy in one thread, and it affects the execution of the TensorRT streams of other models running in another thread. I suspect it might be a stream issue. The specific situation is as follows:

  1. In one thread, I use cudaMemcpy and NPP to perform YUV to RGB conversion.

  2. In another thread, there are TensorRT and other deep learning models.

     The problem is that `cudaMemcpy` has to wait for the TensorRT and other deep learning models to finish executing. How can I resolve this?
     I use `cudaMemcpy` during the YUV to RGB conversion, and it becomes slow once the deep learning models have been loaded for inference. Note that the two paths are not directly related in the code: the images used for inference are loaded separately and are not linked to the YUV to RGB conversion (see the stream sketch after this list).
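A likely mechanism (an assumption based on the description, not something confirmed by profiling): a synchronous `cudaMemcpy` is issued on the legacy default stream, which implicitly synchronizes with every blocking stream on the device, so the copy ends up waiting for in-flight TensorRT work. A minimal sketch of keeping the two threads independent by giving each its own non-blocking stream (all names and sizes are illustrative):

```cpp
#include <cuda_runtime.h>

int main()
{
    // One non-blocking stream per thread, so neither serializes against
    // the legacy default stream (or against each other).
    cudaStream_t copyStream, trtStream;
    cudaStreamCreateWithFlags(&copyStream, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&trtStream, cudaStreamNonBlocking);

    // Placeholder buffers standing in for the YUV staging copy.
    const size_t size = 1920 * 1080;
    unsigned char *hSrc = nullptr, *dDst = nullptr;
    cudaHostAlloc((void**)&hSrc, size, cudaHostAllocDefault); // pinned, so the async copy is truly async
    cudaMalloc((void**)&dDst, size);

    // Thread A would do this on copyStream:
    cudaMemcpyAsync(dDst, hSrc, size, cudaMemcpyHostToDevice, copyStream);
    cudaStreamSynchronize(copyStream);

    // Thread B would enqueue TensorRT work on trtStream, e.g.:
    // context->enqueueV2(bindings, trtStream, nullptr);

    cudaFree(dDst);
    cudaFreeHost(hSrc);
    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(trtStream);
    return 0;
}
```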
    


I use cudaMemcpyAsync as well, and the behavior is the same.

Where I use it:

```cpp
virtual void OnImgNotify(const CMediaDataInfo& aDataInfo) override
{
    double post = (double)cv::getTickCount();
    printf("img size: %dx%d, timestamp: %.3f, %d\n", aDataInfo.width, aDataInfo.height, aDataInfo.timestamp, aDataInfo.linesize[0]);
    printf("%d\n", aDataInfo.linesize[1]);

    // Copy out one frame of YUV data. The format is assumed to be yuv420p,
    // and the copy assumes the step equals the image width.
    gMutexYUV.lock();
    nvtxRangePushA("OnImgNotify");
    // COLS = aDataInfo.width;
    // ROWS = aDataInfo.height;
    if (COLS != aDataInfo.width || ROWS != aDataInfo.height)
        std::cout << "decoder output YUV size differs from the init size!\n";
    linesize[0] = aDataInfo.linesize[0];
    linesize[1] = aDataInfo.linesize[1];
    linesize[2] = aDataInfo.linesize[2];

    // Before the memcpy, the buffers must first be explicitly attached back to
    // global scope; otherwise errors occur once the ONNX model has been loaded.
    cudaStreamAttachMemAsync(NULL, gYUV[0], 0, cudaMemAttachGlobal);
    cudaStreamAttachMemAsync(NULL, gYUV[1], 0, cudaMemAttachGlobal);
    cudaStreamAttachMemAsync(NULL, gYUV[2], 0, cudaMemAttachGlobal);

    cudaMemcpyAsync(gYUV[0], aDataInfo.datas[0], COLS * ROWS,     cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(gYUV[1], aDataInfo.datas[1], COLS * ROWS / 4, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(gYUV[2], aDataInfo.datas[2], COLS * ROWS / 4, cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);
    nvtxRangePop();
    gMutexYUV.unlock();

    post = (double)cv::getTickCount() - post;
    std::cout << "OnImgNotify time: " << post * 1000.0 / cv::getTickFrequency() << " ms\n";
    // YUV2BGRNpp();
}
```
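One detail worth noting (an observation about the snippet, not something confirmed by profiling): the `cudaStreamAttachMemAsync(NULL, ...)` calls are enqueued on the default stream, which implicitly synchronizes with other blocking streams, so they alone can make this thread wait on unrelated GPU work. A minimal sketch of ordering the attach on the dedicated `stream` instead (reusing the identifiers from the snippet above):

```cpp
// Sketch: enqueue the attach operations on the dedicated stream rather than
// the default stream (NULL), so they do not implicitly synchronize with other
// work. cudaMemAttachGlobal still makes the memory visible to any stream.
cudaStreamAttachMemAsync(stream, gYUV[0], 0, cudaMemAttachGlobal);
cudaStreamAttachMemAsync(stream, gYUV[1], 0, cudaMemAttachGlobal);
cudaStreamAttachMemAsync(stream, gYUV[2], 0, cudaMemAttachGlobal);

cudaMemcpyAsync(gYUV[0], aDataInfo.datas[0], COLS * ROWS, cudaMemcpyHostToDevice, stream);
// ... remaining copies as above ...
cudaStreamSynchronize(stream);
```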

Where I initialize:

```cpp
void initYuv2bgr(int w, int h)
{
    // The real step is expected to be provided later rather than computed here.
    COLS = w;
    ROWS = h;

    // YUV staging buffers.
    cudaMallocManaged(&gYUV[0], COLS * ROWS);
    cudaMallocManaged(&gYUV[1], COLS * ROWS / 4);
    cudaMallocManaged(&gYUV[2], COLS * ROWS / 4);

    cudaMallocManaged(&manageBGR, COLS * ROWS * 3);
    cudaStreamCreate(&stream);
}
```
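For reference, a pinned-host-plus-device-memory alternative to the managed allocations above (a minimal sketch; the `hYUV` host staging buffers are hypothetical additions):

```cpp
// Sketch: pinned host staging buffers plus plain device buffers instead of
// cudaMallocManaged. Pinned memory makes cudaMemcpyAsync truly asynchronous
// and avoids the unified-memory driver synchronization discussed later.
unsigned char* hYUV[3]; // hypothetical pinned host staging buffers
cudaHostAlloc((void**)&hYUV[0], COLS * ROWS,     cudaHostAllocDefault);
cudaHostAlloc((void**)&hYUV[1], COLS * ROWS / 4, cudaHostAllocDefault);
cudaHostAlloc((void**)&hYUV[2], COLS * ROWS / 4, cudaHostAllocDefault);

cudaMalloc((void**)&gYUV[0], COLS * ROWS);       // device-only buffers
cudaMalloc((void**)&gYUV[1], COLS * ROWS / 4);
cudaMalloc((void**)&gYUV[2], COLS * ROWS / 4);

cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
```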

Hi,

Do you use the same stream to run TensorRT?

Thanks.

I’ve created a new stream elsewhere via a new variable and use it there, but I’m not sure whether they end up being the same; I can’t figure out the ID of a stream or anything like that.
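One quick way to check (a sketch; the variable names are illustrative): `cudaStream_t` is an opaque handle, so printing the pointer values shows whether two variables refer to the same stream.

```cpp
// Sketch: identical printed values mean the two variables are the same stream.
// copyStream / trtStream stand in for the two streams in question.
printf("copy stream: %p, TensorRT stream: %p\n",
       (void*)copyStream, (void*)trtStream);
```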

Hi,

Could you open the ‘CUDA HW (Orin)’ row and share a screenshot with us?
Thanks.

Hi,

I am sure I use different streams.

Hi,

It looks like you are using cudaMallocManaged memory.

You don’t need to do the memcpy manually.
The GPU driver will synchronize this kind of memory for you.
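For example, a minimal sketch of the behavior described above (a toy kernel, not the YUV pipeline):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 256;
    int* data = nullptr;
    cudaMallocManaged((void**)&data, n * sizeof(int));

    for (int i = 0; i < n; ++i) data[i] = i; // CPU writes directly, no memcpy
    addOne<<<1, n>>>(data, n);               // GPU reads/writes the same pointer
    cudaDeviceSynchronize();                 // required before the CPU touches it again
    printf("%d\n", data[0]);                 // CPU reads the result directly

    cudaFree(data);
    return 0;
}
```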

Thanks.

Because I really do need the copy, to get a new frame of my own.

Hi,

So based on the use case, do you only need cudaMalloc memory?
Will the CPU access the gYUV buffer?

Unified memory can sometimes be slower since it stores data on both the CPU and GPU.
It also incurs some overhead since the GPU driver needs to do some underlying synchronization.
But it is beneficial when you read/write the same buffer from both the CPU and GPU frequently.
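If unified memory is kept, one possible way to reduce that overhead is to prefetch the buffer to the processor that will use it next; a minimal sketch (note this requires the device to report `concurrentManagedAccess`, which varies across Jetson generations, so it is checked at runtime here):

```cpp
#include <cuda_runtime.h>

// Sketch: prefetch a managed buffer to the GPU before a burst of GPU work,
// so the driver migrates it up front instead of on demand.
void prefetchToGpuIfSupported(void* managedPtr, size_t bytes, cudaStream_t s)
{
    int device = 0, supported = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&supported, cudaDevAttrConcurrentManagedAccess, device);
    if (supported)
        cudaMemPrefetchAsync(managedPtr, bytes, device, s);
}
```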

Thanks.

In fact, I have already solved this problem with page-locked memory (or UVA), but my question remains: deep learning inference should not affect my copies in other parts of the code when they run on different streams.
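A minimal sketch of that kind of fix (assuming the hypothetical pinned `hYUV` staging buffers and device `gYUV` buffers from the allocation sketch earlier; names are illustrative):

```cpp
// Sketch: stage the decoder output into pinned host memory, then issue a truly
// asynchronous copy to plain device memory on a dedicated non-blocking stream.
memcpy(hYUV[0], aDataInfo.datas[0], COLS * ROWS);
memcpy(hYUV[1], aDataInfo.datas[1], COLS * ROWS / 4);
memcpy(hYUV[2], aDataInfo.datas[2], COLS * ROWS / 4);

cudaMemcpyAsync(gYUV[0], hYUV[0], COLS * ROWS,     cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(gYUV[1], hYUV[1], COLS * ROWS / 4, cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(gYUV[2], hYUV[2], COLS * ROWS / 4, cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);
```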

Hi,

Unified memory actually has two copies, on the CPU and GPU separately.
Although the user doesn’t need to take care of coherence, it does cause some overhead since the GPU driver needs to synchronize it in the background.

A possible reason is that the two use cases need to share the I/O bandwidth.
Thanks.
