cudaMemcpy affects TensorRT streams

I am using cudaMemcpy in one thread, and it affects the execution of TensorRT models running in other threads. I suspect this is due to streams. The specific situation is as follows:

  1. In one thread, I use cudaMemcpy and NPP to perform YUV to RGB conversion.
  2. In another thread, there are TensorRT and other deep learning models.

The problem occurs when calling cudaMemcpy: it has to wait for the TensorRT and other deep learning models to finish executing. How can I resolve this? I suspect the cause is related to streams, but I cannot modify the TensorRT part of the code. Please advise on how to solve this.
P.S. If I could modify the TensorRT part of the code, would there be a better way to solve the problem?

What exactly is the issue here? If you attempt to copy some data that is generated on the device, the copy can only begin if generation is complete.

Is the OnImgNotify range in the bottom row unrelated to the ExecutionContext range in the top row?

I use cudaMemcpy during the YUV to RGB conversion, and I noticed that it becomes slow while the deep learning models are running inference. As I mentioned, the two processes are not directly related in the code: the images used for inference are loaded separately and are not linked to the YUV to RGB conversion.

What kind of cudaMemcpy() is this? Host->device, device->host, device->device? How memory intensive is the code that is executing concurrently with the cudaMemcpy()?

Keep in mind that any concurrent operations share the available GPU memory bandwidth.

I would try to use cudaMemcpyAsync with pinned host memory.
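To illustrate the suggestion, here is a minimal sketch (not the original poster's code) of issuing the copy with cudaMemcpyAsync from pinned host memory on a dedicated non-blocking stream, so the copy does not serialize against the legacy default stream or TensorRT's streams. The 4 MB buffer size is taken from the question.

```cpp
// Sketch: asynchronous H2D copy on its own non-blocking stream,
// using pinned (page-locked) host memory allocated with cudaMallocHost.
#include <cuda_runtime.h>
#include <cstddef>

int main() {
    const size_t bytes = 4 * 1024 * 1024;   // ~4 MB frame, as in the question

    unsigned char *h_pinned = nullptr;
    unsigned char *d_buf    = nullptr;
    cudaMallocHost(&h_pinned, bytes);       // pinned host memory, required for true async copies
    cudaMalloc(&d_buf, bytes);

    // Non-blocking stream: does not synchronize with the legacy default stream.
    cudaStream_t copyStream;
    cudaStreamCreateWithFlags(&copyStream, cudaStreamNonBlocking);

    // Returns immediately; the copy can overlap with work in other streams.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, copyStream);

    // ... launch the NPP color-conversion work on copyStream here ...

    cudaStreamSynchronize(copyStream);      // wait only for this stream's work

    cudaStreamDestroy(copyStream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```

For the NPP conversion step to stay on the same stream, the NPP calls would also need to be bound to copyStream (e.g. via an NppStreamContext); otherwise they run on NPP's current default stream.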

I concur with @striker159. All the more so since the work has apparently already been split across multiple streams. Generally speaking, efficient overlap requires asynchronous operations in all CUDA streams.

I took cudaMemcpy() to be a generic reference to a copy operation, but enlarging the diagram shows it actually says cudaMemcpy(), not cudaMemcpyAsync().

I am using a Jetson Orin NX. The size is about 4 MB. I tried to use managed memory and cudaMemcpyAsync, but the result is still the same. The specific code is as follows.

The initialization is as follows:

Please do not post code as images.

With pinned memory I was referring to memory allocated with cudaMallocHost.

With cudaMallocManaged, one typically uses cudaMemPrefetchAsync, not cudaMemcpyAsync
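A minimal sketch of that pattern (illustrative only, not the poster's code): with managed memory, the data is migrated ahead of use with cudaMemPrefetchAsync rather than copied with cudaMemcpyAsync.

```cpp
// Sketch: prefetch managed memory to the GPU on a non-blocking stream
// instead of copying it with cudaMemcpyAsync.
#include <cuda_runtime.h>
#include <cstddef>

int main() {
    const size_t bytes = 4 * 1024 * 1024;   // ~4 MB, as in the question

    unsigned char *managed = nullptr;
    cudaMallocManaged(&managed, bytes);

    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    int device = 0;
    cudaGetDevice(&device);

    // ... fill `managed` on the host ...

    // Migrate the pages to the GPU asynchronously on `stream`,
    // so kernels launched on `stream` afterwards don't fault the pages in.
    cudaMemPrefetchAsync(managed, bytes, device, stream);

    // ... launch kernels that read `managed` on `stream` ...

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(managed);
    return 0;
}
```

One caveat for this thread: on Jetson devices the CPU and GPU share the same physical memory, so managed-memory behavior (and the benefit of prefetching) differs from discrete GPUs; the Jetson documentation should be consulted for the specifics.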

All of NVIDIA’s embedded platforms use the same physical memory to support both CPU and GPU, correct?

If so, you are effectively shuffling data from one location to a different location within the same physical memory. And the bandwidth of that physical memory is probably quite low, 100 GB/sec maybe? Any bandwidth used for the data copying is going to reduce the bandwidth available to concurrently running code, which could cause that code to run more slowly.

I am not familiar with NVIDIA’s embedded platforms and their performance characteristics. Questions regarding these platforms are best asked in the sub-forums dedicated to them, because the people with experience in performance tuning for them frequent those forums, so better / faster answers are likely. In this case:

Thanks, I will do that.