I am using cudaMemcpy in one thread, and it affects the execution of TensorRT streams belonging to other models. I suspect the cause is related to streams. The specific situation is as follows:
In one thread, I use cudaMemcpy and NPP to perform YUV to RGB conversion.
Another thread runs TensorRT and other deep-learning models.
The problem is that the cudaMemcpy call has to wait for the TensorRT and other deep-learning models to finish executing. I believe the cause is related to streams, but I cannot modify the TensorRT part of the code. Please advise on how to solve this.
PS: If I could change the TensorRT part of the code, would there be a better way to solve the problem?
I use cudaMemcpy during the YUV-to-RGB conversion, and I noticed that it becomes slow once the deep-learning models are loaded and running inference. Note also that the two processes are not directly related in the code: the images used for inference are loaded separately and are not linked to the YUV-to-RGB conversion.
I concur with @striker159. All the more so since the work has apparently already been split across multiple streams. Generally speaking, efficient overlap requires asynchronous operations in all CUDA streams.
I took cudaMemcpy() to be a generic reference to a copy operation, but on enlarging the diagram, it actually says cudaMemcpy(), not cudaMemcpyAsync().
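That distinction matters here: cudaMemcpy() issues into the legacy default stream, which synchronizes with other blocking streams, and it blocks the host until the copy completes. One option that does not require touching the TensorRT code is to put the conversion thread's copies on its own non-blocking stream with cudaMemcpyAsync() plus pinned host memory. A minimal sketch (buffer names and frame size are placeholders, not from the original post):

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                                 \
  do {                                                              \
    cudaError_t err = (call);                                       \
    if (err != cudaSuccess) {                                       \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
              cudaGetErrorString(err), __FILE__, __LINE__);         \
      exit(EXIT_FAILURE);                                           \
    }                                                               \
  } while (0)

int main() {
  const size_t frameBytes = 1920 * 1080 * 3;  // placeholder frame size

  // A non-blocking stream does not synchronize with the legacy
  // default stream, so it will not wait on TensorRT's work.
  cudaStream_t convStream;
  CHECK(cudaStreamCreateWithFlags(&convStream, cudaStreamNonBlocking));

  // Pinned (page-locked) host memory is required for the copy to
  // actually overlap with work in other streams.
  unsigned char* hYuv = nullptr;
  unsigned char* dYuv = nullptr;
  CHECK(cudaHostAlloc(&hYuv, frameBytes, cudaHostAllocDefault));
  CHECK(cudaMalloc(&dYuv, frameBytes));

  // Asynchronous copy on the conversion thread's own stream.
  CHECK(cudaMemcpyAsync(dYuv, hYuv, frameBytes,
                        cudaMemcpyHostToDevice, convStream));

  // ... launch the NPP YUV->RGB conversion on convStream here
  // (NPP takes a stream via NppStreamContext, or the older
  // nppSetStream()) ...

  // Wait only for this stream, not the whole device.
  CHECK(cudaStreamSynchronize(convStream));

  CHECK(cudaFree(dYuv));
  CHECK(cudaFreeHost(hYuv));
  CHECK(cudaStreamDestroy(convStream));
  return 0;
}
```

If recompiling that translation unit is an option, building with `nvcc --default-stream per-thread` also gives each host thread its own default stream, which removes the implicit synchronization with the legacy default stream without source changes.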
All of NVIDIA’s embedded platforms use the same physical memory to support both CPU and GPU, correct?
If so, you are effectively shuffling data from one location to a different location within the same physical memory. And the bandwidth of that physical memory is probably quite low, 100 GB/sec maybe? Any bandwidth used for the data copying is going to reduce the bandwidth available to concurrently running code, which could cause that code to run more slowly.
I am not familiar with NVIDIA’s embedded platforms and their performance characteristics. Questions regarding these platforms are best asked in the sub-forums dedicated to them, because the people with experience in performance tuning for them frequent those forums, so better / faster answers are likely. In this case: