D2D copy issue in Triton

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): GPU
• DeepStream Version: 7.0
• JetPack Version (valid for Jetson only)
• TensorRT Version
• NVIDIA GPU Driver Version (valid for GPU only): L4
• Issue Type (questions, new requirements, bugs)
• How to reproduce the issue?
We noticed that in Triton Inference Server, no CUDA kernels are running while a Device-to-Device (D2D) memory copy is taking place. However, in DeepStream, CUDA kernels can run concurrently with Device-to-Device memory transfers. Why does this difference occur between Triton and DeepStream?

Since Triton is open source, you can check the code. Could you elaborate on “CUDA kernels can run concurrently with Device-to-Device memory transfers”? What kind of CUDA kernels? In which DeepStream version did you observe this?

Below is the DeepStream profile, where the D2D memcpy and the ConvertNv12BLtoNV12 kernel are running concurrently.

Whereas in Triton, no CUDA kernels are running while the D2D memcpy is happening, as shown below.

It seems you are testing tritonserver, so this issue is outside of DeepStream. Could you ask this in the Triton forum? Thanks!


From the screenshot, there are already CUDA kernels running while the D2D memcpy is in progress. You can search the Triton code for the function name.
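
For anyone comparing the two traces, here is a minimal sketch of the general CUDA behaviour in question. It is not taken from the Triton or DeepStream code. In plain CUDA, a D2D copy and a kernel can overlap only if they are issued on different streams that do not implicitly synchronize with each other (and the GPU has free resources); work issued on the legacy default stream serializes against other blocking streams. Whether this is what differs between the two applications is an assumption that would have to be confirmed in their sources. The kernel `busy_kernel`, the buffer sizes, and the stream names are all hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Abort main() with a message if a CUDA runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                         \
            return 1;                                                  \
        }                                                              \
    } while (0)

// Hypothetical compute-bound kernel, used only to have something
// visible on the timeline next to the memcpy.
__global__ void busy_kernel(float *data, size_t n, int iters) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) {
            v = v * 1.0001f + 0.5f;
        }
        data[i] = v;
    }
}

int main() {
    const size_t copy_bytes = (size_t)256 << 20;  // 256 MiB D2D copy
    const size_t n = (size_t)1 << 20;             // elements for the kernel

    float *src = nullptr, *dst = nullptr, *work = nullptr;
    CHECK(cudaMalloc(&src, copy_bytes));
    CHECK(cudaMalloc(&dst, copy_bytes));
    CHECK(cudaMalloc(&work, n * sizeof(float)));
    CHECK(cudaMemset(src, 0, copy_bytes));
    CHECK(cudaMemset(work, 0, n * sizeof(float)));

    // Two non-blocking streams: neither synchronizes with the legacy
    // default stream, so the copy and the kernel are allowed to overlap.
    cudaStream_t copy_stream, compute_stream;
    CHECK(cudaStreamCreateWithFlags(&copy_stream, cudaStreamNonBlocking));
    CHECK(cudaStreamCreateWithFlags(&compute_stream, cudaStreamNonBlocking));

    // Asynchronous device-to-device copy on one stream...
    CHECK(cudaMemcpyAsync(dst, src, copy_bytes,
                          cudaMemcpyDeviceToDevice, copy_stream));

    // ...and an independent kernel on the other stream.
    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);
    busy_kernel<<<blocks, threads, 0, compute_stream>>>(work, n, 10000);
    CHECK(cudaGetLastError());

    CHECK(cudaDeviceSynchronize());

    CHECK(cudaStreamDestroy(copy_stream));
    CHECK(cudaStreamDestroy(compute_stream));
    CHECK(cudaFree(src));
    CHECK(cudaFree(dst));
    CHECK(cudaFree(work));
    printf("done\n");
    return 0;
}
```

Building this with `nvcc` and profiling the binary under Nsight Systems (`nsys profile`) should show whether the memcpy and the kernel overlap on your GPU. Issuing both operations on the same stream instead forces them to run one after the other, which is one pattern to look for when reading the application code.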