Task on DLA blocks memory access


We have a program that runs multiple TensorRT models concurrently. Each engine is on its own stream (created with cudaStreamNonBlocking), every cudaMemcpy is a cudaMemcpyAsync, and every H->D or D->H transfer uses pinned host memory. When all the models are placed on the GPU, this works correctly and we observe nicely concurrent operation.
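For reference, a minimal sketch of the per-engine setup described above (names and sizes are illustrative, not our actual code): a non-blocking stream per engine, pinned host buffers, and async copies queued on that stream.

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;  // illustrative buffer size

    // One stream per engine; non-blocking w.r.t. the default stream.
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    // Pinned host memory so cudaMemcpyAsync is truly asynchronous.
    float* hostBuf = nullptr;
    cudaHostAlloc(&hostBuf, bytes, cudaHostAllocDefault);

    float* devBuf = nullptr;
    cudaMalloc(&devBuf, bytes);

    // H->D copy queued on this engine's stream; returns immediately.
    cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice, stream);
    // ... enqueue inference on the same stream here ...
    cudaMemcpyAsync(hostBuf, devBuf, bytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);

    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    cudaStreamDestroy(stream);
    return 0;
}
```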

However, when one of the engines has its layers placed on the DLA, memory transfers that shouldn’t block suddenly do. Any cudaMemcpy, async or not, pinned or not, blocks while the DLA task is running. As soon as the DLA finishes its batch, the memcpy goes through. This makes the DLA unusable with our program.
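In case it matters, the DLA engine is built roughly like this (hedged sketch using the TensorRT C++ builder API; the core index and GPU-fallback flag are assumptions about our config, not necessarily the trigger):

```cpp
#include <NvInfer.h>

// Route layers to the DLA by default, falling back to the GPU for
// layers the DLA doesn't support.
void configureForDLA(nvinfer1::IBuilderConfig* config) {
    config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
    config->setDLACore(0);  // assumed core index
    config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);
}
```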

If this is a known issue, or I’m missing something, please let me know. If it doesn’t sound familiar, I can write up a minimal program to reproduce the issue and post Nsight Systems logs.



Yes, it would be good if you could share a sample with us.

Oh jeez.

In the course of writing a minimal POC, I discovered that recording events (which Nsight Systems does) can serialize stream execution: https://devtalk.nvidia.com/default/topic/471043/cuda-programming-and-performance/stream-concurrency-or-lack-thereof-on-gtx-480/post/3349391/#3349391
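For anyone hitting the same thing: the gotcha is that default CUDA events carry timing information, and on some devices recording a timing event can introduce cross-stream serialization (my reading of the linked thread; whether it applies to a given GPU is an assumption). Events created with cudaEventDisableTiming are purely for ordering/synchronization:

```cpp
#include <cuda_runtime.h>

int main() {
    cudaEvent_t timed, untimed;

    // Default event: records timing, and on some hardware can
    // serialize otherwise-concurrent streams.
    cudaEventCreate(&timed);

    // Timing disabled: lighter weight, intended purely for
    // synchronization, avoids the timing dependency.
    cudaEventCreateWithFlags(&untimed, cudaEventDisableTiming);

    cudaEventDestroy(timed);
    cudaEventDestroy(untimed);
    return 0;
}
```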

Works fine when I don’t run the profiler.

How many hours of my life down the drain…?