We have a program that runs multiple TensorRT models concurrently. Each engine runs on its own stream (created with cudaStreamNonBlocking), every transfer is a cudaMemcpyAsync, and every H->D or D->H transfer goes through pinned host memory. When all of the models are placed on the GPU, this works correctly and we observe nicely concurrent execution.
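For reference, our per-engine setup looks roughly like the sketch below (a simplified illustration, not our actual code; `context`, `devIn`, `devOut`, and `bufSize` are placeholders, and we use enqueueV3 as in recent TensorRT versions):

```cuda
cudaStream_t stream;
// Non-blocking stream so this engine does not synchronize with the legacy default stream
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

// Pinned host buffer so the async copies are truly asynchronous
void* hostBuf = nullptr;
cudaHostAlloc(&hostBuf, bufSize, cudaHostAllocDefault);

// All transfers and inference are enqueued on the engine's own stream
cudaMemcpyAsync(devIn, hostBuf, bufSize, cudaMemcpyHostToDevice, stream);
context->enqueueV3(stream);  // TensorRT inference
cudaMemcpyAsync(hostBuf, devOut, bufSize, cudaMemcpyDeviceToHost, stream);
```

With one such stream per engine, the copies and kernels from different engines overlap as expected.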
However, when one of the engines has its layers placed on the DLA, memory transfers that shouldn't block suddenly do. Any cudaMemcpy, async or not, pinned or not, blocks while the DLA task is running; as soon as the DLA finishes its batch, the memcpy goes through. This makes the DLA unusable with our program.
If this is a known issue, or I'm missing something, please let me know. If it doesn't sound familiar, I can write up a minimal program that reproduces the issue and post Nsight Systems traces.