Why can't enqueueV2() calls in different threads execute concurrently?

We have 3 TensorRT models that run inference on the same image input. All 3 inference outputs are needed simultaneously for the next processing stage, so each model is loaded in a different thread and has its own engine and execution context.
We find that the total time for concurrent enqueueV2() calls in the 3 threads is equal to that of sequential enqueueV2() calls for the 3 models in a single thread. It seems that multi-threading does not improve performance. Why?

The application runs in a Docker container.

Hardware: RTX 3090
CUDA: 11.0
TensorRT: 8.0.4
OS: Ubuntu 18.04
Docker: 19.03

Moved to TensorRT forum.


Could you please share a minimal reproduction script/model with us so we can try it on our end for better debugging?

Thank you.