We have 3 TensorRT models that run inference on the same image input. The 3 inference outputs are needed simultaneously for the next processing stage, so each model is loaded in a different thread and has its own engine and execution context.
We find that the total time of concurrent enqueueV2() calls across the 3 threads is equal to that of sequential enqueueV2() calls for the 3 models in one thread. It seems that multi-threading does not improve performance. Why?
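For reference, here is a minimal sketch of the per-thread setup described above (not our exact code; the engine file names, the runInference helper, and the binding buffers are illustrative placeholders). Each thread deserializes its own engine, creates its own execution context and CUDA stream, and calls enqueueV2():

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <cstdio>
#include <fstream>
#include <thread>
#include <vector>

using namespace nvinfer1;

class Logger : public ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) printf("%s\n", msg);
    }
} gLogger;

// Each thread owns its own engine, context, and CUDA stream.
void runInference(const char* enginePath, void** bindings) {
    // Load the serialized engine from disk.
    std::ifstream file(enginePath, std::ios::binary | std::ios::ate);
    size_t size = file.tellg();
    file.seekg(0);
    std::vector<char> blob(size);
    file.read(blob.data(), size);

    IRuntime* runtime = createInferRuntime(gLogger);
    ICudaEngine* engine = runtime->deserializeCudaEngine(blob.data(), size);
    IExecutionContext* context = engine->createExecutionContext();

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // enqueueV2 is asynchronous; since each thread uses its own stream,
    // we would expect the kernels of the three models to overlap on the GPU.
    context->enqueueV2(bindings, stream, nullptr);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    delete context;
    delete engine;
    delete runtime;
}

int main() {
    // Hypothetical engine paths; bindings must point to pre-allocated
    // device buffers for each model's input and output.
    const char* paths[3] = {"model0.plan", "model1.plan", "model2.plan"};
    void* bindings[3][2] = {};  // fill with cudaMalloc'd pointers

    std::thread workers[3];
    for (int i = 0; i < 3; ++i)
        workers[i] = std::thread(runInference, paths[i], bindings[i]);
    for (auto& w : workers) w.join();
    return 0;
}
```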
The application runs in a Docker container.
Hardware: RTX 3090
CUDA: 11.0
TensorRT: 8.0.4
OS: Ubuntu 18.04
Docker: 19.03