I want to run multiple inference models in parallel on the same GPU (a 2080 Ti in this case). Specifically, I create two host threads, each with its own stream and execution context, so that two different models run asynchronously. However, when profiling with nvvp, I noticed that most of the layers do not overlap, so the performance gain is minor. I understand that this behavior is documented here: https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html#streaming
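As a rough illustration of why most layers may fail to overlap: a kernel from a second stream can only run concurrently if the first kernel leaves some SM capacity unused. The sketch below is a back-of-the-envelope estimate, not profiler output. It assumes the published limits for a 2080 Ti (68 SMs, 1024 threads and 16 resident blocks per SM), ignores register and shared-memory limits, and uses made-up launch shapes; the names `GpuLimits`, `resident_block_capacity` and `spare_blocks` are hypothetical helpers, not part of any NVIDIA API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuLimits:
    # Assumed values for an RTX 2080 Ti (Turing, compute capability 7.5).
    sm_count: int = 68
    max_threads_per_sm: int = 1024
    max_blocks_per_sm: int = 16


def resident_block_capacity(gpu: GpuLimits, threads_per_block: int) -> int:
    """Upper bound on thread blocks resident on the whole GPU at once.

    Only the thread and block limits per SM are considered; register and
    shared-memory usage can lower the real number further.
    """
    if not 1 <= threads_per_block <= gpu.max_threads_per_sm:
        raise ValueError("threads_per_block out of range")
    blocks_per_sm = min(gpu.max_threads_per_sm // threads_per_block,
                        gpu.max_blocks_per_sm)
    return blocks_per_sm * gpu.sm_count


def spare_blocks(gpu: GpuLimits, grid_blocks: int, threads_per_block: int) -> int:
    """Block slots (of the same block size) left for a kernel from another stream."""
    return max(0, resident_block_capacity(gpu, threads_per_block) - grid_blocks)


if __name__ == "__main__":
    gpu = GpuLimits()
    # Hypothetical launch shapes, chosen only to contrast a large and a small layer.
    for name, grid_blocks, threads_per_block in [
        ("large layer", 4096, 256),
        ("small layer", 64, 256),
    ]:
        spare = spare_blocks(gpu, grid_blocks, threads_per_block)
        print(f"{name}: {spare} block slots left for a second stream")
```

Under these assumptions, a layer launched with thousands of blocks leaves no room for a second stream, while a small layer leaves most of the GPU free, which would match the pattern where only a few layers overlap.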
So my question is: is there a way to get more of the layers to overlap, even at the cost of a longer run time for each individual network?
Any hints would be helpful.