I am using the TensorRT C++ API on a Jetson Xavier, and I keep a separate execution context and CUDA stream per model so that I can run the models in parallel from multiple threads. However, performance is actually worse than serial execution: invoking 8 MobileNetV2 models from 8 threads takes about 160 ms on average, while invoking them serially takes about 110 ms. It looks like the threads enqueue the models concurrently, but the streams do not actually run in parallel on the GPU. I also tried the flags from this blog post (https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/), but the results are similar.
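For reference, this is roughly what each worker thread does. It is a simplified sketch, not my exact code: `engine` and `bindings` stand in for my real deserialized engine and pre-allocated device buffers, and I am assuming TensorRT 7-era APIs (`enqueueV2`) as shipped in JetPack for Xavier.

```cpp
#include <NvInfer.h>
#include <cuda_runtime.h>

// Sketch of one worker thread. `engine` and `bindings` are placeholders
// for the shared deserialized engine and this thread's own device buffers.
void worker(nvinfer1::ICudaEngine* engine, void** bindings)
{
    // One execution context and one stream per thread; nothing is shared
    // between threads except the engine itself.
    nvinfer1::IExecutionContext* ctx = engine->createExecutionContext();

    cudaStream_t stream;
    // Non-blocking flag, per the CUDA 7 streams blog post, so the stream
    // does not synchronize with the legacy default stream.
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    // Asynchronous inference launch on this thread's stream.
    ctx->enqueueV2(bindings, stream, nullptr);

    // Wait only on this thread's own stream.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    ctx->destroy();
}
```

Each of the 8 threads runs this once per invocation, so in theory the 8 `enqueueV2` calls should overlap on the GPU.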
The nvvp profiler timelines are below.
Serial invocation of 8 models:
Threaded invocation with 8 threads: