Invoking TensorRT models on Jetson Xavier with threads performs slower than invoking in a serial manner

I am using the TensorRT C++ API and kept a separate runtime context and CUDA stream per model so that the models can run in parallel with threads on a Jetson Xavier. However, the performance is actually slower than what I achieved with serial execution: invoking 8 MobileNetV2 models with threads took 160 ms on average, while serial execution took 110 ms. I think that with threads the models are being invoked concurrently, but the streams are not actually running in parallel. I also tried the flags from the blog, but the results are similar.

NVVP profiler results are as follows.

Serial invocation of 8 models:

Threaded invocation with 8 threads:


Would you mind checking the GPU resources required by a single model first?
The GPU utilization can be found with tegrastats:

$ sudo tegrastats

Please note that the Xavier NX has limited GPU resources.
If the models together require more than 99% of the GPU, they have to wait for the resource in turn.
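As a quick way to check this, the GPU load shows up in each tegrastats status line as a `GR3D_FREQ <pct>%` field. A minimal parsing sketch, assuming that field format (it can vary slightly between L4T releases; the sample line below is illustrative, not real output):

```python
import re
from typing import Optional

def gpu_utilization(tegrastats_line: str) -> Optional[int]:
    """Extract the GR3D_FREQ percentage (GPU load) from one tegrastats line."""
    match = re.search(r"GR3D_FREQ (\d+)%", tegrastats_line)
    return int(match.group(1)) if match else None

# Illustrative tegrastats line (truncated); a real line has many more fields.
sample = "RAM 4722/7772MB GR3D_FREQ 99%@1377 EMC_FREQ 12%@1600"
print(gpu_utilization(sample))  # -> 99
```

If one model alone already drives GR3D_FREQ close to 99%, adding more concurrent streams cannot help, since the kernels serialize on the GPU anyway.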