Invoking TensorRT models on Jetson Xavier with threads performs slower than invoking them serially

I am using the TensorRT C++ API, and I keep a separate execution context and CUDA stream per thread so that the models can run in parallel on a Jetson Xavier. But the performance is actually slower than what I achieved with serial execution: invoking 8 MobileNetV2 models with threads took 160 ms on average, while serial execution took 110 ms. I think that with threads the models are being enqueued concurrently, but the streams are not actually running in parallel. I also tried the flags from the blog post https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/ but the results are similar.
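For context, my per-thread setup looks roughly like the sketch below (simplified and illustrative, not my actual code: engine deserialization, binding buffer allocation, and error checking are omitted, and names like `runModel` are placeholders):

```cpp
#include <thread>
#include <vector>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

// One execution context and one CUDA stream per thread, sharing one engine.
// `engine` and `buffers` are assumed to be set up elsewhere.
void runModel(nvinfer1::ICudaEngine* engine, void** buffers) {
    // Each thread owns its own context and stream so enqueues can overlap.
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous enqueue; kernels from different streams may overlap,
    // but only if the GPU has spare SMs available.
    context->enqueueV2(buffers, stream, nullptr);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    context->destroy();
}

void invokeAllModels(nvinfer1::ICudaEngine* engine,
                     std::vector<void**>& perModelBuffers) {
    std::vector<std::thread> workers;
    for (auto* buffers : perModelBuffers)
        workers.emplace_back(runModel, engine, buffers);
    for (auto& t : workers)
        t.join();
}
```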

The nvvp profiler results are as follows.
Serial invocation of 8 models:


Threaded invocation with 8 threads:

Hi,

Would you mind checking the GPU resources required by one model first?
The GPU utilization can be found with tegrastats:

$ sudo tegrastats

Please note that the Xavier NX has limited GPU resources.
If the models together require more than 99% of the GPU, they have to wait for resources in turn.

Thanks.