Concurrent TensorRT engines

Description

I have two TensorRT engines, where the output of the first engine goes through a CPU post-process and then feeds into the second engine.
While the second engine is running, a new stream pushes an inference to the first engine, and so on.
As you can see in the nvvp profiling viewer, the first engine's kernels occupy the context and do not let the second engine's kernels kick in. (I've added time labels for each model, shown in green.)

When the first model runs solo, I measured 23 ms of inference.
When the second model runs solo, I measured 7 ms of inference.
Running them both together, each comes out to about 60 ms.
The question is: how can I optimize both engines at build time so that they run concurrently, and avoid those huge gaps in the lightweight engine?
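
For reference, here is a minimal sketch of the pipeline described above, assuming enqueueV2-based inference with each engine on its own CUDA stream. The context, binding, and post-process names are placeholders, not the actual code; engine deserialization and buffer allocation are omitted.

#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Illustrative only: two execution contexts, each enqueued on its own stream.
// Engine deserialization, buffer allocation and the CPU post-process are
// application specific and stubbed out here.
void runPipeline(nvinfer1::IExecutionContext* ctxHeavy,   // first (23 ms) engine
                 nvinfer1::IExecutionContext* ctxLight,   // second (7 ms) engine
                 void** bindingsHeavy, void** bindingsLight)
{
    cudaStream_t streamHeavy, streamLight;
    cudaStreamCreate(&streamHeavy);
    cudaStreamCreate(&streamLight);

    // First engine: enqueue, then wait so the CPU post-process can read its output.
    ctxHeavy->enqueueV2(bindingsHeavy, streamHeavy, nullptr);
    cudaStreamSynchronize(streamHeavy);

    // CPU post-process between the two engines (placeholder).
    // postProcess(bindingsHeavy, bindingsLight);

    // Second engine on its own stream; meanwhile the next frame can already be
    // enqueued to the first engine from another thread on its own stream.
    ctxLight->enqueueV2(bindingsLight, streamLight, nullptr);
    cudaStreamSynchronize(streamLight);

    cudaStreamDestroy(streamHeavy);
    cudaStreamDestroy(streamLight);
}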

Environment

Both engines were built separately on a Jetson Xavier NX running Ubuntu 18.04 (aarch64).

ii  libnvinfer-plugin7                   7.1.3-1+cuda10.2                    arm64        TensorRT plugin libraries
ii  libnvinfer7                          7.1.3-1+cuda10.2                    arm64        TensorRT runtime libraries

nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2019 NVIDIA Corporation
Release version 10.2.89 (21)

Hi,

The links below might be useful to you.
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#thread-safety

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-priorities

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html
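
As a hedged illustration of the stream-priorities link above (not taken from your code): the lightweight engine can be enqueued on a higher-priority stream than the heavy engine, so its kernels are scheduled first whenever both streams have work queued. In CUDA, numerically lower values mean higher priority.

#include <cuda_runtime_api.h>

// Sketch: a high-priority stream for the lightweight (7 ms) engine and a
// low-priority stream for the heavy (23 ms) engine.
void createPrioritizedStreams(cudaStream_t& streamHeavy, cudaStream_t& streamLight)
{
    int leastPriority = 0, greatestPriority = 0;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    // Heavy engine: lowest priority, non-blocking so it does not synchronize
    // with the default stream.
    cudaStreamCreateWithPriority(&streamHeavy, cudaStreamNonBlocking, leastPriority);

    // Lightweight engine: highest priority, so its kernels are preferred
    // whenever both engines have work queued.
    cudaStreamCreateWithPriority(&streamLight, cudaStreamNonBlocking, greatestPriority);
}

// Each thread then enqueues with its own execution context, e.g.
//   ctxLight->enqueueV2(bindingsLight, streamLight, nullptr);
//   ctxHeavy->enqueueV2(bindingsHeavy, streamHeavy, nullptr);

Note that each thread needs its own IExecutionContext (per the thread-safety link above; the engine itself can be shared), and on a single GPU stream priorities interleave the kernels rather than guarantee true parallel execution.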

For multi-threading/streaming, we suggest using DeepStream or Triton Inference Server.

For more details, we recommend raising the query in the DeepStream forum

or

opening an issue in the Triton Inference Server GitHub repository.

Thanks!