Description
I have two TensorRT engines; the output of the first engine goes through a CPU post-process and then feeds into the second engine.
While the second engine is running, a new stream pushes the next inference through the first engine, and so on.
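For reference, the pipelining is driven roughly like the sketch below. This is only a minimal sketch of the setup described above, not the actual code: the binding arrays and the CPU post-process call are placeholders.

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Sketch: each engine gets its own execution context and its own CUDA stream.
// bindings1/bindings2 are assumed to be pre-allocated device buffer arrays.
void runPipeline(nvinfer1::ICudaEngine* engine1, nvinfer1::ICudaEngine* engine2,
                 void** bindings1, void** bindings2)
{
    nvinfer1::IExecutionContext* ctx1 = engine1->createExecutionContext();
    nvinfer1::IExecutionContext* ctx2 = engine2->createExecutionContext();

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    // Frame N: heavy engine on stream1.
    ctx1->enqueueV2(bindings1, stream1, nullptr);
    cudaStreamSynchronize(stream1);

    // CPU post-process between the two engines (placeholder).
    // postProcessOnCpu(bindings1, bindings2);

    // Frame N: lightweight engine on stream2, while frame N+1 is already
    // being pushed to the first engine on stream1.
    ctx2->enqueueV2(bindings2, stream2, nullptr);
    ctx1->enqueueV2(bindings1, stream1, nullptr);

    cudaStreamSynchronize(stream2);
    cudaStreamSynchronize(stream1);

    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    ctx1->destroy();
    ctx2->destroy();
}
```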
As you can see in the NVVP profiling timeline, the first engine's kernels occupy the context and do not let the second engine's kernels kick in. (I've added time labels for each model, shown in green.)
When the first model runs solo, I measured 23 ms of inference time.
When the second model runs solo, I measured 7 ms of inference time.
Running them both together, each comes out to about 60 ms.
The question is: how can I optimize both engines at build time so that they run concurrently, and avoid those huge gaps in the lightweight engine's timeline?
Environment
Both engines were built separately on a Jetson Xavier NX running Ubuntu 18.04 (aarch64).
ii libnvinfer-plugin7 7.1.3-1+cuda10.2 arm64 TensorRT plugin libraries
ii libnvinfer7 7.1.3-1+cuda10.2 arm64 TensorRT runtime libraries
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2019 NVIDIA Corporation
Release version 10.2.89 (21)