Concurrent TensorRT engines

Description

I have two TensorRT engines, where the output of the first engine goes through a CPU post-process and then feeds into the second engine.
While the second engine is running, a new stream pushes an inference to the first engine, and so on.
As you can see in the nvvp profiling viewer, the first engine's kernels occupy the context and do not let the second engine's kernels kick in. (I've added time labels for each model, shown in green.)

When the first model runs solo, I measured 23 ms of inference.
When the second model runs solo, I measured 7 ms of inference.
Running them both together, each comes out to about 60 ms.
The question is: how can I optimize both engines at build time so that they run concurrently, and avoid those huge gaps in the lightweight engine?
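
For reference, here is a minimal sketch of the pipeline described above, assuming enqueueV2-based inference with each engine on its own CUDA stream. The context, binding, and post-process names are placeholders, not the actual code; engine deserialization and buffer allocation are omitted.

#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Illustrative only: two execution contexts, each enqueued on its own stream.
// Engine deserialization, buffer allocation and the CPU post-process are
// application specific and stubbed out here.
void runPipeline(nvinfer1::IExecutionContext* ctxHeavy,   // first (23 ms) engine
                 nvinfer1::IExecutionContext* ctxLight,   // second (7 ms) engine
                 void** bindingsHeavy, void** bindingsLight)
{
    cudaStream_t streamHeavy, streamLight;
    cudaStreamCreate(&streamHeavy);
    cudaStreamCreate(&streamLight);

    // First engine: enqueue, then wait so the CPU post-process can read its output.
    ctxHeavy->enqueueV2(bindingsHeavy, streamHeavy, nullptr);
    cudaStreamSynchronize(streamHeavy);

    // CPU post-process between the two engines (placeholder).
    // postProcess(bindingsHeavy, bindingsLight);

    // Second engine on its own stream; meanwhile the next frame can already be
    // enqueued to the first engine from another thread on its own stream.
    ctxLight->enqueueV2(bindingsLight, streamLight, nullptr);
    cudaStreamSynchronize(streamLight);

    cudaStreamDestroy(streamHeavy);
    cudaStreamDestroy(streamLight);
}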

Environment

Both engines were built separately on a Jetson Xavier NX running Ubuntu 18.04 (aarch64).

ii  libnvinfer-plugin7                   7.1.3-1+cuda10.2                    arm64        TensorRT plugin libraries
ii  libnvinfer7                          7.1.3-1+cuda10.2                    arm64        TensorRT runtime libraries

nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2019 NVIDIA Corporation
Release version 10.2.89 (21)

Hi,

The links below might be useful to you.
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#thread-safety

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-priorities

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html
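
As a hedged illustration of the stream-priorities link above (not taken from your code): the lightweight engine can be enqueued on a higher-priority stream than the heavy engine, so its kernels are scheduled first whenever both streams have work queued. In CUDA, numerically lower values mean higher priority.

#include <cuda_runtime_api.h>

// Sketch: a high-priority stream for the lightweight (7 ms) engine and a
// low-priority stream for the heavy (23 ms) engine.
void createPrioritizedStreams(cudaStream_t& streamHeavy, cudaStream_t& streamLight)
{
    int leastPriority = 0, greatestPriority = 0;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    // Heavy engine: lowest priority, non-blocking so it does not synchronize
    // with the default stream.
    cudaStreamCreateWithPriority(&streamHeavy, cudaStreamNonBlocking, leastPriority);

    // Lightweight engine: highest priority, so its kernels are preferred
    // whenever both engines have work queued.
    cudaStreamCreateWithPriority(&streamLight, cudaStreamNonBlocking, greatestPriority);
}

// Each thread then enqueues with its own execution context, e.g.
//   ctxLight->enqueueV2(bindingsLight, streamLight, nullptr);
//   ctxHeavy->enqueueV2(bindingsHeavy, streamHeavy, nullptr);

Note that each thread needs its own IExecutionContext (per the thread-safety link above; the engine itself can be shared), and on a single GPU stream priorities interleave the kernels rather than guarantee true parallel execution.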

For multi-threading/streaming, we suggest using DeepStream or Triton Inference Server.

For more details, we recommend raising the query in the DeepStream forum

or

opening an issue in the Triton Inference Server GitHub repository.

Thanks!