Observation: Same inference speed when running two parallel processes vs. running two threads inside a single process
Environment
TensorRT Version: 7.2.3.4
GPU Type: GeForce MX130 (2 GB)
Nvidia Driver Version: 465.19.01
CUDA Version: V11.1.105
CUDNN Version: V8
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): Python 3.6
TensorFlow Version (if applicable): NA
PyTorch Version (if applicable): 1.8.1+cu111
Baremetal or Container (if container which image + tag): NA
GPU_Utilization_with_threads.xlsx (10.0 KB)
A brief explanation of the experiments in the attached Excel sheet:
- Experiment 1 in the Excel sheet showed no surprises: loading a single TRT model took about 340 MB of GPU memory and inference took 110 ms.
- In Experiment 2 we loaded the TRT model once and ran inference on 2 threads sharing the preloaded model, to mimic 2 camera streams inside a single process (see the sketch after this list). The inference time shot up to 160 ms because of context switching, but GPU memory stayed the same.
- In Experiment 3 we ran two separate processes, each doing the camera processing on its main thread, i.e. 2 processes reading video frames in a while loop. Here GPU memory doubled because CUDA memory was allocated independently in each process, but the inference time was similar to what we saw with threading.
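For reference, below is a rough sketch of how we drive Experiment 2: two threads sharing one deserialized engine, each with its own execution context and CUDA stream. The engine path, the binding order, the static shapes and the use of PyTorch tensors as device buffers are assumptions for illustration; the real pipeline feeds decoded camera frames instead of random tensors. Experiment 3 is effectively the same script started twice as independent processes.

```python
# Sketch of Experiment 2: one deserialized engine shared by two threads,
# each thread mimicking one camera stream with its own execution context
# and CUDA stream. "model.trt", the binding order (0 = input, 1 = output)
# and the static, explicit-batch shapes are assumptions for illustration.
import threading
import time

import tensorrt as trt
import torch

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


def load_engine(path):
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())


def camera_worker(engine, stream_id, n_frames=100):
    # Per-thread execution context and CUDA stream; the engine (weights)
    # is shared, so GPU memory is only paid once per process.
    context = engine.create_execution_context()
    stream = torch.cuda.Stream()
    inp = torch.empty(tuple(engine.get_binding_shape(0)),
                      dtype=torch.float32, device="cuda")
    out = torch.empty(tuple(engine.get_binding_shape(1)),
                      dtype=torch.float32, device="cuda")
    bindings = [inp.data_ptr(), out.data_ptr()]

    start = time.time()
    for _ in range(n_frames):
        with torch.cuda.stream(stream):
            inp.normal_()  # stand-in for a decoded camera frame
            context.execute_async_v2(bindings, stream.cuda_stream)
        stream.synchronize()
    print("stream %d: %.1f ms/frame"
          % (stream_id, (time.time() - start) / n_frames * 1000))


if __name__ == "__main__":
    engine = load_engine("model.trt")
    workers = [threading.Thread(target=camera_worker, args=(engine, i))
               for i in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # Experiment 3 is the same script launched twice as independent
    # processes, so each process deserializes its own copy of the engine.
```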
Questions based on the above experiments:
1) Is threading the recommended approach for running inference on multiple camera channels with the same model? Our goal is to achieve maximum parallelism while keeping the CUDA memory overhead low.
2) How would the behavior in Experiment 3 change with GPU partitioning such as MIG on Ampere GPUs? I am assuming process 2 would not have to wait if it is accessing a different partition of GPU memory?
3) Is Triton Inference Server recommended in this context, say when we have to process 100 camera streams on a single physical server with 10 unique models, i.e. 10 models each serving 10 channels? Is the same model loaded once and reused across its channels to avoid the extra loading and memory overhead? (A rough client-side sketch of what we have in mind is below.)
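To make question 3 concrete, this is roughly what we picture on the client side: every camera channel is just another thread (or process) issuing requests against the same model name, and the server decides how many copies/instances of that model actually exist. The model name, tensor names and shapes below are placeholders for illustration.

```python
# Hypothetical client-side sketch: several "camera channels" sending
# requests to one Triton-hosted model. The model name ("detector_a"),
# tensor names ("input"/"output") and shapes are placeholders.
import threading

import numpy as np
import tritonclient.grpc as grpcclient


def channel_worker(channel_id, n_frames=10):
    # One client per thread; all channels target the same model name.
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    for _ in range(n_frames):
        frame = np.random.rand(1, 3, 224, 224).astype(np.float32)  # stand-in for a camera frame
        inp = grpcclient.InferInput("input", list(frame.shape), "FP32")
        inp.set_data_from_numpy(frame)
        out = grpcclient.InferRequestedOutput("output")
        result = client.infer(model_name="detector_a", inputs=[inp], outputs=[out])
        _ = result.as_numpy("output")
    print("channel %d done" % channel_id)


if __name__ == "__main__":
    workers = [threading.Thread(target=channel_worker, args=(i,)) for i in range(10)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

The assumption here is that all 10 channels of a model simply target the same model name, and the server-side `instance_group` setting in that model's `config.pbtxt` decides how many execution instances actually exist; whether those instances end up sharing one copy of the weights is part of what we would like confirmed.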