Inference Time When Using Multiple Streams in TensorRT Is Much Slower than with a Single Stream


Environment

TensorRT Version: 7.2.3
GPU Type: Tesla T4
Nvidia Driver Version: 440.44
CUDA Version: 10.2
CUDNN Version: 8.0
Operating System + Version: CentOS 7
Python Version (if applicable): 3.6
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.6.0
Baremetal or Container (if container which image + tag):

Hey:
When I use multiple streams that share a single context, the inference speed is much slower than with a single stream. I used nvprof to observe the GPU trace: the streams execute alternately, not in parallel as I expected.
My question is whether multiple streams are executed serially on the GPU, and how I can get the fastest speed when I have several engines that can run inference at the same time.

Hi,

You may need to create multiple IExecutionContexts and assign each IExecutionContext its own CUDA stream when calling enqueueV2. They are independent of each other.
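
For illustration, a minimal sketch of that setup (not the exact code from this thread; `engine` and the per-context `bindings` buffers are assumed to already exist elsewhere):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <vector>

// Sketch: one IExecutionContext plus one CUDA stream per concurrent inference.
// `engine` is an already-deserialized engine; `bindings[i]` holds pre-allocated
// device buffer pointers for context i (caller-provided, not shown here).
void inferConcurrently(nvinfer1::ICudaEngine& engine,
                       std::vector<std::vector<void*>>& bindings)
{
    const int numContexts = static_cast<int>(bindings.size());
    std::vector<nvinfer1::IExecutionContext*> contexts(numContexts);
    std::vector<cudaStream_t> streams(numContexts);

    for (int i = 0; i < numContexts; ++i)
    {
        contexts[i] = engine.createExecutionContext();
        cudaStreamCreate(&streams[i]);
    }

    // Launch all inferences asynchronously, each in its own stream.
    for (int i = 0; i < numContexts; ++i)
    {
        contexts[i]->enqueueV2(bindings[i].data(), streams[i], nullptr);
    }

    // Synchronize each stream; the enqueued work can overlap on the GPU
    // subject to available resources.
    for (int i = 0; i < numContexts; ++i)
    {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        contexts[i]->destroy(); // TensorRT 7.x; use `delete` on newer releases
    }
}
```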

BTW, we also have Triton Inference Server built on top of TensorRT; it has a built-in scheduler to handle multiple engines/models.

Thank you.

I do create multiple IExecutionContexts and assign each IExecutionContext a single CUDA stream, but the result is what I described. However, when I bind each IExecutionContext to a different GPU, the inference time is close to that of a single stream.
Thanks.


Q: How do I use TensorRT on multiple GPUs?

A: Each ICudaEngine object is bound to a specific GPU when it is instantiated, either by the builder or on deserialization. To select the GPU, use cudaSetDevice() before calling the builder or deserializing the engine. Each IExecutionContext is bound to the same GPU as the engine from which it was created. When calling execute() or enqueue(), ensure that the thread is associated with the correct device by calling cudaSetDevice() if necessary.
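
As a rough illustration of that answer, a per-GPU setup might look like the following sketch, where `runtime`, the serialized plan (`planData`, `planSize`), and the later binding buffers are assumed to exist:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Sketch: deserialize one engine per GPU; the engine and its contexts stay
// bound to the device that was current at deserialization time.
struct PerDeviceEngine
{
    nvinfer1::ICudaEngine* engine{nullptr};
    nvinfer1::IExecutionContext* context{nullptr};
};

PerDeviceEngine createOnDevice(nvinfer1::IRuntime& runtime, int device,
                               const void* planData, size_t planSize)
{
    PerDeviceEngine pde;
    // Select the GPU *before* deserializing the engine.
    cudaSetDevice(device);
    // TensorRT 7.x signature; on TensorRT 8+ drop the trailing nullptr.
    pde.engine = runtime.deserializeCudaEngine(planData, planSize, nullptr);
    pde.context = pde.engine->createExecutionContext();
    return pde;
}

// Later, in the thread that calls execute()/enqueue() for this engine, make
// sure the same device is current first:
//   cudaSetDevice(device);
//   context->enqueueV2(bindings, stream, nullptr);
```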

13.3.2. Within-Inference Multi-Streaming

In general, CUDA programming streams are a way of organizing asynchronous work. Asynchronous commands put into a stream are guaranteed to run in sequence but may execute out of order with respect to other streams. In particular, asynchronous commands in two streams may be scheduled to run concurrently (subject to hardware limitations).
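
As a plain CUDA illustration (not TensorRT-specific), the two streams below each keep their own copies in order, while the work across the two streams may overlap:

```cpp
#include <cuda_runtime_api.h>

int main()
{
    const size_t bytes = 1 << 20;
    float *hostA, *hostB, *devA, *devB;
    cudaMallocHost(&hostA, bytes); // pinned memory so copies can be async
    cudaMallocHost(&hostB, bytes);
    cudaMalloc(&devA, bytes);
    cudaMalloc(&devB, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Within s1 these two commands run in sequence; relative to s2 they may
    // be scheduled concurrently, subject to hardware limits.
    cudaMemcpyAsync(devA, hostA, bytes, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(hostA, devA, bytes, cudaMemcpyDeviceToHost, s1);

    cudaMemcpyAsync(devB, hostB, bytes, cudaMemcpyHostToDevice, s2);
    cudaMemcpyAsync(hostB, devB, bytes, cudaMemcpyDeviceToHost, s2);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(devA);
    cudaFree(devB);
    cudaFreeHost(hostA);
    cudaFreeHost(hostB);
    return 0;
}
```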

In the context of TensorRT and inference, each layer of the optimized final network will require work on the GPU. However, not all layers will be able to fully use the computation capabilities of the hardware. Scheduling requests in separate streams allows work to be scheduled immediately as the hardware becomes available without unnecessary synchronization. Even if only some layers can be overlapped, overall performance will improve.
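
In recent TensorRT releases (8.6 and later), within-inference streaming is controlled through auxiliary streams. The sketch below assumes the setMaxAuxStreams/setAuxStreams API is available in your version; build-time and run-time steps are shown together only for brevity, so verify the exact signatures against the API reference for your release:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Hedged sketch of the auxiliary-stream API (assumed TensorRT 8.6+).
void configureAuxStreams(nvinfer1::IBuilderConfig& config,
                         nvinfer1::IExecutionContext& context,
                         cudaStream_t mainStream)
{
    // Build time: allow up to two auxiliary streams so independent layers
    // can run concurrently within a single enqueue.
    config.setMaxAuxStreams(2);

    // Run time (optional): supply your own auxiliary streams instead of the
    // ones TensorRT would otherwise create internally.
    cudaStream_t aux[2];
    cudaStreamCreate(&aux[0]);
    cudaStreamCreate(&aux[1]);
    context.setAuxStreams(aux, 2);

    context.enqueueV3(mainStream);
    cudaStreamSynchronize(mainStream);
}
```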

13.3.3. Cross-Inference Multi-Streaming

In addition to the within-inference streaming, you can also enable streaming between multiple execution contexts. For example, you can build an engine with multiple optimization profiles and create an execution context per profile. Then, call the enqueueV3() function of the execution contexts on different streams to allow them to run in parallel.
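
A hedged sketch of that pattern, assuming a TensorRT 8.5+ API (enqueueV3, setTensorAddress), an engine already built with one optimization profile per intended context, and caller-provided device buffers in `ioTensorAddrs` (a hypothetical name-to-pointer list; input shapes for dynamic-shape engines are omitted here):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <utility>
#include <vector>

// Sketch: one execution context per optimization profile, each enqueued on
// its own stream so the inferences can overlap on the GPU.
void runPerProfile(nvinfer1::ICudaEngine& engine, int numProfiles,
                   const std::vector<std::vector<std::pair<const char*, void*>>>& ioTensorAddrs)
{
    std::vector<nvinfer1::IExecutionContext*> contexts(numProfiles);
    std::vector<cudaStream_t> streams(numProfiles);

    for (int i = 0; i < numProfiles; ++i)
    {
        contexts[i] = engine.createExecutionContext();
        cudaStreamCreate(&streams[i]);
        // Each context selects a distinct profile; the call is asynchronous
        // with respect to the given stream.
        contexts[i]->setOptimizationProfileAsync(i, streams[i]);
        for (const auto& nameAndPtr : ioTensorAddrs[i])
        {
            contexts[i]->setTensorAddress(nameAndPtr.first, nameAndPtr.second);
        }
    }

    // Enqueue all inferences; they may run in parallel on the GPU.
    for (int i = 0; i < numProfiles; ++i)
    {
        contexts[i]->enqueueV3(streams[i]);
    }
    for (int i = 0; i < numProfiles; ++i)
    {
        cudaStreamSynchronize(streams[i]);
    }
}
```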

Running multiple concurrent streams often leads to situations where several streams share compute resources at the same time. This means that the network may have less compute resources available during inference than when the TensorRT engine was being optimized. This difference in resource availability can cause TensorRT to choose a kernel that is suboptimal for the actual runtime conditions. In order to mitigate this effect, you can limit the amount of available compute resources during engine creation to more closely resemble actual runtime conditions. This approach generally promotes throughput at the expense of latency. For more information, refer to Limiting Compute Resources.

It is also possible to use multiple host threads with streams. A common pattern is incoming requests dispatched to a pool of waiting worker threads. In this case, the pool of worker threads will each have one execution context and CUDA stream. Each thread will request work in its own stream as the work becomes available. Each thread will synchronize with its stream to wait for results without blocking other worker threads.
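
A simplified sketch of that threading pattern (requests are pre-partitioned per worker instead of pulled from a real synchronized queue; `engine` and the per-request `bindings` vectors are assumed to exist):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <functional>
#include <thread>
#include <vector>

// Sketch: each worker thread owns one execution context and one CUDA stream,
// enqueues work in its own stream, and waits only on that stream.
void workerLoop(nvinfer1::IExecutionContext* context,
                const std::vector<std::vector<void*>>& myRequests)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (const auto& bindings : myRequests)
    {
        // Enqueue in this thread's own stream...
        context->enqueueV2(bindings.data(), stream, nullptr);
        // ...and synchronize only this stream, so other workers are not blocked.
        cudaStreamSynchronize(stream);
    }

    cudaStreamDestroy(stream);
}

void serve(nvinfer1::ICudaEngine& engine,
           const std::vector<std::vector<std::vector<void*>>>& requestsPerWorker)
{
    std::vector<nvinfer1::IExecutionContext*> contexts;
    std::vector<std::thread> workers;

    for (const auto& myRequests : requestsPerWorker)
    {
        contexts.push_back(engine.createExecutionContext());
        workers.emplace_back(workerLoop, contexts.back(), std::cref(myRequests));
    }
    for (auto& w : workers)
    {
        w.join();
    }
    for (auto* ctx : contexts)
    {
        ctx->destroy(); // TensorRT 7.x; use `delete ctx` on newer releases
    }
}
```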