Inference Time When Using Multiple Streams in TensorRT Is Much Slower than with a Single Stream


Environment

TensorRT Version: 7.2.3
GPU Type: Tesla T4
Nvidia Driver Version: 440.44
CUDA Version: 10.2
CUDNN Version: 8.0
Operating System + Version: CentOS 7
Python Version (if applicable): 3.6
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.6.0
Baremetal or Container (if container which image + tag):

Hey:
When I use multiple streams that share a single context, the inference speed is much slower than with a single stream. I used nvprof to observe the GPU trace: the streams execute alternately, not in parallel as I expected.
My question is whether multiple streams are executed serially on the GPU, and how I can get the fastest speed when I have several engines that can run inference at the same time.

Hi,

You may need to create multiple IExecutionContexts and assign each IExecutionContext its own CUDA stream when calling enqueueV2. They are independent of each other.
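
For illustration, a minimal sketch of that setup (not the exact code from this thread; `engine` and the per-context `bindings` buffers are assumed to already exist elsewhere):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <vector>

// Sketch: one IExecutionContext plus one CUDA stream per concurrent inference.
// `engine` is an already-deserialized engine; `bindings[i]` holds pre-allocated
// device buffer pointers for context i (caller-provided, not shown here).
void inferConcurrently(nvinfer1::ICudaEngine& engine,
                       std::vector<std::vector<void*>>& bindings)
{
    const int numContexts = static_cast<int>(bindings.size());
    std::vector<nvinfer1::IExecutionContext*> contexts(numContexts);
    std::vector<cudaStream_t> streams(numContexts);

    for (int i = 0; i < numContexts; ++i)
    {
        contexts[i] = engine.createExecutionContext();
        cudaStreamCreate(&streams[i]);
    }

    // Launch all inferences asynchronously, each in its own stream.
    for (int i = 0; i < numContexts; ++i)
    {
        contexts[i]->enqueueV2(bindings[i].data(), streams[i], nullptr);
    }

    // Synchronize each stream; the enqueued work can overlap on the GPU
    // subject to available resources.
    for (int i = 0; i < numContexts; ++i)
    {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        contexts[i]->destroy(); // TensorRT 7.x; use `delete` on newer releases
    }
}
```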

BTW, we also have Triton Inference Server built on top of TensorRT; it has a built-in scheduler to handle multiple engines/models.

Thank you.

I do create multiple IExecutionContexts and assign each IExecutionContext a single CUDA stream, but the result is what I described. However, when I bind each IExecutionContext to a different GPU, the inference time is close to that of a single stream.
Thanks.


Q: How do I use TensorRT on multiple GPUs?

A: Each ICudaEngine object is bound to a specific GPU when it is instantiated, either by the builder or on deserialization. To select the GPU, use cudaSetDevice() before calling the builder or deserializing the engine. Each IExecutionContext is bound to the same GPU as the engine from which it was created. When calling execute() or enqueue(), ensure that the thread is associated with the correct device by calling cudaSetDevice() if necessary.
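
As a rough illustration of that answer, a per-GPU setup might look like the following sketch, where `runtime`, the serialized plan (`planData`, `planSize`), and the later binding buffers are assumed to exist:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Sketch: deserialize one engine per GPU; the engine and its contexts stay
// bound to the device that was current at deserialization time.
struct PerDeviceEngine
{
    nvinfer1::ICudaEngine* engine{nullptr};
    nvinfer1::IExecutionContext* context{nullptr};
};

PerDeviceEngine createOnDevice(nvinfer1::IRuntime& runtime, int device,
                               const void* planData, size_t planSize)
{
    PerDeviceEngine pde;
    // Select the GPU *before* deserializing the engine.
    cudaSetDevice(device);
    // TensorRT 7.x signature; on TensorRT 8+ drop the trailing nullptr.
    pde.engine = runtime.deserializeCudaEngine(planData, planSize, nullptr);
    pde.context = pde.engine->createExecutionContext();
    return pde;
}

// Later, in the thread that calls execute()/enqueue() for this engine, make
// sure the same device is current first:
//   cudaSetDevice(device);
//   context->enqueueV2(bindings, stream, nullptr);
```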

13.3.2. Within-Inference Multi-Streaming

In general, CUDA programming streams are a way of organizing asynchronous work. Asynchronous commands put into a stream are guaranteed to run in sequence but may execute out of order with respect to other streams. In particular, asynchronous commands in two streams may be scheduled to run concurrently (subject to hardware limitations).
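
As a plain CUDA illustration (not TensorRT-specific), the two streams below each keep their own copies in order, while the work across the two streams may overlap:

```cpp
#include <cuda_runtime_api.h>

int main()
{
    const size_t bytes = 1 << 20;
    float *hostA, *hostB, *devA, *devB;
    cudaMallocHost(&hostA, bytes); // pinned memory so copies can be async
    cudaMallocHost(&hostB, bytes);
    cudaMalloc(&devA, bytes);
    cudaMalloc(&devB, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Within s1 these two commands run in sequence; relative to s2 they may
    // be scheduled concurrently, subject to hardware limits.
    cudaMemcpyAsync(devA, hostA, bytes, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(hostA, devA, bytes, cudaMemcpyDeviceToHost, s1);

    cudaMemcpyAsync(devB, hostB, bytes, cudaMemcpyHostToDevice, s2);
    cudaMemcpyAsync(hostB, devB, bytes, cudaMemcpyDeviceToHost, s2);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(devA);
    cudaFree(devB);
    cudaFreeHost(hostA);
    cudaFreeHost(hostB);
    return 0;
}
```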

In the context of TensorRT and inference, each layer of the optimized final network will require work on the GPU. However, not all layers will be able to fully use the computation capabilities of the hardware. Scheduling requests in separate streams allows work to be scheduled immediately as the hardware becomes available without unnecessary synchronization. Even if only some layers can be overlapped, overall performance will improve.
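
In recent TensorRT releases (8.6 and later), within-inference streaming is controlled through auxiliary streams. The sketch below assumes the setMaxAuxStreams/setAuxStreams API is available in your version; build-time and run-time steps are shown together only for brevity, so verify the exact signatures against the API reference for your release:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Hedged sketch of the auxiliary-stream API (assumed TensorRT 8.6+).
void configureAuxStreams(nvinfer1::IBuilderConfig& config,
                         nvinfer1::IExecutionContext& context,
                         cudaStream_t mainStream)
{
    // Build time: allow up to two auxiliary streams so independent layers
    // can run concurrently within a single enqueue.
    config.setMaxAuxStreams(2);

    // Run time (optional): supply your own auxiliary streams instead of the
    // ones TensorRT would otherwise create internally.
    cudaStream_t aux[2];
    cudaStreamCreate(&aux[0]);
    cudaStreamCreate(&aux[1]);
    context.setAuxStreams(aux, 2);

    context.enqueueV3(mainStream);
    cudaStreamSynchronize(mainStream);
}
```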

13.3.3. Cross-Inference Multi-Streaming

In addition to the within-inference streaming, you can also enable streaming between multiple execution contexts. For example, you can build an engine with multiple optimization profiles and create an execution context per profile. Then, call the enqueueV3() function of the execution contexts on different streams to allow them to run in parallel.
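
A hedged sketch of that pattern, assuming a TensorRT 8.5+ API (enqueueV3, setTensorAddress), an engine already built with one optimization profile per intended context, and caller-provided device buffers in `ioTensorAddrs` (a hypothetical name-to-pointer list; input shapes for dynamic-shape engines are omitted here):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <utility>
#include <vector>

// Sketch: one execution context per optimization profile, each enqueued on
// its own stream so the inferences can overlap on the GPU.
void runPerProfile(nvinfer1::ICudaEngine& engine, int numProfiles,
                   const std::vector<std::vector<std::pair<const char*, void*>>>& ioTensorAddrs)
{
    std::vector<nvinfer1::IExecutionContext*> contexts(numProfiles);
    std::vector<cudaStream_t> streams(numProfiles);

    for (int i = 0; i < numProfiles; ++i)
    {
        contexts[i] = engine.createExecutionContext();
        cudaStreamCreate(&streams[i]);
        // Each context selects a distinct profile; the call is asynchronous
        // with respect to the given stream.
        contexts[i]->setOptimizationProfileAsync(i, streams[i]);
        for (const auto& nameAndPtr : ioTensorAddrs[i])
        {
            contexts[i]->setTensorAddress(nameAndPtr.first, nameAndPtr.second);
        }
    }

    // Enqueue all inferences; they may run in parallel on the GPU.
    for (int i = 0; i < numProfiles; ++i)
    {
        contexts[i]->enqueueV3(streams[i]);
    }
    for (int i = 0; i < numProfiles; ++i)
    {
        cudaStreamSynchronize(streams[i]);
    }
}
```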

Running multiple concurrent streams often leads to situations where several streams share compute resources at the same time. This means that the network may have less compute resources available during inference than when the TensorRT engine was being optimized. This difference in resource availability can cause TensorRT to choose a kernel that is suboptimal for the actual runtime conditions. In order to mitigate this effect, you can limit the amount of available compute resources during engine creation to more closely resemble actual runtime conditions. This approach generally promotes throughput at the expense of latency. For more information, refer to Limiting Compute Resources.

It is also possible to use multiple host threads with streams. A common pattern is incoming requests dispatched to a pool of waiting worker threads. In this case, the pool of worker threads will each have one execution context and CUDA stream. Each thread will request work in its own stream as the work becomes available. Each thread will synchronize with its stream to wait for results without blocking other worker threads.
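
A simplified sketch of that threading pattern (requests are pre-partitioned per worker instead of pulled from a real synchronized queue; `engine` and the per-request `bindings` vectors are assumed to exist):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <functional>
#include <thread>
#include <vector>

// Sketch: each worker thread owns one execution context and one CUDA stream,
// enqueues work in its own stream, and waits only on that stream.
void workerLoop(nvinfer1::IExecutionContext* context,
                const std::vector<std::vector<void*>>& myRequests)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (const auto& bindings : myRequests)
    {
        // Enqueue in this thread's own stream...
        context->enqueueV2(bindings.data(), stream, nullptr);
        // ...and synchronize only this stream, so other workers are not blocked.
        cudaStreamSynchronize(stream);
    }

    cudaStreamDestroy(stream);
}

void serve(nvinfer1::ICudaEngine& engine,
           const std::vector<std::vector<std::vector<void*>>>& requestsPerWorker)
{
    std::vector<nvinfer1::IExecutionContext*> contexts;
    std::vector<std::thread> workers;

    for (const auto& myRequests : requestsPerWorker)
    {
        contexts.push_back(engine.createExecutionContext());
        workers.emplace_back(workerLoop, contexts.back(), std::cref(myRequests));
    }
    for (auto& w : workers)
    {
        w.join();
    }
    for (auto* ctx : contexts)
    {
        ctx->destroy(); // TensorRT 7.x; use `delete ctx` on newer releases
    }
}
```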