Latency when running TensorRT engines on two GPUs

Description

When I try to run two TensorRT engines on two different GPUs, there is always some latency before the second GPU starts inference. Is it possible to eliminate this latency?

Environment

TensorRT Version: 7.0
GPU Type: RTX 2080 Ti and Titan RTX
Nvidia Driver Version: 440.64.00
CUDA Version: 10.2
CUDNN Version: 7.6.5
Operating System + Version: Ubuntu 18.04

Detail

When I try to run the same inference model on two different GPUs, they do not seem to run at the same time; there is always some latency before the second GPU starts inference. My code looks roughly like this:

// Per-device TensorRT objects: one engine, context, and stream per GPU
struct Inference
{
    IRuntime* runtime;
    ICudaEngine* engine;
    IExecutionContext* context;
    cudaStream_t stream;
    void* buffer[2];              // device buffers for the input and output bindings
    int inputIndex, outputIndex;  // binding indices
};
Inference plan[2];

// ... initialize the runtime, engine, context, stream, and buffers on each device

for (int nDevice = 0; nDevice < 2; nDevice++)
{
    cudaSetDevice(DEVICE[nDevice]);
    // copy this device's slice of the input, run inference, then copy the result back
    cudaMemcpyAsync(plan[nDevice].buffer[plan[nDevice].inputIndex], fInput + nDevice * offset_input, datasize_input, cudaMemcpyHostToDevice, plan[nDevice].stream);
    plan[nDevice].context->enqueue(BATCH_SIZE, plan[nDevice].buffer, plan[nDevice].stream, nullptr);
    cudaMemcpyAsync(fOutput + nDevice * offset_output, plan[nDevice].buffer[plan[nDevice].outputIndex], datasize_output, cudaMemcpyDeviceToHost, plan[nDevice].stream);
}


The attached image shows the profiling result.

When I look at the CUDA profiling of this part, it seems that copying the input to the second device actually starts a little later than for the first device. However, when I remove the enqueue() call, the memory copies happen in parallel. Do you have any idea why this happens? Is there any way to solve it?

Hi @qiuyunzhe94,
Can you please share your script and model so that we can help you better?
Thanks!

The attachment is the script and model I use for inference. There are two more plugin files I cannot upload because they end with .cu. If you need them, please let me know how to upload them and I will. Thanks.

inference.cpp (35.0 KB)

Hi @qiuyunzhe94,
It looks like you missed attaching the model.
Also, you can zip all the files into one folder and share it either via IM or on a drive.
Thanks!

Hi @AakankshaS,

I uploaded all the scripts to the drive. Thanks for your help.

yolov4 script

Hi @AakankshaS ,

Is there any update on this issue? Or do I need to provide more information?

Thanks!

Hi @qiuyunzhe94,

Sorry for the late response.
Since you are not using multi-threading, this is expected: the second enqueue can only start after the first enqueue finishes.
You can try multi-threading for better performance. Please refer to the thread-safety best practices:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-713/best-practices/index.html#thread-safety
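For example, a minimal sketch of that approach (assuming the same plan[], DEVICE[], host buffers, and offsets from your original snippet) could launch one host thread per device, so the two enqueue() calls are no longer issued one after the other on a single CPU thread:

#include <thread>

std::thread workers[2];
for (int nDevice = 0; nDevice < 2; nDevice++)
{
    workers[nDevice] = std::thread([&, nDevice]()
    {
        // Each thread drives its own device, stream, and execution context,
        // so submissions to the two GPUs can happen concurrently.
        cudaSetDevice(DEVICE[nDevice]);
        cudaMemcpyAsync(plan[nDevice].buffer[plan[nDevice].inputIndex],
                        fInput + nDevice * offset_input, datasize_input,
                        cudaMemcpyHostToDevice, plan[nDevice].stream);
        plan[nDevice].context->enqueue(BATCH_SIZE, plan[nDevice].buffer,
                                       plan[nDevice].stream, nullptr);
        cudaMemcpyAsync(fOutput + nDevice * offset_output,
                        plan[nDevice].buffer[plan[nDevice].outputIndex],
                        datasize_output, cudaMemcpyDeviceToHost,
                        plan[nDevice].stream);
        cudaStreamSynchronize(plan[nDevice].stream);
    });
}
for (int nDevice = 0; nDevice < 2; nDevice++)
    workers[nDevice].join();

Each execution context is used from only one thread at a time, which is consistent with the thread-safety guidance linked above.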

Thanks

Hi Sunil,

I think I did put the two models on two separate devices and used two different CUDA streams to control them. If you look at the profiling figure, the two enqueue calls actually overlap most of the time rather than running one after the other. Isn’t this multi-threading? If not, how do I make it multi-threaded? The link you gave doesn’t provide much information on how to run the models with multiple threads.

Thanks!

Hi,

The enqueue operation is a CPU call, not tied to any GPU stream.
Context 1’s execution starts at the moment the code calls the first enqueue, and the same is true for context 2.
If you really want to start inference on both devices at the same time, starting a new thread per device is the way to go.
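To make the CPU-side serialization visible, a small hypothetical check (reusing the plan[] and DEVICE[] variables from the original loop) could time each enqueue() call on the host; the second device’s work can only be submitted after the first call returns:

#include <chrono>
#include <cstdio>

for (int nDevice = 0; nDevice < 2; nDevice++)
{
    cudaSetDevice(DEVICE[nDevice]);
    auto t0 = std::chrono::high_resolution_clock::now();
    // enqueue() consumes CPU time on the calling thread while it submits work
    plan[nDevice].context->enqueue(BATCH_SIZE, plan[nDevice].buffer,
                                   plan[nDevice].stream, nullptr);
    auto t1 = std::chrono::high_resolution_clock::now();
    printf("enqueue on device %d took %.3f ms of CPU time\n", nDevice,
           std::chrono::duration<double, std::milli>(t1 - t0).count());
}

This per-call CPU cost is what delays the second GPU in the single-threaded loop, and what using one thread per device avoids.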
Please refer to the links below in case they help:

Alternatively, you can use DeepStream to run multiple models.

Thanks

Thanks for your reply!