Latency when running TensorRT engines on two GPUs

Description

When I try to run two TensorRT engines on two different GPUs, there is always some latency before the second GPU starts inference. Is it possible to eliminate this latency?

Environment

TensorRT Version: 7.0
GPU Type: RTX 2080 Ti and Titan RTX
Nvidia Driver Version: 440.64.00
CUDA Version: 10.2
CUDNN Version: 7.6.5
Operating System + Version: Ubuntu 18.04

Detail

When I try to run the same inference model on two different GPUs, they do not seem to run at the same time; there is always some latency before the second GPU starts inference. My code looks roughly like this:

// Per-device TensorRT objects: one engine, context, and stream per GPU
struct Inference
{
    IRuntime* runtime;
    ICudaEngine* engine;
    IExecutionContext* context;
    cudaStream_t stream;
    void* buffer[2];              // device buffers for the input and output bindings
    int inputIndex, outputIndex;  // binding indices
};
Inference plan[2];

// ... initialize the runtime, engine, context, stream, and buffers on each device

for (int nDevice = 0; nDevice < 2; nDevice++)
{
    cudaSetDevice(DEVICE[nDevice]);
    // copy this device's slice of the input, run inference, then copy the result back
    cudaMemcpyAsync(plan[nDevice].buffer[plan[nDevice].inputIndex], fInput + nDevice * offset_input, datasize_input, cudaMemcpyHostToDevice, plan[nDevice].stream);
    plan[nDevice].context->enqueue(BATCH_SIZE, plan[nDevice].buffer, plan[nDevice].stream, nullptr);
    cudaMemcpyAsync(fOutput + nDevice * offset_output, plan[nDevice].buffer[plan[nDevice].outputIndex], datasize_output, cudaMemcpyDeviceToHost, plan[nDevice].stream);
}


The attached image shows the profiling result.

When I look at the CUDA profiling of this part, it seems that copying the input to the second device actually starts a little later than for the first device. However, when I remove the enqueue() call, the memory copies happen in parallel. Do you have any idea why this happens? Is there any way to solve it?

Hi @qiuyunzhe94,
Can you please share your script and model so that we can help you better?
Thanks!

The attachment is the script and model I use for inference. There are two more plugin files I cannot upload because they end with .cu. If you need them, please let me know how to upload them and I will. Thanks.

inference.cpp (35.0 KB)

Hi @qiuyunzhe94,
It looks like you missed attaching the model.
Also, you can zip all the files into one folder and share it either via IM or on a drive.
Thanks!

Hi @AakankshaS,

I uploaded all the scripts to the drive. Thanks for your help.

yolov4 script

Hi @AakankshaS ,

Is there any update on this issue? Or do I need to provide more information?

Thanks!

Hi @qiuyunzhe94,

Sorry for the late response.
Since you are not using multi-threading, this is expected: the second enqueue can only start after the first enqueue finishes.
You can try multi-threading for better performance. Please refer to the thread-safety best practices:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-713/best-practices/index.html#thread-safety
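For example, a minimal sketch of that approach (assuming the same plan[], DEVICE[], host buffers, and offsets from your original snippet) could launch one host thread per device, so the two enqueue() calls are no longer issued one after the other on a single CPU thread:

#include <thread>

std::thread workers[2];
for (int nDevice = 0; nDevice < 2; nDevice++)
{
    workers[nDevice] = std::thread([&, nDevice]()
    {
        // Each thread drives its own device, stream, and execution context,
        // so submissions to the two GPUs can happen concurrently.
        cudaSetDevice(DEVICE[nDevice]);
        cudaMemcpyAsync(plan[nDevice].buffer[plan[nDevice].inputIndex],
                        fInput + nDevice * offset_input, datasize_input,
                        cudaMemcpyHostToDevice, plan[nDevice].stream);
        plan[nDevice].context->enqueue(BATCH_SIZE, plan[nDevice].buffer,
                                       plan[nDevice].stream, nullptr);
        cudaMemcpyAsync(fOutput + nDevice * offset_output,
                        plan[nDevice].buffer[plan[nDevice].outputIndex],
                        datasize_output, cudaMemcpyDeviceToHost,
                        plan[nDevice].stream);
        cudaStreamSynchronize(plan[nDevice].stream);
    });
}
for (int nDevice = 0; nDevice < 2; nDevice++)
    workers[nDevice].join();

Each execution context is used from only one thread at a time, which is consistent with the thread-safety guidance linked above.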

Thanks

Hi Sunil,

I think I did put the two models on two separate devices and used two different CUDA streams to control them. If you look at the profiling figure, the two enqueue calls actually overlap most of the time rather than running one after the other. Isn’t this multi-threading? If not, how do I make it multi-threaded? The link you gave doesn’t provide much information on how to run the models with multiple threads.

Thanks!

Hi,

The enqueue operation is a CPU call, not tied to any GPU stream.
Context 1’s execution starts at the moment the code calls the first enqueue, and the same is true for context 2.
If you really want to start inference on both devices at the same time, starting a new thread per device is the way to go.
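To make the CPU-side serialization visible, a small hypothetical check (reusing the plan[] and DEVICE[] variables from the original loop) could time each enqueue() call on the host; the second device’s work can only be submitted after the first call returns:

#include <chrono>
#include <cstdio>

for (int nDevice = 0; nDevice < 2; nDevice++)
{
    cudaSetDevice(DEVICE[nDevice]);
    auto t0 = std::chrono::high_resolution_clock::now();
    // enqueue() consumes CPU time on the calling thread while it submits work
    plan[nDevice].context->enqueue(BATCH_SIZE, plan[nDevice].buffer,
                                   plan[nDevice].stream, nullptr);
    auto t1 = std::chrono::high_resolution_clock::now();
    printf("enqueue on device %d took %.3f ms of CPU time\n", nDevice,
           std::chrono::duration<double, std::milli>(t1 - t0).count());
}

This per-call CPU cost is what delays the second GPU in the single-threaded loop, and what using one thread per device avoids.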
Please refer to the links below in case they help:

Alternatively, you can use DeepStream to run multiple models.

Thanks

Thanks for your reply!