When I try to run two TensorRT engines on two different GPUs, there is always some latency before the second GPU starts inference. Is it possible to eliminate this latency?
Environment
TensorRT Version: 7.0
GPU Type: RTX 2080 Ti and Titan RTX
Nvidia Driver Version: 440.64.00
CUDA Version: 10.2
CUDNN Version: 7.6.5
Operating System + Version: Ubuntu 18.04
Details
When I try to run the same inference model on two different GPUs, it seems that they cannot run at the same time; there is always some latency before the second GPU starts inference. My code is something like this:
When I look at the CUDA profile of this part, the copy of the input to the second device actually starts a little later than the copy to the first one. However, when I remove the enqueue() call, the memory copies happen in parallel. Do you have any idea why this happens? Is there any way to solve it?
The attachment is the script and model I use for inference. There are two more plugin files I cannot upload because they end with .cu. If you need them, please let me know how to upload them and I will. Thanks.
Hi @qiuyunzhe94,
It looks like you forgot to attach the model.
Also, you can zip all the files into one folder and share it either via IM or on a drive.
Thanks!
I think I did put the two models onto two separate devices and used two different CUDA streams to control them. If you look at the profiling figure, the two enqueue calls actually overlap most of the time, rather than running one after the other. Isn't this multi-threading? If not, how do I make it multi-threaded? The link you gave didn't provide much information on how to run a model with multiple threads.
The enqueue operation is a CPU call; it is not tied to the GPU stream.
Context 1's execution starts when the code calls the first enqueue, and the same is true for context 2.
If you really want to start inference on both GPUs at the same time, starting a new thread for each is the usual approach.
Please refer to the links below in case they help:
Alternatively, you can use DeepStream to run multiple models.