Multithreaded tensorRT performance drops dramatically

Please provide the following info (check/uncheck the boxes after clicking “+ Create Topic”):
Software Version
[x] DRIVE OS Linux 5.2.0
DRIVE OS Linux 5.2.0 and DriveWorks 3.5
NVIDIA DRIVE™ Software 10.0 (Linux)
NVIDIA DRIVE™ Software 9.0 (Linux)
other DRIVE OS version
other

Target Operating System
Linux
QNX
other

Hardware Platform
NVIDIA DRIVE™ AGX Xavier DevKit (E3550)
NVIDIA DRIVE™ AGX Pegasus DevKit (E3550)
other

SDK Manager Version
1.6.0.8170
1.5.1.7815
1.5.0.7774
other

Host Machine Version
native Ubuntu 18.04
other

Hello, I profiled our application on Pegasus with two approaches: multiple processes (each multithreaded) vs. a single multithreaded process, and the performance difference is huge.

There are 7 threads running TensorRT, each with a separate execution context and a separate CUDA stream. The GPU time for each thread on its own (including async memory transfers and kernel launches) is around 20 ms. One of the threads is trivial and is not included here.
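To make the setup concrete, below is a simplified sketch of what each inference thread does (not our actual code: the engine pointer, the two-binding layout and the iteration count are placeholders):

```cpp
// Simplified sketch of the per-thread setup described above.
// "engine" and the device buffers are created elsewhere; the two-binding
// layout and the iteration count are placeholders.
#include <NvInfer.h>
#include <cuda_runtime.h>

void inferenceWorker(nvinfer1::ICudaEngine* engine, void* inputDev, void* outputDev)
{
    // Each thread owns its own execution context and CUDA stream.
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    void* bindings[] = {inputDev, outputDev};  // assumes 2 bindings

    for (int iter = 0; iter < 100; ++iter)
    {
        // Async transfers and the inference itself are all issued on this
        // thread's private stream.
        context->enqueueV2(bindings, stream, nullptr);
        cudaStreamSynchronize(stream);
    }

    cudaStreamDestroy(stream);
    context->destroy();
}

// Each of the 7 threads is spawned roughly as:
//   std::thread t(inferenceWorker, engine, inputDev, outputDev);
```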

With multiple processes we ran 1, 3, 1, 1 threads in 5 processes; the execution times were 59, (95, 95, 119), 57, 57 ms. However, if we put all of them in a single process, the times become 190, (112, 112, 124), 190, 190 ms.

I read many docs about best practices and also did some simple multithreading tests. The concurrency behavior with multithreading is quite unpredictable: sometimes there is very good concurrent overlap (with <= 3 threads), sometimes none at all (even with 2 threads), and sometimes it can be worse than serializing.

We are not using MPS since it is not available on Pegasus (not 100% sure about that). Per my understanding, multiple processes cannot share the GPU and rely on context switching, whereas multiple threads using separate streams do share the GPU, but the sharing seems very poor. My simple test program, profiled with Nsight Systems (nsys), shows that with 4 or 8 threads the overlap takes quite a bit of time, and it also varies from run to run (you cannot guarantee similar overlap for the same code).
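For context, the simple test program is roughly of the following shape (a sketch with a dummy kernel and arbitrary sizes, not the exact code that was profiled):

```cpp
// Rough sketch of the multi-stream overlap test: N host threads, each with
// its own stream, launching a dummy kernel repeatedly.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void busyKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)
            data[i] = data[i] * 1.000001f + 0.5f;
}

void streamWorker(int n)
{
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int iter = 0; iter < 50; ++iter)
        busyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d, n);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d);
}

int main()
{
    const int numThreads = 4;  // compare the nsys timeline for 2, 4, 8 threads
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t)
        workers.emplace_back(streamWorker, 1 << 20);
    for (auto& w : workers)
        w.join();
    return 0;
}
```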

I also tested limiting the number of threads accessing the GPU at the same time. So far the best result is limiting it to 1 thread at a time, which yields about 110 ms for each thread, still worse than the multi-process approach.
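The gating is roughly the following (a simplified sketch; the mutex name and wrapper function are placeholders, and in the real code the lock wraps the per-thread enqueue plus stream synchronization):

```cpp
// Simplified sketch: only one thread at a time is allowed to issue work
// to the GPU. gpuGate is shared by all worker threads.
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <mutex>

std::mutex gpuGate;

void runInferenceOnce(nvinfer1::IExecutionContext* context,
                      void* const* bindings, cudaStream_t stream)
{
    std::lock_guard<std::mutex> lock(gpuGate);      // serialize GPU access
    context->enqueueV2(bindings, stream, nullptr);  // launch on this thread's stream
    cudaStreamSynchronize(stream);                  // hold the lock until the GPU is idle
}
```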

So my questions are: how can we improve the multithreading performance? What is the best practice? Why is sharing the GPU worse than not sharing it? According to the NVIDIA documentation, multi-stream sharing is automatic, avoids event synchronization, and should perform better, but in most cases we observe the opposite.

Any input? Thanks

Dear @shangping.guo,
MPS is not supported on the DRIVE AGX platform. What is the GPU occupancy for a single thread? Could you share reproducible code and Nsight results so we can get more insight?

Hello @SivaRamaKrishnaNV ,

Thanks for the reply. Currently we are having issues using Nsight Systems on Pegasus (CUDA injection failure), so I do not have sufficient information on the GPU occupancy. But it seems to be below 50%, since putting a second thread on the GPU only increases the time by 2 ms.
If you can help with the Nsight issue, I will be able to provide more details.

Thanks
Shangping

Hi @shangping.guo, is there a topic about the Nsight issue? If not, could you create one for it? Thanks.

Yes, I created a topic: Nsys cannot collect cuda information on Drive OS 5.1

Hello @SivaRamaKrishnaNV @VickNV @kayccc, I am finally able to profile GPU performance with Nsight Systems on DRIVE OS 5.2.6, so we can continue this topic now.
I first ran one thread with TensorRT (separate execution context, separate CUDA stream), which took 14 ms.

Then I added a second TensorRT thread to the process (again with its own execution context and CUDA stream), and the time increased to 112 ms.

The zoomed-out graph:

We can clearly see that each TensorRT kernel launch (inside enqueue) needs a pthread_mutex_lock, and this locking is the main cause of the increased latency. Apparently this is far from optimal: it implies that only one sub-kernel can be launched at a time, since each launch has to acquire the lock. What would be the correct way to handle this?

Thanks

By the way, if you need the report, I can send it to you.

@VickNV Can you take a look at my post above? Thanks

As requested in Multithreaded tensorRT performance drops dramatically - #3 by SivaRamaKrishnaNV, please help provide reproduction code/steps so we can easily look into it. Thanks.

@VickNV This profiling is from our production software, which is large and complicated, so I am not sure we can provide a reproduction. We could probably write a simple test, but that also takes quite a bit of effort. What is your opinion, or is there any simpler way?

Can you observe the issue with any of our sample applications?

@VickNV Which sample applications? For TensorRT? Where can I download them? Thanks

I mean that if you can observe the issue with any of the samples under /usr/src/tensorrt/samples, it will be easier for us.

@VickNV Thanks. I will check them out.

@VickNV I tried the sample code sampleCharRNN and added a simple multithreaded test, measuring the performance for 1 thread, 2 threads and 5 threads. I observed that synchronization is needed when multiple TensorRT contexts are running, and each pthread_mutex_lock increases the latency. The sample has much less synchronization than our software, though, so the performance degradation is not as significant.
Single thread (zoomed in on a single kernel launch)


Two threads (zoomed in on the same kernel launch)

The launch goes from 1 ms (single thread) to 4.9 ms (two threads).

Dear @shangping.guo,
Could you save the profiling result and attach it here so we can load it locally? (See Nsight Systems User Guide :: Nsight Systems Documentation.)

@SivaRamaKrishnaNV Thanks, please see the attached profiling results:
report_charrnn_mt2.qdrep (1.2 MB)
The report from profiling our own software cannot be uploaded, probably because it is too large (5.7 MB).

Please share the changes you made to the sample for this result and point out the issue in the report. Thanks.

sampleChar.tar (50 KB)
@VickNV I only turned the original main into a function and then spawn multiple threads that run it; see the attachment.
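In outline, the change is along these lines (simplified; runSample stands for the former main of sampleCharRNN, and the exact diff is in the attached tarball):

```cpp
// Simplified outline of the sampleCharRNN change: the original main() body
// becomes runSample(), and several threads run it concurrently.
#include <thread>
#include <vector>

int runSample(int argc, char** argv);  // former main() of sampleCharRNN

int main(int argc, char** argv)
{
    const int numThreads = 2;  // tested with 1, 2 and 5 threads
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t)
        workers.emplace_back(runSample, argc, argv);
    for (auto& w : workers)
        w.join();
    return 0;
}
```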