Multithreaded tensorRT performance drops dramatically

Please provide the following info (check/uncheck the boxes after clicking “+ Create Topic”):
Software Version
[x] DRIVE OS Linux 5.2.0
DRIVE OS Linux 5.2.0 and DriveWorks 3.5
NVIDIA DRIVE™ Software 10.0 (Linux)
NVIDIA DRIVE™ Software 9.0 (Linux)
other DRIVE OS version
other

Target Operating System
Linux
QNX
other

Hardware Platform
NVIDIA DRIVE™ AGX Xavier DevKit (E3550)
NVIDIA DRIVE™ AGX Pegasus DevKit (E3550)
other

SDK Manager Version
1.6.0.8170
1.5.1.7815
1.5.0.7774
other

Host Machine Version
native Ubuntu 18.04
other

Hello, I profiled our application on Pegasus with two approaches: multiple processes (each multithreaded) vs. a single multithreaded process, and the performance difference is huge.

There are 7 threads running TensorRT, each with a separate execution context and a separate CUDA stream. The GPU time for each thread on its own (including async memory transfers and kernel launches) is around 20 ms. One of the threads is trivial and is not included here.
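To make the setup concrete, below is a simplified sketch of what each inference thread does (not our actual code: the engine pointer, the two-binding layout and the iteration count are placeholders):

```cpp
// Simplified sketch of the per-thread setup described above.
// "engine" and the device buffers are created elsewhere; the two-binding
// layout and the iteration count are placeholders.
#include <NvInfer.h>
#include <cuda_runtime.h>

void inferenceWorker(nvinfer1::ICudaEngine* engine, void* inputDev, void* outputDev)
{
    // Each thread owns its own execution context and CUDA stream.
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    void* bindings[] = {inputDev, outputDev};  // assumes 2 bindings

    for (int iter = 0; iter < 100; ++iter)
    {
        // Async transfers and the inference itself are all issued on this
        // thread's private stream.
        context->enqueueV2(bindings, stream, nullptr);
        cudaStreamSynchronize(stream);
    }

    cudaStreamDestroy(stream);
    context->destroy();
}

// Each of the 7 threads is spawned roughly as:
//   std::thread t(inferenceWorker, engine, inputDev, outputDev);
```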

With multiple processes we ran 1, 3, 1, 1 threads in 5 processes; the execution times were 59, (95, 95, 119), 57, 57 ms. However, if we put all of them in a single process, the times become 190, (112, 112, 124), 190, 190 ms.

I read many docs about best practices and also did some simple multithreading tests. The concurrency behavior with multithreading is quite unpredictable: sometimes there is very good concurrent overlap (with <= 3 threads), sometimes none at all (even with 2 threads), and sometimes it can be worse than serializing.

We are not using MPS since it is not available on Pegasus (not 100% sure about that). Per my understanding, multiple processes cannot share the GPU and rely on context switching, whereas multiple threads using separate streams do share the GPU, but the sharing seems very poor. My simple test program, profiled with Nsight Systems (nsys), shows that with 4 or 8 threads the overlap takes quite a bit of time, and it also varies from run to run (you cannot guarantee similar overlap for the same code).
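For context, the simple test program is roughly of the following shape (a sketch with a dummy kernel and arbitrary sizes, not the exact code that was profiled):

```cpp
// Rough sketch of the multi-stream overlap test: N host threads, each with
// its own stream, launching a dummy kernel repeatedly.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void busyKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)
            data[i] = data[i] * 1.000001f + 0.5f;
}

void streamWorker(int n)
{
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int iter = 0; iter < 50; ++iter)
        busyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d, n);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d);
}

int main()
{
    const int numThreads = 4;  // compare the nsys timeline for 2, 4, 8 threads
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t)
        workers.emplace_back(streamWorker, 1 << 20);
    for (auto& w : workers)
        w.join();
    return 0;
}
```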

I also tested limiting the number of threads accessing the GPU at the same time. So far the best result is limiting it to 1 thread at a time, which yields about 110 ms for each thread, still worse than the multi-process approach.
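The gating is roughly the following (a simplified sketch; the mutex name and wrapper function are placeholders, and in the real code the lock wraps the per-thread enqueue plus stream synchronization):

```cpp
// Simplified sketch: only one thread at a time is allowed to issue work
// to the GPU. gpuGate is shared by all worker threads.
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <mutex>

std::mutex gpuGate;

void runInferenceOnce(nvinfer1::IExecutionContext* context,
                      void* const* bindings, cudaStream_t stream)
{
    std::lock_guard<std::mutex> lock(gpuGate);      // serialize GPU access
    context->enqueueV2(bindings, stream, nullptr);  // launch on this thread's stream
    cudaStreamSynchronize(stream);                  // hold the lock until the GPU is idle
}
```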

So my questions are: how can we improve the multithreading performance? What is the best practice? Why is sharing the GPU worse than not sharing it? According to the NVIDIA documentation, multi-stream sharing is automatic, avoids event synchronization, and should perform better, but in most cases we observe the opposite.

Any input? Thanks

Dear @shangping.guo,
MPS is not supported on the DRIVE AGX platform. What is the GPU occupancy for a single thread? Could you share reproducible code and Nsight results so we can get more insight?

Hello @SivaRamaKrishnaNV ,

Thanks for the reply. Currently we are having issues using Nsight Systems on Pegasus (CUDA injection failure), so I do not have sufficient information on the GPU occupancy. But it seems to be below 50%, since putting a second thread on the GPU only increases the time by 2 ms.
If you can help with the Nsight issue, I will be able to provide more details.

Thanks
Shangping

Hi @shangping.guo, is there a topic about the Nsight issue? If not, could you create one for it? Thanks.

Yes, I created a topic: Nsys cannot collect cuda information on Drive OS 5.1

Hello @SivaRamaKrishnaNV @VickNV @kayccc, I am finally able to profile GPU performance with Nsight Systems on DRIVE OS 5.2.6, so we can continue this topic now.
I first ran one thread with TensorRT (separate execution context, separate CUDA stream), which took 14 ms.

Then I added a second TensorRT thread to the process (again with its own execution context and CUDA stream), and the time increased to 112 ms.

The zoomed-out graph:

We can clearly see that each TensorRT kernel launch (inside enqueue) needs a pthread_mutex_lock, and this locking is the main cause of the increased latency. Apparently this is far from optimal: it implies that only one sub-kernel can be launched at a time, since each launch has to acquire the lock. What would be the correct way to handle this?

Thanks

By the way, if you need the report, I can send it to you.

@VickNV Can you take a look at my post above? Thanks

As requested in Multithreaded tensorRT performance drops dramatically - #3 by SivaRamaKrishnaNV, please help provide reproduction code/steps so we can easily look into it. Thanks.

@VickNV This profiling is from our production software, which is large and complicated, so I am not sure we can provide a reproduction. We could probably write a simple test, but that also takes quite a bit of effort. What is your opinion, or is there any simpler way?

Can you observe the issue with any of our sample applications?

@VickNV Which sample applications? For TensorRT? Where can I download them? Thanks

I mean that if you can observe the issue with any of the samples under /usr/src/tensorrt/samples, it will be easier for us.

@VickNV Thanks. I will check them out.

@VickNV I tried the sample code sampleCharRNN and added a simple multithreaded test, measuring the performance for 1 thread, 2 threads and 5 threads. I observed that synchronization is needed when multiple TensorRT contexts are running, and each pthread_mutex_lock increases the latency. The sample has much less synchronization than our software, though, so the performance degradation is not as significant.
Single thread (zoomed in on a single kernel launch)


Two threads (zoomed in on the same kernel launch)

The launch goes from 1 ms (single thread) to 4.9 ms (two threads).

Dear @shangping.guo,
Could you save the profiling result and attach it here so we can load it locally? (See Nsight Systems User Guide :: Nsight Systems Documentation.)

@SivaRamaKrishnaNV Thanks, please see the attached profiling results:
report_charrnn_mt2.qdrep (1.2 MB)
The report from profiling our own software cannot be uploaded, probably because it is too large (5.7 MB).

Please share the changes you made to the sample for this result and point out the issue in the report. Thanks.

sampleChar.tar (50 KB)
@VickNV I only turned the original main into a function and then spawn multiple threads that run it; see the attachment.
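In outline, the change is along these lines (simplified; runSample stands for the former main of sampleCharRNN, and the exact diff is in the attached tarball):

```cpp
// Simplified outline of the sampleCharRNN change: the original main() body
// becomes runSample(), and several threads run it concurrently.
#include <thread>
#include <vector>

int runSample(int argc, char** argv);  // former main() of sampleCharRNN

int main(int argc, char** argv)
{
    const int numThreads = 2;  // tested with 1, 2 and 5 threads
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t)
        workers.emplace_back(runSample, argc, argv);
    for (auto& w : workers)
        w.join();
    return 0;
}
```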