Model inference on multiple CUDA streams with the TensorRT API

Environment

TensorRT Version: 8.5
GPU Type: Jetson Orin
Nvidia Driver Version: 11.4
CUDA Version: 11.4
CUDNN Version: N/A
Operating System + Version: Ubuntu 20.04

Description

Hi,
I have an application that utilizes two threads, each with its own CUDA stream, to perform inference on the same AI models using the TensorRT API. Additionally, I ensured that each thread was assigned to a separate CPU core.
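Roughly, each thread runs a loop like the following (a simplified sketch of the setup, not the exact application code; engine/binding setup, error checking, and the CPU-affinity calls are omitted):

```cpp
#include <thread>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

// One worker per thread: each thread owns its own IExecutionContext and its
// own CUDA stream. "engine" and "bindings" stand in for objects created
// elsewhere in the application.
void inferWorker(nvinfer1::ICudaEngine* engine, void** bindings, int iterations)
{
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int i = 0; i < iterations; ++i)
    {
        context->enqueueV2(bindings, stream, nullptr); // asynchronous inference on this stream
        cudaStreamSynchronize(stream);                 // wait on this stream only
    }

    cudaStreamDestroy(stream);
    delete context;
}

// Launched from the main thread, one worker per engine/stream pair:
//   std::thread t0(inferWorker, engine0, bindings0, kIterations);
//   std::thread t1(inferWorker, engine1, bindings1, kIterations);
//   t0.join();
//   t1.join();
```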

However, while profiling my application with Nsight Systems, I observed that when one CUDA stream performed the AI model inference on its own, the CPU status consistently showed as "Running". On the other hand, when both CUDA streams performed inference in parallel, the CPU status occasionally appeared as "Unscheduled", as depicted in the attached image. I am curious about the reasons behind this phenomenon. Considering that the two threads were bound to different CPU cores, I would expect no interference between them.

After adding osrt to the Nsight Systems --trace argument, I observed that the "Unscheduled" segments of the CPU threads are blocked on the pthread_mutex_lock function. The call stack for this scenario is as follows:

pthread_mutex_lock
Begins: 7.44907s
Ends: 7.44912s (+47.904 μs)

Call stack:
libpthread-2.31.so!__pthread_mutex_lock_full
libToolsInjection64.so!NSYS_OSRT_pthread_mutex_lock_1
libcuda.so.1[5 Frames]
libnvinfer.so.8.5.10[9 Frames]
[Broken backtraces]

However, when performing the same test on an x86 machine with a discrete GPU (GTX-2080), pthread_mutex_lock did not appear in the OS runtime library call stack.

I am curious why there is such a significant difference in this behavior between x86 and Jetson.

Attached are the Nsight Systems report files.

Drive Orin
9_14_drive_version_3.nsys-rep (15.5 MB)

x86
9_14_x86_version_3.nsys-rep (7.8 MB)

Hi,

I have prepared a sample code to replicate the observed phenomenon; you can find it attached to this message. The code is based on the sample code provided with TensorRT 8502.
sampleResnet.tar.gz (1.6 MB)

Here are the steps to reproduce:

  • Move the demo to the TensorRT sample code folder, e.g. /usr/src/tensorrt/samples
  • Untar it: tar xvzf sampleResnet.tar.gz
  • cd sampleResnet
  • Compile the engine file: sudo ./compile_engine.sh
  • Compile the demo code: sudo make -j8 && cp ../../bin/sample_onnx_resnet ./
  • Profile with Nsight Systems: nsys profile --stats=true --trace=cuda,nvtx,osrt --force-overwrite true --output=orin_resnet ./sample_onnx_resnet

I have tested this demo code on both Orin and x86 platforms, and the corresponding report files have been attached below for reference.
x86_resnet.nsys-rep (1.3 MB)
orin_resnet.nsys-rep (2.4 MB)

Both Nsight Systems reports, for Orin and x86, show pthread_mutex_lock occurring during model inference on different CUDA streams, but Orin shows far more occurrences than x86, as shown in the pictures below.

Orin


x86

Please let me know if you need any further information. Thank you.

Is your Jetson module an Orin or an Orin NX?
Which JetPack SW version?

Hi,

Could you share a simple reproducible source with us?
Also, have you created two TensorRT engines so they can run inference in parallel?

Thanks

  1. Jetson AGX Orin
  2. R35 (release), REVISION: 3.1, GCID: 32827747, BOARD: t186ref, EABI: aarch64, DATE: Sun Mar 19 15:19:21 UTC 2023

The reproducible source code is attached as sampleResnet.tar.gz; since it consists of several files, I didn't inline the code in this post.

In the source code, I have created two TensorRT engines that run inference on two different CUDA streams.
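Concretely, both engines are deserialized (in this sketch, from the same serialized plan) roughly as follows; gLogger and planData are placeholders for the logger and the bytes of the .engine file:

```cpp
// Deserialize the plan twice so that each thread gets its own ICudaEngine,
// and later its own IExecutionContext and CUDA stream.
nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
nvinfer1::ICudaEngine* engine0 = runtime->deserializeCudaEngine(planData.data(), planData.size());
nvinfer1::ICudaEngine* engine1 = runtime->deserializeCudaEngine(planData.data(), planData.size());
```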

Hi,

Thanks for sharing the source.

We need to discuss this with our internal team.
Will get back to you later.

Hi

Would you mind trying this with our trtexec to see if the same behavior occurs?
This can help us to confirm whether the issue is from TensorRT itself or the application.

Thanks.

Hi,

Sorry, I think trtexec is not able to reproduce the scenario described above. The key point is to run inference on two TensorRT engines concurrently, in separate threads with two different CUDA streams. With only one CUDA stream, we would not observe the pthread_mutex_lock call stack.

Hi,

I have tried using trtexec with the flags --streams=2 --threads=2 and encountered the same behavior.

nsys profile --stats=true --trace=cuda,nvtx,osrt --force-overwrite true --trace-fork-before-exec=true --output=trt_9_15 /usr/src/tensorrt/bin/trtexec --loadEngine=mobilenet_class.engine --streams=2 --threads=2

Hi,

Would you mind running the same command on a desktop to confirm the difference in the mutex usage?

Thanks.

Hi,

I apologize for any confusion, but when you mentioned the desktop, are you referring to the x86 platform?
When I ran the same command using trtexec on x86, I noticed that pthread_mutex_lock was also present. However, the notable difference is that the number of pthread_mutex_lock occurrences on the Jetson Orin is considerably higher.

x86

Orin

For your reference, I have included the profiled reports. Thank you.

Hi,

Thanks for the experiment.
The mutex call count on Orin does look much higher compared to the dGPU with the same trtexec command.

We need to check this with our internal team.
Will get back to you later.

Hi,

Thanks for your patience.

We are waiting for our internal team’s feedback.
In the meantime, could you check whether using --useSpinWait helps?
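For example, the same command as before with the extra flag (the output name is arbitrary):

nsys profile --stats=true --trace=cuda,nvtx,osrt --force-overwrite true --output=trt_spinwait /usr/src/tensorrt/bin/trtexec --loadEngine=mobilenet_class.engine --streams=2 --threads=2 --useSpinWait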

Thanks.

Hi,

It appears that using the --useSpinWait flag does not help in reducing the frequency of pthread_mutex_lock calls.

Hi,

Thanks for the testing.

We are checking this issue with our internal team.
Will get back to you later.

Hi @AastaLLL, I noticed that @kayccc has marked this post as nvbugs. Could you please confirm whether the pthread_mutex_lock issue has been verified as a bug by your internal team?

Hi,

No, the tag indicates there is an internal bug to track this.

Our internal team is still working on this issue.
Will get back to you once we get any further updates.

Thanks.