Model inference on multiple CUDA streams with the TensorRT API

Environment

TensorRT Version: 8.5
GPU Type: Jetson Orin
Nvidia Driver Version: 11.4
CUDA Version: 11.4
CUDNN Version: N/A
Operating System + Version: Ubuntu 20.04

Description

Hi,
I have an application that utilizes two threads, each with its own CUDA stream, to perform inference on the same AI models using the TensorRT API. Additionally, I ensured that each thread was assigned to a separate CPU core.
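Roughly, each thread runs a loop like the following (a simplified sketch of the setup, not the exact application code; engine/binding setup, error checking, and the CPU-affinity calls are omitted):

```cpp
#include <thread>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

// One worker per thread: each thread owns its own IExecutionContext and its
// own CUDA stream. "engine" and "bindings" stand in for objects created
// elsewhere in the application.
void inferWorker(nvinfer1::ICudaEngine* engine, void** bindings, int iterations)
{
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int i = 0; i < iterations; ++i)
    {
        context->enqueueV2(bindings, stream, nullptr); // asynchronous inference on this stream
        cudaStreamSynchronize(stream);                 // wait on this stream only
    }

    cudaStreamDestroy(stream);
    delete context;
}

// Launched from the main thread, one worker per engine/stream pair:
//   std::thread t0(inferWorker, engine0, bindings0, kIterations);
//   std::thread t1(inferWorker, engine1, bindings1, kIterations);
//   t0.join();
//   t1.join();
```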

However, while profiling my application with Nsight Systems, I observed that when one CUDA stream performed the AI model inference on its own, the CPU status consistently showed as "Running". On the other hand, when both CUDA streams performed inference in parallel, the CPU status occasionally appeared as "Unscheduled", as depicted in the attached image. I am curious about the reasons behind this phenomenon. Considering that the two threads were bound to different CPU cores, I would expect no interference between them.

After adding osrt to the Nsight Systems --trace argument, I observed that the "Unscheduled" segments of the CPU threads are blocked on the pthread_mutex_lock function. The call stack for this scenario is as follows:

pthread_mutex_lock
Begins: 7.44907s
Ends: 7.44912s (+47.904 μs)

Call stack:
libpthread-2.31.so!__pthread_mutex_lock_full
libToolsInjection64.so!NSYS_OSRT_pthread_mutex_lock_1
libcuda.so.1[5 Frames]
libnvinfer.so.8.5.10[9 Frames]
[Broken backtraces]

However, when performing the same test on an x86 machine with a discrete GPU (GTX-2080), pthread_mutex_lock did not appear in the OS runtime library call stack.

I am curious why there is such a significant difference in this behavior between x86 and Jetson.

Attached are the Nsight Systems report files.

Drive Orin
9_14_drive_version_3.nsys-rep (15.5 MB)

x86
9_14_x86_version_3.nsys-rep (7.8 MB)

Hi,

I have prepared a sample code to replicate the observed phenomenon; you can find it attached to this message. The code is based on the sample code provided with TensorRT 8502.
sampleResnet.tar.gz (1.6 MB)

Here are the steps to reproduce:

  • Move the demo to the TensorRT sample code folder, e.g. /usr/src/tensorrt/samples
  • Untar it: tar xvzf sampleResnet.tar.gz
  • cd sampleResnet
  • Compile the engine file: sudo ./compile_engine.sh
  • Compile the demo code: sudo make -j8 && cp ../../bin/sample_onnx_resnet ./
  • Profile with Nsight Systems: nsys profile --stats=true --trace=cuda,nvtx,osrt --force-overwrite true --output=orin_resnet ./sample_onnx_resnet

I have tested this demo code on both Orin and x86 platforms, and the corresponding report files have been attached below for reference.
x86_resnet.nsys-rep (1.3 MB)
orin_resnet.nsys-rep (2.4 MB)

Both Nsight Systems reports, for Orin and x86, show pthread_mutex_lock occurring during model inference on different CUDA streams, but Orin shows far more occurrences than x86, as shown in the pictures below.

Orin


x86

Please let me know if you need any further information. Thank you.

Is your Jetson module an Orin or an Orin NX?
Which JetPack SW version?

Hi,

Could you share a simple reproducible source with us?
Also, have you created two TensorRT engines so they can run inference in parallel?

Thanks

  1. Jetson AGX Orin
  2. R35 (release), REVISION: 3.1, GCID: 32827747, BOARD: t186ref, EABI: aarch64, DATE: Sun Mar 19 15:19:21 UTC 2023

The reproducible source code is attached as sampleResnet.tar.gz; since it consists of several files, I didn't inline the code in this post.

In the source code, I have created two TensorRT engines that run inference on two different CUDA streams.
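Concretely, both engines are deserialized (in this sketch, from the same serialized plan) roughly as follows; gLogger and planData are placeholders for the logger and the bytes of the .engine file:

```cpp
// Deserialize the plan twice so that each thread gets its own ICudaEngine,
// and later its own IExecutionContext and CUDA stream.
nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
nvinfer1::ICudaEngine* engine0 = runtime->deserializeCudaEngine(planData.data(), planData.size());
nvinfer1::ICudaEngine* engine1 = runtime->deserializeCudaEngine(planData.data(), planData.size());
```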

Hi,

Thanks for sharing the source.

We need to discuss this with our internal team.
Will get back to you later.

Hi

Would you mind trying this with our trtexec to see if the same behavior occurs?
This can help us to confirm whether the issue is from TensorRT itself or the application.

Thanks.

Hi,

Sorry, I think trtexec is not able to reproduce the scenario described above. The key point is to run inference on two TensorRT engines concurrently, in separate threads with two different CUDA streams. With only one CUDA stream, we would not observe the pthread_mutex_lock call stack.

Hi,

I have tried using trtexec with the flags --streams=2 --threads=2 and encountered the same behavior.

nsys profile --stats=true --trace=cuda,nvtx,osrt --force-overwrite true --trace-fork-before-exec=true --output=trt_9_15 /usr/src/tensorrt/bin/trtexec --loadEngine=mobilenet_class.engine --streams=2 --threads=2

Hi,

Would you mind running the same command on a desktop to confirm the difference in the mutex usage?

Thanks.

Hi,

I apologize for any confusion, but when you mentioned the desktop, are you referring to the x86 platform?
When I ran the same command using trtexec on x86, I noticed that pthread_mutex_lock was also present. However, the notable difference is that the number of pthread_mutex_lock occurrences on the Jetson Orin is considerably higher.

x86

Orin

For your reference, I have included the profiled reports. Thank you.

Hi,

Thanks for the experiment.
The mutex call count on Orin does look much higher compared to the dGPU with the same trtexec command.

We need to check this with our internal team.
Will get back to you later.

Hi,

Thanks for your patience.

We are waiting for our internal team’s feedback.
In the meantime, could you check whether using --useSpinWait helps?
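For example, the same command as before with the extra flag (the output name is arbitrary):

nsys profile --stats=true --trace=cuda,nvtx,osrt --force-overwrite true --output=trt_spinwait /usr/src/tensorrt/bin/trtexec --loadEngine=mobilenet_class.engine --streams=2 --threads=2 --useSpinWait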

Thanks.

Hi,

It appears that using the --useSpinWait flag does not help in reducing the frequency of pthread_mutex_lock calls.

Hi,

Thanks for the testing.

We are checking this issue with our internal team.
Will get back to you later.

Hi @AastaLLL, I noticed that @kayccc has marked this post as nvbugs. Could you please confirm whether the pthread_mutex_lock issue has been verified as a bug by your internal team?

Hi,

No, the tag indicates there is an internal bug to track this.

Our internal team is still working on this issue.
Will get back to you once we get any further updates.

Thanks.