TensorRT Version: 8.5
GPU Type: Jetson Orin
Nvidia Driver Version: 11.4
CUDA Version: 11.4
CUDNN Version: N/A
Operating System + Version: Ubuntu 20.04
Description
Hi,
I have an application that uses two threads, each with its own CUDA stream, to perform inference on the same AI models through the TensorRT API. Additionally, I made sure that each thread is pinned to a separate CPU core.
However, while profiling my application with Nsight Systems, I observed that when a single CUDA stream performed the model inference on its own, the CPU status consistently showed as "Running". When both CUDA streams performed inference in parallel, on the other hand, the CPU status occasionally appeared as "Unscheduled", as depicted in the attached image. I am curious about the reason behind this phenomenon: since the two threads are bound to different CPU cores, I would expect no interference between them.
After adding osrt to the Nsight Systems --trace argument, I observed that the "Unscheduled" segments of the CPU threads are caused by blocking in the pthread_mutex_lock function. The call stack for this scenario is as follows:
However, when performing the same test on an x86 machine with a discrete GTX-2080 GPU, pthread_mutex_lock did not appear in the OS runtime library call stack at all.
I have prepared sample code to reproduce the observed phenomenon; you can find it attached to this message. The code is based on the samples shipped with TensorRT version 8502. sampleResnet.tar.gz (1.6 MB)
Here are the steps to reproduce:
1. Move the archive to the TensorRT samples folder, e.g. /usr/src/tensorrt/samples
2. Untar it: tar xvzf sampleResnet.tar.gz
3. cd sampleResnet
4. Build the engine file: sudo ./compile_engine.sh
5. Compile the demo code: sudo make -j8 && cp ../../bin/sample_onnx_resnet ./
6. Profile with Nsight Systems: nsys profile --stats=true --trace=cuda,nvtx,osrt --force-overwrite true --output=orin_resnet ./sample_onnx_resnet
I have tested this demo code on both Orin and x86 platforms, and the corresponding report files have been attached below for reference. x86_resnet.nsys-rep (1.3 MB) orin_resnet.nsys-rep (2.4 MB)
Both Nsight Systems reports, for Orin and for x86, show occurrences of pthread_mutex_lock during model inference on different CUDA streams, but Orin shows far more of them than x86, as shown in the pictures below.
Would you mind trying this with our trtexec to see if the same behavior occurs?
This would help us confirm whether the issue comes from TensorRT itself or from the application.
Sorry, I think trtexec is not able to reproduce the scenario described above. The key point is to run inference on two TensorRT engines concurrently, in separate threads with two different CUDA streams. With only one CUDA stream, we do not observe the pthread_mutex_lock call stack.
I apologize for any confusion, but when you mentioned the desktop, are you referring to the x86 platform?
When I ran the same command using trtexec on x86, I noticed that pthread_mutex_lock was also present. However, the notable difference is that the number of pthread_mutex_lock occurrences on the Jetson Orin is considerably higher.
Hi @AastaLLL , I noticed that @kayccc has marked this post as nvbugs. Could you please confirm if the pthread_mutex_lock issue has been verified as a bug by your internal team?