TensorRT performance on Jetson AGX Orin


TensorRT Version:
GPU Type: Jetson AGX Orin
Nvidia Driver Version:
CUDA Version: 11.4
CUDNN Version:
Operating System + Version: Ubuntu 20.04
Etc.: OpenCV 4.8.1 with CUDA


I installed Qt Creator on the Jetson AGX Orin device and created the following example:

TensorRT C++ example

I confirmed that the program takes 20 ms to perform TensorRT detection (same image, 20 times).

The final goal of the program is to read and analyze images from a camera.

So, to simulate a camera, I created a simple thread.

However, I noticed that TensorRT detection performance deteriorates when multiple threads are created.

class TestThread : public QThread
{
    void run() override;
};

void TestThread::run() {
    // later... get image from camera
}

MainWindow::MainWindow(QWidget *parent)
    : QMainWindow(parent)
    , ui(new Ui::MainWindow)
{
    //TensorRT* tensorRT = new TensorRT();

    TestThread* t1 = new TestThread();
    TestThread* t2 = new TestThread();
    TestThread* t3 = new TestThread();
    TestThread* t4 = new TestThread();

    // detect image thread
    object_detection_image = new Object_detection_image();

The detection time is 300 ms!!
No threads: 20 ms → four threads: 300 ms

I want to know why this is happening.
Could it be that on the Jetson board, TensorRT performance degrades because of the increased CPU usage caused by the threads?

(When I ran the same source code on a Windows desktop [Core i7-11700K 3.60 GHz, RTX 3060], the detection time was unaffected: no threads 20 ms → four threads 20 ms.)

Please let me know if I am doing something wrong.


First, could you set the device to maximum performance to make sure the tasks are done as quickly as possible?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Then please share the tegrastats output with us.


Based on your implementation, the inference is done in the main process rather than in a thread.
Is that correct?


The detection time improved a little:

300 ms → 130 ms

However, I'm still trying to reach the 20 ms time I get with no threads.

Ah, sorry, I forgot tegrastats. I will share the tegrastats output next week;
the Jetson is at my workplace.

No, the detection runs in the "Object_detection_image" thread:

void Object_detection_image::run() {
    YoloV8Config config;
    std::string onnxModelPath;
    std::string inputImage;

    onnxModelPath = "hovlane_aimmo_2nd_yeoncheon_seg_best_int32.onnx";
    inputImage = "first_230708_113958.jpg";

    // Create the YoloV8 engine
    YoloV8 yoloV8(onnxModelPath, config);

    // Read the input image
    auto img = cv::imread(inputImage);
    if (img.empty()) {
        std::cout << "Error: Unable to read image at path '" << inputImage << "'" << std::endl;
        return;
    }

    // Run inference
    time_t ss, ee;
    //qDebug() << "Starting analysis" << QString::number(float(clock())/CLOCKS_PER_SEC, 'f', 4);
    ss = clock();
    const auto objects = yoloV8.detectObjects(img);
    ee = clock();
    qDebug() << "Detection time left_1_Img = " << QString::number(float(ee - ss)/CLOCKS_PER_SEC, 'f', 4) << " sec";
}

Here is the tegrastats output:

Normal status:
RAM 6296/62780MB (lfb 12855x4MB) SWAP 0/31390MB (cached 0MB)
EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@51.562C Tboard@39C SOC2@48.343C Tdiode@40.25C SOC0@48C CV1@-256C GPU@46C tj@51.562C SOC1@46.593C CV2@-256C

Running the program (four TestThreads):
RAM 7810/62780MB (lfb 12631x4MB) SWAP 0/31390MB (cached 0MB)
EMC_FREQ 0% GR3D_FREQ 1% CV0@-256C CPU@52.656C Tboard@39C SOC2@48.781C Tdiode@40.5C SOC0@48.218C CV1@-256C GPU@46.625C tj@52.656C SOC1@47.062C CV2@-256C

I think I found where the problem is.

The code that keeps causing the delay is below:

qDebug() << "88888" << QString::number(float(clock() - ss)/CLOCKS_PER_SEC, 'f', 4) << " sec";

// Copy the outputs back to CPU
for (int batch = 0; batch < batchSize; ++batch) {
    // Batch
    std::vector<std::vector<float>> batchOutputs{};
    for (int32_t outputBinding = numInputs; outputBinding < m_engine->getNbBindings(); ++outputBinding) {
        // We start at index m_inputDims.size() to account for the inputs in our m_buffers
        std::vector<float> output;
        auto outputLenFloat = m_outputLengthsFloat[outputBinding - numInputs];
        output.resize(outputLenFloat);
        // Copy the output
        checkCudaErrorCode(cudaMemcpyAsync(output.data(), static_cast<char*>(m_buffers[outputBinding]) + (batch * sizeof(float) * outputLenFloat), outputLenFloat * sizeof(float), cudaMemcpyDeviceToHost, inferenceCudaStream));
        qDebug() << "99999" << QString::number(float(clock() - ss)/CLOCKS_PER_SEC, 'f', 4) << " sec";

No threads:
88888 "0.0034" sec
99999 "0.0076" sec

Four threads:
88888 "0.0195" sec
99999 "0.1356" sec


"checkCudaErrorCode(cudaMemcpyAsync(output.data(), static_cast<char*>(m_buffers[outputBinding]) + (batch * sizeof(float) * outputLenFloat), outputLenFloat * sizeof(float), cudaMemcpyDeviceToHost, inferenceCudaStream));"

The for loop runs twice, and with four threads the first iteration takes 1000 ms!!

What is cudaMemcpyAsync? Why is it so slow with four threads?

How should I solve this problem?


Is there a single inferenceCudaStream, or a separate one for each thread?
Since tasks launched on the same CUDA stream are executed in order, you may want to use different CUDA streams to allow the tasks to run in parallel.



I found out that there is no problem with TensorRT.

Multithreading behaves strangely on Ubuntu!!!

I replaced the detection code with:

for (int i = 0; i < 99999999; i++) {
    int a = 1 + i; ... int e = 1 + i;
}

The result is the same! On Windows there is no difference in running time,
but on Ubuntu there is a difference in running time!!! Why??? T.T

This topic is not appropriate!!

I will delete this topic.



Do you have an Ubuntu desktop environment?
If yes, would you mind giving it a try to see if this only occurs in the Jetson environment?


It's not a Jetson problem but an Ubuntu problem!


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.