Performance Discrepancy - Python API vs. trtexec on Jetson AGX Orin Board

Hey Nvidia Forum community,

I’m facing a performance discrepancy on the Jetson AGX Orin 32GB Developer Kit board and would love to get your insights on the matter. Specifically, I’ve noticed a significant difference in latency results between using the Python API and trtexec. Surprisingly, this wasn’t the case when I was working with a T4 GPU.
I am using JetPack 5.1.1 on the AGX Orin board.

trtexec command:

 /usr/src/tensorrt/bin/trtexec --onnx=.../best.onnx --saveEngine=.../best.engine

trtexec output:

[07/05/2023-00:55:57] [I] === Performance summary ===
[07/05/2023-00:55:57] [I] Throughput: 119.089 qps
[07/05/2023-00:55:57] [I] Latency: min = 8.48901 ms, max = 13.4727 ms, mean = 9.13869 ms, median = 8.75708 ms, percentile(90%) = 10.3955 ms, percentile(95%) = 11.6753 ms, percentile(99%) = 12.9832 ms
[07/05/2023-00:55:57] [I] Enqueue Time: min = 1.14331 ms, max = 3.45764 ms, mean = 2.09376 ms, median = 2.09064 ms, percentile(90%) = 2.96436 ms, percentile(95%) = 3.04156 ms, percentile(99%) = 3.36731 ms
[07/05/2023-00:55:57] [I] H2D Latency: min = 0.42981 ms, max = 0.866577 ms, mean = 0.737847 ms, median = 0.738739 ms, percentile(90%) = 0.774658 ms, percentile(95%) = 0.786621 ms, percentile(99%) = 0.83374 ms
[07/05/2023-00:55:57] [I] GPU Compute Time: min = 7.90393 ms, max = 12.6794 ms, mean = 8.37038 ms, median = 7.9917 ms, percentile(90%) = 9.60596 ms, percentile(95%) = 10.8478 ms, percentile(99%) = 12.171 ms
[07/05/2023-00:55:57] [I] D2H Latency: min = 0.0187988 ms, max = 0.0332031 ms, mean = 0.0304636 ms, median = 0.0306396 ms, percentile(90%) = 0.0317383 ms, percentile(95%) = 0.0319824 ms, percentile(99%) = 0.0328369 ms
[07/05/2023-00:55:57] [I] Total Host Walltime: 3.02294 s
[07/05/2023-00:55:57] [I] Total GPU Compute Time: 3.01334 s
[07/05/2023-00:55:57] [W] * GPU compute time is unstable, with coefficient of variance = 11.1399%.
[07/05/2023-00:55:57] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/05/2023-00:55:57] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/05/2023-00:55:57] [I] 
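
Side note: trtexec itself warns that the GPU compute time is unstable and suggests locking the GPU clocks or adding --useSpinWait. Re-running with that flag would look like this (the numbers above are from the plain command):

 /usr/src/tensorrt/bin/trtexec --onnx=.../best.onnx --saveEngine=.../best.engine --useSpinWait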

Python code:

import time

import numpy as np
import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
import pycuda.driver as cuda
import tensorrt as trt


class ModelTRT():
 
 
    def __init__(self, sEnginePath, sPrecision, bEnd2end = False):
        self.n_classes = 2
        self.class_names = [ 'nonCoded', 'coded']
 
 
        self.sPrecision = sPrecision
        self.bEnd2end = bEnd2end
 
        logger = trt.Logger(trt.Logger.WARNING)
        logger.min_severity = trt.Logger.Severity.ERROR
        runtime = trt.Runtime(logger)
        trt.init_libnvinfer_plugins(logger,'') # initialize TensorRT plugins
        with open(sEnginePath, "rb") as f:
            serialized_engine = f.read()
        engine = runtime.deserialize_cuda_engine(serialized_engine)
        self.imgsz = engine.get_binding_shape(0)[2:]  # get the real input shape of the model, in case the user got it wrong
        self.context = engine.create_execution_context()
        self.inputs, self.outputs, self.bindings = [], [], []
        self.stream = cuda.Stream()
        # allocate a pinned host buffer and a device buffer for every binding
        for binding in engine:
            size = trt.volume(engine.get_binding_shape(binding))
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(device_mem))
            if engine.binding_is_input(binding):
                self.inputs.append({'host': host_mem, 'device': device_mem})
            else:
                self.outputs.append({'host': host_mem, 'device': device_mem})
 
 
    def infer(self, aImage, bProfile = False):
 
 
        self.inputs[0]['host'] = np.ravel(aImage)
 
        if bProfile:
            fCPU2GPUStart = time.time()
 
        # transfer data to the gpu
        for inp in self.inputs:
            cuda.memcpy_htod_async(inp['device'], inp['host'], self.stream)
        self.stream.synchronize()
 
        if bProfile:
            fInferenceStart = time.time()
 
        # run inference
        self.context.execute_async_v2(
            bindings=self.bindings,
            stream_handle=self.stream.handle
        )
        self.stream.synchronize()
        
        if bProfile:
            fGPU2CPUStart = time.time()
 
        # fetch outputs from gpu
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out['host'], out['device'], self.stream)
        # synchronize stream
        self.stream.synchronize()
 
        if bProfile:
            fPostProStart = time.time()
 
        data = [out['host'] for out in self.outputs]
 
        if self.bEnd2end:
            num, final_boxes, final_scores, final_cls_inds = data
            final_boxes = np.reshape(final_boxes, (-1, 4))
            dets = np.concatenate([
                final_boxes[:num[0]],
                np.array(final_scores)[:num[0]].reshape(-1, 1),
                np.array(final_cls_inds)[:num[0]].reshape(-1, 1),
            ], axis=-1)
        else:
            dets = np.reshape(data, (1, 6, -1))
 
        if bProfile:
            return dets, fCPU2GPUStart, fInferenceStart, fGPU2CPUStart, fPostProStart
        else:
            return dets
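
The measurement loop itself is not shown above; it is roughly the sketch below (engine path and dummy input are placeholders, not the actual ones). The median of 200 profiled calls is reported, taking fGPU2CPUStart - fInferenceStart as the inference time:

# Rough sketch of the measurement loop (paths and dummy input are placeholders):
model = ModelTRT("best.engine", sPrecision="FP32")
aImage = np.random.rand(1, 3, model.imgsz[0], model.imgsz[1]).astype(np.float32)  # dummy input

lTimes = []
for _ in range(200):
    _, fCPU2GPUStart, fInferenceStart, fGPU2CPUStart, fPostProStart = model.infer(aImage, bProfile=True)
    lTimes.append(fGPU2CPUStart - fInferenceStart)  # execute_async_v2 + synchronize only

print("Inference time (s): ", np.median(lTimes))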

Python output (median time of 200 executions):

 Inference time (s):  0.015452146530151367

Why is there such a difference on the Jetson AGX Orin board?
I'm also attaching the ONNX model.
best.onnx (11.8 MB)

Thanks in advance,

Hi,

Are you using the same profiling code on the T4?
Thanks.

Hi,
Yes, exactly the same.

Hi,

Thanks for the confirmation.

We are going to reproduce this in our environment.
Would you also share the results (trtexec vs. Python) on the T4 for our reference?

Thanks.

Results on the T4:

trtexec:

[07/06/2023-07:21:32] [I] === Performance summary ===
[07/06/2023-07:21:32] [I] Throughput: 143.553 qps
[07/06/2023-07:21:32] [I] Latency: min = 8.90747 ms, max = 10.5049 ms, mean = 9.16742 ms, median = 9.18896 ms, percentile(90%) = 9.21436 ms, percentile(95%) = 9.3103 ms, percentile(99%) = 9.39819 ms
[07/06/2023-07:21:32] [I] Enqueue Time: min = 0.626282 ms, max = 0.80928 ms, mean = 0.673718 ms, median = 0.670776 ms, percentile(90%) = 0.707275 ms, percentile(95%) = 0.7229 ms, percentile(99%) = 0.760971 ms
[07/06/2023-07:21:32] [I] H2D Latency: min = 2.13812 ms, max = 2.15649 ms, mean = 2.14479 ms, median = 2.14453 ms, percentile(90%) = 2.14673 ms, percentile(95%) = 2.14786 ms, percentile(99%) = 2.15039 ms
[07/06/2023-07:21:32] [I] GPU Compute Time: min = 6.67578 ms, max = 8.27956 ms, mean = 6.935 ms, median = 6.95642 ms, percentile(90%) = 6.98145 ms, percentile(95%) = 7.08179 ms, percentile(99%) = 7.16675 ms
[07/06/2023-07:21:32] [I] D2H Latency: min = 0.0859375 ms, max = 0.114258 ms, mean = 0.0876223 ms, median = 0.0874023 ms, percentile(90%) = 0.088501 ms, percentile(95%) = 0.0888672 ms, percentile(99%) = 0.0921631 ms
[07/06/2023-07:21:32] [I] Total Host Walltime: 3.02328 s
[07/06/2023-07:21:32] [I] Total GPU Compute Time: 3.00979 s
[07/06/2023-07:21:32] [W] * GPU compute time is unstable, with coefficient of variance = 1.49505%.
[07/06/2023-07:21:32] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/06/2023-07:21:32] [I] Explanations of the performance metrics are printed in the verbose logs.

Python:

Inference time (s):  0.011955976486206055

On the T4 I have CUDA 11.8 and TensorRT 8.6.1.6-1.

By the way, for the Python numbers (on both the Jetson and the T4), I take fGPU2CPUStart - fInferenceStart as the inference time, i.e. only the execute_async_v2 + stream.synchronize() calls.

With FP16 and INT8 (the numbers above are FP32), the difference between trtexec and the Python API on the Jetson AGX Orin, compared to the T4, is even more significant.
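
For reference, the FP16 / INT8 engines can be built with trtexec's precision flags (the output names below are placeholders; note that --int8 without a calibration cache only measures speed, not accuracy):

 /usr/src/tensorrt/bin/trtexec --onnx=.../best.onnx --saveEngine=.../best_fp16.engine --fp16
 /usr/src/tensorrt/bin/trtexec --onnx=.../best.onnx --saveEngine=.../best_int8.engine --int8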

Hi,

I resolved the issue by running the sudo jetson_clocks command. Now there isn't a significant difference between trtexec and the Python API, so it's likely that trtexec internally adjusts the clocks by itself.
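
For anyone hitting the same thing, the related JetPack commands are (jetson_clocks locks the clocks at the maximum allowed by the active nvpmodel power mode):

 sudo nvpmodel -q           # query the active power mode
 sudo jetson_clocks         # lock CPU/GPU/EMC clocks to the maximum of that mode
 sudo jetson_clocks --show  # verify the frequencies are pinned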

Thank you.

Hi,

Good to know you have solved the issue.

By default, Jetson uses dynamic clock frequencies.
trtexec runs a warmup stage before the benchmark, which might explain the difference.
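
If you prefer not to lock the clocks, you can mimic that warmup in your Python script before timing, for example (a sketch reusing the model and aImage names from the earlier snippet; the warmup count is arbitrary):

 # warm up so the dynamic clocks ramp up before the timed runs
 for _ in range(20):
     model.infer(aImage)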

Thanks.
