Performance Discrepancy - Python API vs. trtexec on Jetson AGX Orin Board

Hey Nvidia Forum community,

I’m facing a performance discrepancy on the Jetson AGX Orin 32GB Developer Kit board and would love to get your insights on the matter. Specifically, I’ve noticed a significant difference in latency results between using the Python API and trtexec. Surprisingly, this wasn’t the case when I was working with a T4 GPU.
I am using JetPack 5.1.1 on the AGX Orin board.

trtexec command:

 /usr/src/tensorrt/bin/trtexec --onnx=.../best.onnx --saveEngine=.../best.engine

trtexec output:

[07/05/2023-00:55:57] [I] === Performance summary ===
[07/05/2023-00:55:57] [I] Throughput: 119.089 qps
[07/05/2023-00:55:57] [I] Latency: min = 8.48901 ms, max = 13.4727 ms, mean = 9.13869 ms, median = 8.75708 ms, percentile(90%) = 10.3955 ms, percentile(95%) = 11.6753 ms, percentile(99%) = 12.9832 ms
[07/05/2023-00:55:57] [I] Enqueue Time: min = 1.14331 ms, max = 3.45764 ms, mean = 2.09376 ms, median = 2.09064 ms, percentile(90%) = 2.96436 ms, percentile(95%) = 3.04156 ms, percentile(99%) = 3.36731 ms
[07/05/2023-00:55:57] [I] H2D Latency: min = 0.42981 ms, max = 0.866577 ms, mean = 0.737847 ms, median = 0.738739 ms, percentile(90%) = 0.774658 ms, percentile(95%) = 0.786621 ms, percentile(99%) = 0.83374 ms
[07/05/2023-00:55:57] [I] GPU Compute Time: min = 7.90393 ms, max = 12.6794 ms, mean = 8.37038 ms, median = 7.9917 ms, percentile(90%) = 9.60596 ms, percentile(95%) = 10.8478 ms, percentile(99%) = 12.171 ms
[07/05/2023-00:55:57] [I] D2H Latency: min = 0.0187988 ms, max = 0.0332031 ms, mean = 0.0304636 ms, median = 0.0306396 ms, percentile(90%) = 0.0317383 ms, percentile(95%) = 0.0319824 ms, percentile(99%) = 0.0328369 ms
[07/05/2023-00:55:57] [I] Total Host Walltime: 3.02294 s
[07/05/2023-00:55:57] [I] Total GPU Compute Time: 3.01334 s
[07/05/2023-00:55:57] [W] * GPU compute time is unstable, with coefficient of variance = 11.1399%.
[07/05/2023-00:55:57] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/05/2023-00:55:57] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/05/2023-00:55:57] [I] 
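
Side note: trtexec itself warns that the GPU compute time is unstable and suggests locking the GPU clocks or adding --useSpinWait. Re-running with that flag would look like this (the numbers above are from the plain command):

 /usr/src/tensorrt/bin/trtexec --onnx=.../best.onnx --saveEngine=.../best.engine --useSpinWait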

Python code:

import time

import numpy as np
import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
import pycuda.driver as cuda
import tensorrt as trt


class ModelTRT():
 
 
    def __init__(self, sEnginePath, sPrecision, bEnd2end = False):
        self.n_classes = 2
        self.class_names = [ 'nonCoded', 'coded']
 
 
        self.sPrecision = sPrecision
        self.bEnd2end = bEnd2end
 
        logger = trt.Logger(trt.Logger.WARNING)
        logger.min_severity = trt.Logger.Severity.ERROR
        runtime = trt.Runtime(logger)
        trt.init_libnvinfer_plugins(logger,'') # initialize TensorRT plugins
        with open(sEnginePath, "rb") as f:
            serialized_engine = f.read()
        engine = runtime.deserialize_cuda_engine(serialized_engine)
        self.imgsz = engine.get_binding_shape(0)[2:]  # get the real input shape of the model, in case the user got it wrong
        self.context = engine.create_execution_context()
        self.inputs, self.outputs, self.bindings = [], [], []
        self.stream = cuda.Stream()
        # allocate a pinned host buffer and a device buffer for every binding
        for binding in engine:
            size = trt.volume(engine.get_binding_shape(binding))
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(device_mem))
            if engine.binding_is_input(binding):
                self.inputs.append({'host': host_mem, 'device': device_mem})
            else:
                self.outputs.append({'host': host_mem, 'device': device_mem})
 
 
    def infer(self, aImage, bProfile = False):
 
 
        self.inputs[0]['host'] = np.ravel(aImage)
 
        if bProfile:
            fCPU2GPUStart = time.time()
 
        # transfer data to the gpu
        for inp in self.inputs:
            cuda.memcpy_htod_async(inp['device'], inp['host'], self.stream)
        self.stream.synchronize()
 
        if bProfile:
            fInferenceStart = time.time()
 
        # run inference
        self.context.execute_async_v2(
            bindings=self.bindings,
            stream_handle=self.stream.handle
        )
        self.stream.synchronize()
        
        if bProfile:
            fGPU2CPUStart = time.time()
 
        # fetch outputs from gpu
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out['host'], out['device'], self.stream)
        # synchronize stream
        self.stream.synchronize()
 
        if bProfile:
            fPostProStart = time.time()
 
        data = [out['host'] for out in self.outputs]
 
        if self.bEnd2end:
            num, final_boxes, final_scores, final_cls_inds = data
            final_boxes = np.reshape(final_boxes, (-1, 4))
            dets = np.concatenate([
                final_boxes[:num[0]],
                np.array(final_scores)[:num[0]].reshape(-1, 1),
                np.array(final_cls_inds)[:num[0]].reshape(-1, 1),
            ], axis=-1)
        else:
            dets = np.reshape(data, (1, 6, -1))
 
        if bProfile:
            return dets, fCPU2GPUStart, fInferenceStart, fGPU2CPUStart, fPostProStart
        else:
            return dets
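
The measurement loop itself is not shown above; it is roughly the sketch below (engine path and dummy input are placeholders, not the actual ones). The median of 200 profiled calls is reported, taking fGPU2CPUStart - fInferenceStart as the inference time:

# Rough sketch of the measurement loop (paths and dummy input are placeholders):
model = ModelTRT("best.engine", sPrecision="FP32")
aImage = np.random.rand(1, 3, model.imgsz[0], model.imgsz[1]).astype(np.float32)  # dummy input

lTimes = []
for _ in range(200):
    _, fCPU2GPUStart, fInferenceStart, fGPU2CPUStart, fPostProStart = model.infer(aImage, bProfile=True)
    lTimes.append(fGPU2CPUStart - fInferenceStart)  # execute_async_v2 + synchronize only

print("Inference time (s): ", np.median(lTimes))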

Python output (median time of 200 executions):

 Inference time (s):  0.015452146530151367

Why is there such a difference on the Jetson AGX Orin board?
I'm also attaching the ONNX model.
best.onnx (11.8 MB)

Thanks in advance,

Hi,

Are you using the same profiling code on the T4?
Thanks.

Hi,
Yes, exactly the same.

Hi,

Thanks for the confirmation.

We are going to reproduce this in our environment.
Would you also share the results (trtexec vs. Python) on the T4 for our reference?

Thanks.

Results on the T4:

trtexec:

[07/06/2023-07:21:32] [I] === Performance summary ===
[07/06/2023-07:21:32] [I] Throughput: 143.553 qps
[07/06/2023-07:21:32] [I] Latency: min = 8.90747 ms, max = 10.5049 ms, mean = 9.16742 ms, median = 9.18896 ms, percentile(90%) = 9.21436 ms, percentile(95%) = 9.3103 ms, percentile(99%) = 9.39819 ms
[07/06/2023-07:21:32] [I] Enqueue Time: min = 0.626282 ms, max = 0.80928 ms, mean = 0.673718 ms, median = 0.670776 ms, percentile(90%) = 0.707275 ms, percentile(95%) = 0.7229 ms, percentile(99%) = 0.760971 ms
[07/06/2023-07:21:32] [I] H2D Latency: min = 2.13812 ms, max = 2.15649 ms, mean = 2.14479 ms, median = 2.14453 ms, percentile(90%) = 2.14673 ms, percentile(95%) = 2.14786 ms, percentile(99%) = 2.15039 ms
[07/06/2023-07:21:32] [I] GPU Compute Time: min = 6.67578 ms, max = 8.27956 ms, mean = 6.935 ms, median = 6.95642 ms, percentile(90%) = 6.98145 ms, percentile(95%) = 7.08179 ms, percentile(99%) = 7.16675 ms
[07/06/2023-07:21:32] [I] D2H Latency: min = 0.0859375 ms, max = 0.114258 ms, mean = 0.0876223 ms, median = 0.0874023 ms, percentile(90%) = 0.088501 ms, percentile(95%) = 0.0888672 ms, percentile(99%) = 0.0921631 ms
[07/06/2023-07:21:32] [I] Total Host Walltime: 3.02328 s
[07/06/2023-07:21:32] [I] Total GPU Compute Time: 3.00979 s
[07/06/2023-07:21:32] [W] * GPU compute time is unstable, with coefficient of variance = 1.49505%.
[07/06/2023-07:21:32] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/06/2023-07:21:32] [I] Explanations of the performance metrics are printed in the verbose logs.

Python:

Inference time (s):  0.011955976486206055

On the T4 I have CUDA 11.8 and TensorRT 8.6.1.6-1.

By the way, for the Python numbers (on both the Jetson and the T4), I take fGPU2CPUStart - fInferenceStart as the inference time, i.e. only the execute_async_v2 + stream.synchronize() calls.

With FP16 and INT8 (the numbers above are FP32), the difference between trtexec and the Python API on the Jetson AGX Orin, compared to the T4, is even more significant.
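
For reference, the FP16 / INT8 engines can be built with trtexec's precision flags (the output names below are placeholders; note that --int8 without a calibration cache only measures speed, not accuracy):

 /usr/src/tensorrt/bin/trtexec --onnx=.../best.onnx --saveEngine=.../best_fp16.engine --fp16
 /usr/src/tensorrt/bin/trtexec --onnx=.../best.onnx --saveEngine=.../best_int8.engine --int8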

Hi,

I resolved the issue by running the sudo jetson_clocks command. Now there isn't a significant difference between trtexec and the Python API, so it's likely that trtexec internally adjusts the clocks by itself.
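
For anyone hitting the same thing, the related JetPack commands are (jetson_clocks locks the clocks at the maximum allowed by the active nvpmodel power mode):

 sudo nvpmodel -q           # query the active power mode
 sudo jetson_clocks         # lock CPU/GPU/EMC clocks to the maximum of that mode
 sudo jetson_clocks --show  # verify the frequencies are pinned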

Thank you.

Hi,

Good to know you have solved the issue.

By default, Jetson uses dynamic clock frequencies.
trtexec runs a warmup stage before the benchmark, which might explain the difference.
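
If you prefer not to lock the clocks, you can mimic that warmup in your Python script before timing, for example (a sketch reusing the model and aImage names from the earlier snippet; the warmup count is arbitrary):

 # warm up so the dynamic clocks ramp up before the timed runs
 for _ in range(20):
     model.infer(aImage)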

Thanks.
