Jetson Thor - INT8 quantization shows no performance gain over FP16

Hello,

I’ve been working with the Jetson Thor for a while and I’ve gotten satisfying FP32/FP16 results compared to the AGX. However, with INT8 or FP4, I am not seeing the expected performance gains over FP16.

I’ve mainly tried detection with RT-DETR, but I’ve observed the same behavior with Swin-UNet, which makes me think the issue might be tied to transformer quantization.

I have tried both implicit and explicit quantization, either using the --best option or doing QAT after Q/DQ insertion with TensorRT Model Optimizer, but in all cases the INT8 engine ends up about the same as or worse than FP16.
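Roughly, the explicit-quantization path looks like this (a minimal sketch only; model and calib_loader are placeholders for my RT-DETR module and calibration dataloader, and I am quoting the default INT8 config name from the Model Optimizer documentation, so please correct me if that is not the intended API):

import torch
import modelopt.torch.quantization as mtq

# model: the PyTorch RT-DETR module, assumed already loaded on the GPU
# calib_loader: placeholder DataLoader yielding preprocessed 640x640 image batches

def forward_loop(m):
    # Run a handful of calibration batches so activation ranges can be collected
    for images in calib_loader:
        m(images.cuda())

# Insert Q/DQ nodes with the default INT8 configuration, then calibrate
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# ... QAT fine-tuning of the quantized model goes here (omitted) ...

# Export the Q/DQ graph to ONNX so trtexec can build the explicit-INT8 engine
dummy = torch.randn(1, 3, 640, 640, device="cuda")
torch.onnx.export(model, dummy, "rtdetr_r50vd_qdq.onnx", opset_version=16)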

FP16:

[12/02/2025-15:16:27] [I] === Performance summary ===
[12/02/2025-15:16:27] [I] Throughput: 252.872 qps
[12/02/2025-15:16:27] [I] Latency: min = 3.76587 ms, max = 10.6292 ms, mean = 5.72842 ms, median = 7.44531 ms, percentile(90%) = 7.61548 ms, percentile(95%) = 7.65662 ms, percentile(99%) = 8.13245 ms
[12/02/2025-15:16:27] [I] Enqueue Time: min = 0.00366211 ms, max = 0.043335 ms, mean = 0.010914 ms, median = 0.0057373 ms, percentile(90%) = 0.0274048 ms, percentile(95%) = 0.0280762 ms, percentile(99%) = 0.0321045 ms
[12/02/2025-15:16:27] [I] H2D Latency: min = 0.0289917 ms, max = 4.65234 ms, mean = 1.92862 ms, median = 3.69739 ms, percentile(90%) = 3.82581 ms, percentile(95%) = 3.88184 ms, percentile(99%) = 4.07684 ms
[12/02/2025-15:16:27] [I] GPU Compute Time: min = 3.72095 ms, max = 6.85022 ms, mean = 3.79412 ms, median = 3.75574 ms, percentile(90%) = 3.85913 ms, percentile(95%) = 3.98755 ms, percentile(99%) = 4.11963 ms
[12/02/2025-15:16:27] [I] D2H Latency: min = 0.00463867 ms, max = 0.0262451 ms, mean = 0.00567765 ms, median = 0.00518799 ms, percentile(90%) = 0.0065918 ms, percentile(95%) = 0.00695801 ms, percentile(99%) = 0.0107422 ms
[12/02/2025-15:16:27] [I] Total Host Walltime: 3.00942 s
[12/02/2025-15:16:27] [I] Total GPU Compute Time: 2.88732 s
[12/02/2025-15:16:27] [W] * GPU compute time is unstable, with coefficient of variance = 3.97967%.
[12/02/2025-15:16:27] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[12/02/2025-15:16:27] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/02/2025-15:16:27] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v101303] [b9] # trtexec --loadEngine=models/opset16_engines/rtdetr_r50vd_6x_coco_from_paddle_opset16_640_fp16.engine --useCudaGraph

INT8 (generated with implicit quantization):

[12/02/2025-15:17:08] [I] === Performance summary ===
[12/02/2025-15:17:08] [I] Throughput: 249.071 qps
[12/02/2025-15:17:08] [I] Latency: min = 3.86841 ms, max = 11.3739 ms, mean = 5.85741 ms, median = 4.26099 ms, percentile(90%) = 7.7793 ms, percentile(95%) = 7.84473 ms, percentile(99%) = 8.24225 ms
[12/02/2025-15:17:08] [I] Enqueue Time: min = 0.00402832 ms, max = 0.0432129 ms, mean = 0.0130806 ms, median = 0.00585938 ms, percentile(90%) = 0.0336914 ms, percentile(95%) = 0.0341797 ms, percentile(99%) = 0.0380859 ms
[12/02/2025-15:17:08] [I] H2D Latency: min = 0.0332031 ms, max = 4.2243 ms, mean = 1.97908 ms, median = 0.110107 ms, percentile(90%) = 3.94336 ms, percentile(95%) = 3.97266 ms, percentile(99%) = 4.1488 ms
[12/02/2025-15:17:08] [I] GPU Compute Time: min = 3.80896 ms, max = 7.41101 ms, mean = 3.87224 ms, median = 3.83044 ms, percentile(90%) = 3.92078 ms, percentile(95%) = 4.03687 ms, percentile(99%) = 4.15466 ms
[12/02/2025-15:17:08] [I] D2H Latency: min = 0.00488281 ms, max = 0.0556641 ms, mean = 0.00609101 ms, median = 0.00585938 ms, percentile(90%) = 0.00695801 ms, percentile(95%) = 0.00714111 ms, percentile(99%) = 0.0161133 ms
[12/02/2025-15:17:08] [I] Total Host Walltime: 3.00717 s
[12/02/2025-15:17:08] [I] Total GPU Compute Time: 2.90031 s
[12/02/2025-15:17:08] [W] * GPU compute time is unstable, with coefficient of variance = 4.51879%.
[12/02/2025-15:17:08] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[12/02/2025-15:17:08] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/02/2025-15:17:08] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v101303] [b9] # trtexec --loadEngine=models/opset16_engines/rtdetr_r50vd_6x_coco_from_paddle_opset16_640_best.engine --useCudaGraph

INT8 (Q/DQ inserted with TensorRT Model Optimizer + QAT):

[12/02/2025-15:18:17] [I] === Performance summary ===
[12/02/2025-15:18:17] [I] Throughput: 252.137 qps
[12/02/2025-15:18:17] [I] Latency: min = 3.78122 ms, max = 11.5471 ms, mean = 5.78844 ms, median = 7.29779 ms, percentile(90%) = 7.94812 ms, percentile(95%) = 7.98499 ms, percentile(99%) = 8.06317 ms
[12/02/2025-15:18:17] [I] Enqueue Time: min = 0.00756836 ms, max = 0.0596924 ms, mean = 0.0172023 ms, median = 0.00982666 ms, percentile(90%) = 0.0404053 ms, percentile(95%) = 0.043335 ms, percentile(99%) = 0.0453491 ms
[12/02/2025-15:18:17] [I] H2D Latency: min = 0.0290527 ms, max = 7.13513 ms, mean = 1.9558 ms, median = 1.9104 ms, percentile(90%) = 3.94324 ms, percentile(95%) = 4.01025 ms, percentile(99%) = 4.07312 ms
[12/02/2025-15:18:17] [I] GPU Compute Time: min = 3.72778 ms, max = 7.09024 ms, mean = 3.82676 ms, median = 3.76208 ms, percentile(90%) = 3.96667 ms, percentile(95%) = 4.01855 ms, percentile(99%) = 4.11511 ms
[12/02/2025-15:18:17] [I] D2H Latency: min = 0.00488281 ms, max = 0.0256348 ms, mean = 0.00588597 ms, median = 0.00561523 ms, percentile(90%) = 0.00683594 ms, percentile(95%) = 0.00720215 ms, percentile(99%) = 0.0146484 ms
[12/02/2025-15:18:17] [I] Total Host Walltime: 3.0063 s
[12/02/2025-15:18:17] [I] Total GPU Compute Time: 2.90068 s
[12/02/2025-15:18:17] [W] * GPU compute time is unstable, with coefficient of variance = 4.51783%.
[12/02/2025-15:18:17] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[12/02/2025-15:18:17] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/02/2025-15:18:17] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v101303] [b9] # trtexec --loadEngine=models/engines/rtdetr_r50vd_6x_coco_from_paddle_640_qdq_best.engine --useCudaGraph

All engines were generated with:

trtexec --onnx=<model.onnx> --saveEngine=<model.engine> [--best|--fp16]
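Expanded per precision, that gives commands along these lines (illustrative only; the ONNX/engine names are placeholders, and the clock-pinning step is what I run before benchmarking, as suggested by the trtexec warning — the nvpmodel mode index may differ on Thor):

# FP16 baseline
trtexec --onnx=<model_opset16.onnx> --saveEngine=<model_fp16.engine> --fp16

# Implicit INT8: TensorRT freely mixes FP32/FP16/INT8 per layer
trtexec --onnx=<model_opset16.onnx> --saveEngine=<model_best.engine> --best

# Explicit INT8: Q/DQ ONNX exported after Model Optimizer QAT
trtexec --onnx=<model_qdq.onnx> --saveEngine=<model_qdq_best.engine> --best

# Pin clocks before every timing run
sudo nvpmodel -m 0 && sudo jetson_clocks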

Do you have any idea what could be causing this issue?

thor01@thor01:~/rendu$ dpkg -l | grep "tensorrt\|l4t-kernel\|l4t-core"
ii  nv-tensorrt-local-repo-ubuntu2404-10.14.1-cuda-13.0 1.0-1                                    arm64        nv-tensorrt-local repository configuration files
ii  nvidia-l4t-core                                     38.2.2-20250925153837                    arm64        NVIDIA Core Package
ii  nvidia-l4t-kernel                                   6.8.12-tegra-38.2.2-20250925153837       arm64        NVIDIA Kernel Package
ii  nvidia-l4t-kernel-dtbs                              6.8.12-tegra-38.2.2-20250925153837       arm64        NVIDIA Kernel DTB Package
ii  nvidia-l4t-kernel-headers                           6.8.12-tegra-38.2.2-20250925153837       arm64        NVIDIA Linux Tegra Kernel Headers Package
ii  nvidia-l4t-kernel-module-configs                    6.8.12-tegra-38.2.2-20250925153837       arm64        NVIDIA System-wide Kernel Module Configuration Package
ii  nvidia-l4t-kernel-oot-headers                       6.8.12-tegra-38.2.2-20250925153837       arm64        NVIDIA OOT Kernel Module Headers Package
ii  nvidia-l4t-kernel-oot-modules                       6.8.12-tegra-38.2.2-20250925153837       arm64        NVIDIA OOT Kernel Module Drivers Package
ii  nvidia-l4t-kernel-openrm                            6.8.12-tegra-38.2.2-20250925153837       arm64        NVIDIA Kernel Package containing OpenRM specific files
ii  nvidia-l4t-kernel-partitions                        6.8.12-tegra-38.2.2-20250925153837       arm64        NVIDIA Kernel and Kernel DTB Partition Package
ii  tensorrt                                            10.13.3.9-1+cuda13.0                     arm64        Meta package for TensorRT
ii  tensorrt-libs                                       10.13.3.9-1+cuda13.0                     arm64        Meta package for TensorRT runtime libraries

Thank you for your help.

Cordially,


Hi,

There is an existing report that FP8 does not reach the expected performance with TensorRT on Thor:

We will check whether the same perf regression applies to the INT8 case.
Thanks.


Thank you for the update; I’ll wait for further news then!

Hi,

Thanks for your patience.

Based on your log, the problem size is too small to show the INT8 benefit.
Can you try with a larger batch size and share the results with us?
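For example, if your ONNX has a dynamic batch dimension, something like the command below can be used (the input tensor name "image" is only an example; please use your model's actual input binding name):

trtexec --onnx=<model.onnx> --saveEngine=<model_int8_b16.engine> --best \
        --minShapes=image:1x3x640x640 \
        --optShapes=image:16x3x640x640 \
        --maxShapes=image:16x3x640x640 \
        --useCudaGraph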

Thanks.

Hello,

I’ve tried larger batch sizes and I observe the same behavior:

Name                            Batch size   Dtype   FPS
rtdetr_r50vd_batchsize4_fp32    4            FP32    27.06
rtdetr_r50vd_batchsize4_fp16    4            FP16    65.20
rtdetr_r50vd_batchsize4_int8    4            INT8    50.20
rtdetr_r50vd_batchsize4_best    4            BEST    64.69
rtdetr_r50vd_batchsize16_fp32   16           FP32     6.48
rtdetr_r50vd_batchsize16_fp16   16           FP16    15.46
rtdetr_r50vd_batchsize16_int8   16           INT8    12.20
rtdetr_r50vd_batchsize16_best   16           BEST    15.61

Cordially,