Hello,
I’ve been experimenting with the Jetson Thor for a while and I’ve gotten satisfying FP32/FP16 results compared to the AGX. However, when it comes to INT8 or FP4, I am not seeing the expected performance gains over FP16.
I’ve mainly tried detection with RT-DETR, but I’ve been observing the same behavior with Swin-UNet, which makes me think that the issue might be tied to transformer quantization.
I have tried both implicit and explicit quantization, either using the --best option or doing QAT after Q/DQ insertion with TensorRT Model Optimizer, but in all cases the INT8 engine ends up about the same as, or worse than, FP16.
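In case it helps, here is a sketch of how I could check whether the INT8 kernels are actually being selected at build time (paths are placeholders from my setup, and the exact JSON field names in the layer-info export may vary across TensorRT versions):

```shell
# Build the Q/DQ ONNX as an explicitly quantized engine and dump per-layer
# information, to verify that INT8 kernels are actually chosen.
# --stronglyTyped (TensorRT 10) makes the builder respect the Q/DQ ops
# instead of silently falling back to higher precisions.
trtexec --onnx=model_qdq.onnx \
        --stronglyTyped \
        --saveEngine=model_qdq.engine \
        --profilingVerbosity=detailed \
        --exportLayerInfo=layers.json

# Rough count of the precisions that ended up in the engine
# (field name is an assumption; inspect layers.json if it differs):
grep -o '"Precision": "[A-Za-z0-9]*"' layers.json | sort | uniq -c
```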
FP16:
[12/02/2025-15:16:27] [I] === Performance summary ===
[12/02/2025-15:16:27] [I] Throughput: 252.872 qps
[12/02/2025-15:16:27] [I] Latency: min = 3.76587 ms, max = 10.6292 ms, mean = 5.72842 ms, median = 7.44531 ms, percentile(90%) = 7.61548 ms, percentile(95%) = 7.65662 ms, percentile(99%) = 8.13245 ms
[12/02/2025-15:16:27] [I] Enqueue Time: min = 0.00366211 ms, max = 0.043335 ms, mean = 0.010914 ms, median = 0.0057373 ms, percentile(90%) = 0.0274048 ms, percentile(95%) = 0.0280762 ms, percentile(99%) = 0.0321045 ms
[12/02/2025-15:16:27] [I] H2D Latency: min = 0.0289917 ms, max = 4.65234 ms, mean = 1.92862 ms, median = 3.69739 ms, percentile(90%) = 3.82581 ms, percentile(95%) = 3.88184 ms, percentile(99%) = 4.07684 ms
[12/02/2025-15:16:27] [I] GPU Compute Time: min = 3.72095 ms, max = 6.85022 ms, mean = 3.79412 ms, median = 3.75574 ms, percentile(90%) = 3.85913 ms, percentile(95%) = 3.98755 ms, percentile(99%) = 4.11963 ms
[12/02/2025-15:16:27] [I] D2H Latency: min = 0.00463867 ms, max = 0.0262451 ms, mean = 0.00567765 ms, median = 0.00518799 ms, percentile(90%) = 0.0065918 ms, percentile(95%) = 0.00695801 ms, percentile(99%) = 0.0107422 ms
[12/02/2025-15:16:27] [I] Total Host Walltime: 3.00942 s
[12/02/2025-15:16:27] [I] Total GPU Compute Time: 2.88732 s
[12/02/2025-15:16:27] [W] * GPU compute time is unstable, with coefficient of variance = 3.97967%.
[12/02/2025-15:16:27] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[12/02/2025-15:16:27] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/02/2025-15:16:27] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v101303] [b9] # trtexec --loadEngine=models/opset16_engines/rtdetr_r50vd_6x_coco_from_paddle_opset16_640_fp16.engine --useCudaGraph
INT8 (generated with implicit quantization):
[12/02/2025-15:17:08] [I] === Performance summary ===
[12/02/2025-15:17:08] [I] Throughput: 249.071 qps
[12/02/2025-15:17:08] [I] Latency: min = 3.86841 ms, max = 11.3739 ms, mean = 5.85741 ms, median = 4.26099 ms, percentile(90%) = 7.7793 ms, percentile(95%) = 7.84473 ms, percentile(99%) = 8.24225 ms
[12/02/2025-15:17:08] [I] Enqueue Time: min = 0.00402832 ms, max = 0.0432129 ms, mean = 0.0130806 ms, median = 0.00585938 ms, percentile(90%) = 0.0336914 ms, percentile(95%) = 0.0341797 ms, percentile(99%) = 0.0380859 ms
[12/02/2025-15:17:08] [I] H2D Latency: min = 0.0332031 ms, max = 4.2243 ms, mean = 1.97908 ms, median = 0.110107 ms, percentile(90%) = 3.94336 ms, percentile(95%) = 3.97266 ms, percentile(99%) = 4.1488 ms
[12/02/2025-15:17:08] [I] GPU Compute Time: min = 3.80896 ms, max = 7.41101 ms, mean = 3.87224 ms, median = 3.83044 ms, percentile(90%) = 3.92078 ms, percentile(95%) = 4.03687 ms, percentile(99%) = 4.15466 ms
[12/02/2025-15:17:08] [I] D2H Latency: min = 0.00488281 ms, max = 0.0556641 ms, mean = 0.00609101 ms, median = 0.00585938 ms, percentile(90%) = 0.00695801 ms, percentile(95%) = 0.00714111 ms, percentile(99%) = 0.0161133 ms
[12/02/2025-15:17:08] [I] Total Host Walltime: 3.00717 s
[12/02/2025-15:17:08] [I] Total GPU Compute Time: 2.90031 s
[12/02/2025-15:17:08] [W] * GPU compute time is unstable, with coefficient of variance = 4.51879%.
[12/02/2025-15:17:08] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[12/02/2025-15:17:08] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/02/2025-15:17:08] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v101303] [b9] # trtexec --loadEngine=models/opset16_engines/rtdetr_r50vd_6x_coco_from_paddle_opset16_640_best.engine --useCudaGraph
INT8 (Q/DQ inserted with Model Optimizer + QAT):
[12/02/2025-15:18:17] [I] === Performance summary ===
[12/02/2025-15:18:17] [I] Throughput: 252.137 qps
[12/02/2025-15:18:17] [I] Latency: min = 3.78122 ms, max = 11.5471 ms, mean = 5.78844 ms, median = 7.29779 ms, percentile(90%) = 7.94812 ms, percentile(95%) = 7.98499 ms, percentile(99%) = 8.06317 ms
[12/02/2025-15:18:17] [I] Enqueue Time: min = 0.00756836 ms, max = 0.0596924 ms, mean = 0.0172023 ms, median = 0.00982666 ms, percentile(90%) = 0.0404053 ms, percentile(95%) = 0.043335 ms, percentile(99%) = 0.0453491 ms
[12/02/2025-15:18:17] [I] H2D Latency: min = 0.0290527 ms, max = 7.13513 ms, mean = 1.9558 ms, median = 1.9104 ms, percentile(90%) = 3.94324 ms, percentile(95%) = 4.01025 ms, percentile(99%) = 4.07312 ms
[12/02/2025-15:18:17] [I] GPU Compute Time: min = 3.72778 ms, max = 7.09024 ms, mean = 3.82676 ms, median = 3.76208 ms, percentile(90%) = 3.96667 ms, percentile(95%) = 4.01855 ms, percentile(99%) = 4.11511 ms
[12/02/2025-15:18:17] [I] D2H Latency: min = 0.00488281 ms, max = 0.0256348 ms, mean = 0.00588597 ms, median = 0.00561523 ms, percentile(90%) = 0.00683594 ms, percentile(95%) = 0.00720215 ms, percentile(99%) = 0.0146484 ms
[12/02/2025-15:18:17] [I] Total Host Walltime: 3.0063 s
[12/02/2025-15:18:17] [I] Total GPU Compute Time: 2.90068 s
[12/02/2025-15:18:17] [W] * GPU compute time is unstable, with coefficient of variance = 4.51783%.
[12/02/2025-15:18:17] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[12/02/2025-15:18:17] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/02/2025-15:18:17] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v101303] [b9] # trtexec --loadEngine=models/engines/rtdetr_r50vd_6x_coco_from_paddle_640_qdq_best.engine --useCudaGraph
All engines were generated with:
trtexec --onnx=<model.onnx> --saveEngine=<model.engine> [--best|--fp16]
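Since trtexec warns that GPU compute time is unstable, I also intend to lock the clocks before benchmarking; a sketch of what I plan to run, assuming the standard JetPack/L4T tools are available on Thor:

```shell
# Lock Jetson clocks to their maximums before benchmarking, as the
# trtexec warning suggests; nvpmodel/jetson_clocks ship with L4T.
sudo nvpmodel -m 0   # select the max-performance power mode
                     # (the mode id may differ on Thor; check `nvpmodel -q`)
sudo jetson_clocks   # pin CPU/GPU/EMC clocks to their current maximums

# Re-run the benchmark with spin-waiting to reduce timing jitter.
trtexec --loadEngine=model.engine --useCudaGraph --useSpinWait
```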
Do you have any idea what could be causing this issue?
thor01@thor01:~/rendu$ dpkg -l | grep "tensorrt\|l4t-kernel\|l4t-core"
ii nv-tensorrt-local-repo-ubuntu2404-10.14.1-cuda-13.0 1.0-1 arm64 nv-tensorrt-local repository configuration files
ii nvidia-l4t-core 38.2.2-20250925153837 arm64 NVIDIA Core Package
ii nvidia-l4t-kernel 6.8.12-tegra-38.2.2-20250925153837 arm64 NVIDIA Kernel Package
ii nvidia-l4t-kernel-dtbs 6.8.12-tegra-38.2.2-20250925153837 arm64 NVIDIA Kernel DTB Package
ii nvidia-l4t-kernel-headers 6.8.12-tegra-38.2.2-20250925153837 arm64 NVIDIA Linux Tegra Kernel Headers Package
ii nvidia-l4t-kernel-module-configs 6.8.12-tegra-38.2.2-20250925153837 arm64 NVIDIA System-wide Kernel Module Configuration Package
ii nvidia-l4t-kernel-oot-headers 6.8.12-tegra-38.2.2-20250925153837 arm64 NVIDIA OOT Kernel Module Headers Package
ii nvidia-l4t-kernel-oot-modules 6.8.12-tegra-38.2.2-20250925153837 arm64 NVIDIA OOT Kernel Module Drivers Package
ii nvidia-l4t-kernel-openrm 6.8.12-tegra-38.2.2-20250925153837 arm64 NVIDIA Kernel Package containing OpenRM specific files
ii nvidia-l4t-kernel-partitions 6.8.12-tegra-38.2.2-20250925153837 arm64 NVIDIA Kernel and Kernel DTB Partition Package
ii tensorrt 10.13.3.9-1+cuda13.0 arm64 Meta package for TensorRT
ii tensorrt-libs 10.13.3.9-1+cuda13.0 arm64 Meta package for TensorRT runtime libraries
Thank you for your help.
Best regards,