Low ViT Performance Gain on Jetson Thor Using FP8 vs FP16

ours.magenta · October 28, 2025, 4:26pm

Hello,

According to the documentation, enabling FP8 operations requires some ONNX surgery (inserting Q/DQ nodes at specific locations) to trigger the right MHA (Multi-Head Attention) fusion in FP8 precision.

However, the performance improvement is quite low for the ViT-Base model (~20% latency reduction). It is even worse on the EfficientSAM encoder, with basically no gain.

Judging from the TensorRT profiling and layer info, FP8 does seem to be in use (even though some tactic names are quite cryptic, especially the gmm_mha_v2_#weirdbitstream one).

Environment

TensorRT Version: 10.13.3
NVIDIA GPU: Thor (Jetson DevKit)
NVIDIA Driver Version: 580.00
CUDA Version: 13

Relevant Files

Model link: EfficientSAM-S
Model link: ViT-Base

Steps To Reproduce

Model Optimizer → commit

ViT-Base FP8 onnx generation:

python3 -m modelopt.onnx.quantization --onnx_path=./vit_base_patch8_224_Opset17.onnx --quantize_mode=fp8 --output_path=./vitb_fp8.onnx

EfficientSAM-S FP8 onnx generation:

python3 -m modelopt.onnx.quantization --onnx_path=./efficientsam_s_encoder.onnx --quantize_mode=fp8 --output_path=./sam_s_fp8.onnx

ViT-Base FP8 engine generation:

trtexec --stronglyTyped --onnx=./vitb_fp8.onnx --saveEngine=./vitb_fp8.engine

EfficientSAM-S FP8 engine generation:

trtexec --stronglyTyped --onnx=./sam_s_fp8.onnx --saveEngine=./sam_s_fp8.engine

efficientsam_s_encoder_fp8.profile.txt (14.3 KB)

efficientsam_s_encoder_fp16.profile.txt (14.0 KB)

vit_base_patch8_224_Opset17_fp8.profile.txt (13.7 KB)

vit_base_patch8_224_Opset17_fp16.profile.txt (13.3 KB)

profiles_and_layerinfo.zip (28.9 KB)

AastaLLL · October 29, 2025, 3:19am

Hi,

Thanks for reporting this.
We will try it locally and update you with more information.

AastaLLL · October 30, 2025, 7:24am

Hi,

In our test, for EfficientSAM:

FP16: 140.689 qps
FP8: 166.332 qps

Does this align with your experiment?
We set the device to maximum performance before the testing:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

ours.magenta · October 30, 2025, 7:45am

"Does this align with your experiment?"

Yes, pretty much.

For EfficientSAM (S), with the MAXN profile I have:

147 fps for the fp16 engine
172 fps for the fp8 engine

The gain is ~17% on my part and ~18% on yours.
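For reference, the percentages quoted in this thread can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function names are made up for illustration, and the latency-to-throughput conversion assumes throughput is inversely proportional to latency (same batch size, no overlapping inferences), which may not hold for every trtexec configuration.

```python
def throughput_gain(baseline: float, candidate: float) -> float:
    """Relative throughput gain of candidate over baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


def latency_reduction_to_throughput_gain(reduction_pct: float) -> float:
    """Convert a latency reduction (percent) into the equivalent throughput
    gain (percent), assuming throughput = 1 / latency."""
    remaining = 1.0 - reduction_pct / 100.0
    return (1.0 / remaining - 1.0) * 100.0


# (FP16, FP8) EfficientSAM-S throughput figures quoted in this thread.
measurements = {
    "ours.magenta (fps)": (147.0, 172.0),
    "AastaLLL (qps)": (140.689, 166.332),
}

for name, (fp16, fp8) in measurements.items():
    print(f"{name}: FP8 is {throughput_gain(fp16, fp8):.1f}% faster than FP16")

# The ~20% latency reduction reported for ViT-Base, expressed as throughput.
print(f"20% lower latency = {latency_reduction_to_throughput_gain(20.0):.1f}% more throughput")
```

Note that a latency reduction and a throughput gain are not the same number, so it helps to state which one is being compared when exchanging results.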

AastaLLL · November 3, 2025, 8:42am

Hi,

Thanks for the update.

We are gathering more information about this issue with our internal team.
We will share an update later.

AastaLLL · December 1, 2025, 2:13am

Hi,

Thanks for your patience.

We confirm that this is a perf bug and our internal team is working on the fix.
We will keep you updated on the status.

Thanks.

ours.magenta · December 1, 2025, 9:10am

Thanks for the update. It is reassuring to hear that a fix is on its way!

quan.luo.101 · December 15, 2025, 7:26pm

Hi, I see similar results for nvfp4 inference on Jetson Thor. With batch size 128, nvfp4 inference latency is even higher than fp16 latency. The model I used is ViT-Large, which is very similar to the ViT-Base in this post.

And on RTX-6000 Blackwell, nvfp4 quantization can make the model much faster.

Is it also the same bug?

quan.luo.101 · December 15, 2025, 7:30pm

In addition, ViT-Large has 300M parameters.

So theoretically, using nvfp4 rather than fp16 should save 300M × 1.5 bytes = 450 MB of GPU memory.

However, on Thor the saving is only ~200 MB, while on RTX-6000 Blackwell the number looks much better.
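The estimate above can be written out as a small Python sketch. It assumes that every one of the 300M parameters is a weight that actually gets quantized, that fp16 takes 2 bytes per value and a 4-bit value takes 0.5 bytes. The second figure additionally assumes one 8-bit scale factor per block of 16 values, which is my understanding of the nvfp4 layout and should be treated as an assumption. The function name is made up for illustration, and this does not model activations, workspace, or layers left in higher precision.

```python
from typing import Optional


def weight_bytes_saved(
    num_params: float,
    bytes_before: float = 2.0,          # fp16 storage per value
    bits_after: float = 4.0,            # 4-bit storage per value
    block_size: Optional[int] = None,   # values sharing one scale (assumption)
    scale_bits: float = 8.0,            # size of each per-block scale (assumption)
) -> float:
    """Bytes saved by storing num_params weights at bits_after instead of bytes_before."""
    bits_per_value = bits_after
    if block_size is not None:
        # Amortize the per-block scale over the values in the block.
        bits_per_value += scale_bits / block_size
    return num_params * (bytes_before - bits_per_value / 8.0)


MB = 1e6
params = 300e6  # ViT-Large parameter count quoted in this thread

ideal_mb = weight_bytes_saved(params) / MB
with_scales_mb = weight_bytes_saved(params, block_size=16) / MB
observed_mb = 200.0  # saving reported on Thor in this thread

print(f"Ideal saving:            {ideal_mb:.2f} MB")
print(f"With per-block scales:   {with_scales_mb:.2f} MB")
print(f"Observed / ideal saving: {observed_mb / ideal_mb:.0%}")
```

Under these assumptions the scale factors alone account for only a small part of the gap, so the observed ~200 MB would have to come from something else, for example weights that are not stored in 4-bit form in the built engine. That is a guess, not a diagnosis.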

1457689744 · December 17, 2025, 1:29am

Hi, may I ask if Thor supports nvfp4 or mxfp4 or both?

AastaLLL · December 22, 2025, 6:04am

Please find the reply in the topic below:

Thanks.

ours.magenta · January 14, 2026, 10:11am

Hello,

I am not sure I understand why this counts as a solution for this topic.

Could you expand on why it is marked as the solution?

Best,

system · January 28, 2026, 10:11am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
