Performance discrepancy: TensorRT achieves ~10 TFLOPS vs. 17 TFLOPS spec on Orin Nano (Super mode)

Issue

According to the Jetson Orin Nano datasheet, the module delivers 17 TFLOPS FP16 performance in Super mode (MAXN_SUPER).

However, our TensorRT benchmarks consistently measure ~10 TFLOPS, i.e. approximately 60% of the theoretical peak.

Could you confirm whether this is the expected efficiency due to memory bandwidth constraints, or if there are specific TensorRT optimization flags required to approach the rated 17 TFLOPS?

Hardware

Jetson Orin Nano Developer Kit (official)

Software

JetPack: 6.2 (L4T 36.4.3)
CUDA (in JetPack): 12.6.68
TensorRT (in JetPack): 10.3.0

Benchmark Results

We export a minimal Linear/GEMM layer (N×N @ N×N) from PyTorch to ONNX, build the TensorRT engine with trtexec --onnx=model.onnx, and benchmark it with trtexec --loadEngine=model.engine.
We ran sudo jetson_clocks before the benchmarks to lock the GPU and CPU at their maximum clock frequencies.

Performance Results (computational cost: 2N³ FLOPs per GEMM; peak = 17 TFLOPS)

| N     | Latency (ms) | TFLOPS | % of Peak |
|-------|-------------:|-------:|---------:|
| 1024  | 0.32         | 6.7    | 39%      |
| 2048  | 2.11         | 8.1    | 48%      |
| 4096  | 14.39        | 9.6    | 56%      |
| 8192  | 104.79       | 10.5   | 62%      |

The measured performance plateaus at approximately 10.5 TFLOPS (62% of the specified 17 TFLOPS peak) even for large compute-bound workloads (N=8192), significantly below the theoretical maximum.
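For reference, the TFLOPS and % of peak columns can be recomputed from the latencies. This is a minimal sketch that assumes the 2N³ FLOP count and the 17 TFLOPS datasheet figure stated above; the function name is only illustrative.

```python
def gemm_tflops(n: int, latency_ms: float) -> float:
    """Throughput of an N x N @ N x N GEMM, counting 2*N^3 FLOPs."""
    flops = 2 * n**3
    return flops / (latency_ms * 1e-3) / 1e12


PEAK_TFLOPS = 17.0  # FP16 datasheet figure quoted in this post

# (N, mean latency in ms) copied from the table above
measurements = [(1024, 0.32), (2048, 2.11), (4096, 14.39), (8192, 104.79)]

for n, ms in measurements:
    tflops = gemm_tflops(n, ms)
    pct = 100 * tflops / PEAK_TFLOPS
    print(f"N={n:5d}  {tflops:5.1f} TFLOPS  {pct:3.0f}% of peak")
```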

Steps to Reproduce (using N=8192 as an example)

1. Generate the .onnx file

import torch
import torch.nn as nn

N = 8192

class GemmModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(N, N, bias=False)
        
    def forward(self, A):
        return self.linear(A)

def main():
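    model = GemmModel().eval().cuda()

    # Dummy input used only for tracing during export; its values do not matter.
    A = torch.empty((N, N), dtype=torch.float32, device="cuda")

    # The ONNX graph stays FP32; FP16 is selected later via trtexec --fp16.
    onnx_path = f"gemm_{N}_fp32.onnx"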

    print(f"Exporting to {onnx_path} ...")

    torch.onnx.export(
        model,
        A,
        onnx_path,
        input_names=["A"],
        output_names=["C"],
        opset_version=17,
        do_constant_folding=False,
        dynamic_axes=None
    )

    print("Done.")

if __name__ == "__main__":
    main()

2. Generate the TensorRT .engine file

/usr/src/tensorrt/bin/trtexec \
    --onnx=gemm_8192_fp32.onnx \
    --saveEngine=gemm_8192_fp16.engine \
    --fp16 \
    --verbose

3. Benchmark with TensorRT

sudo jetson_clocks

/usr/src/tensorrt/bin/trtexec \
  --loadEngine=gemm_8192_fp16.engine \
  --iterations=1000 \
  --warmUp=10 \
  --duration=0 \
  --avgRuns=1000 \
  --useSpinWait \
  --noDataTransfers \
  --verbose

Read the mean latency from the "Latency" line of the log. For example (this sample line matches the N=1024 result in the table):

[02/04/2026-16:08:45] [I] Latency: min = 0.317444 ms, max = 0.325504 ms, mean = 0.32109 ms, median = 0.32106 ms, percentile(90%) = 0.322817 ms, percentile(95%) = 0.323357 ms, percentile(99%) = 0.324524 ms
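If you want to collect this value automatically, something like the following could work. It is only a sketch: the regular expression assumes the log line keeps the exact layout shown above, and it anchors on "[I] Latency:" so that other summary lines containing the word "Latency" are not picked up.

```python
import re

# Matches the "[I] Latency: ... mean = X ms" summary line shown above.
MEAN_RE = re.compile(r"\[I\] Latency:.*?\bmean = ([0-9.]+) ms")


def mean_latency_ms(log_text: str) -> float:
    """Return the mean latency in ms from trtexec log text."""
    match = MEAN_RE.search(log_text)
    if match is None:
        raise ValueError("no '[I] Latency: ... mean = X ms' line found")
    return float(match.group(1))


sample = (
    "[02/04/2026-16:08:45] [I] Latency: min = 0.317444 ms, max = 0.325504 ms, "
    "mean = 0.32109 ms, median = 0.32106 ms, percentile(90%) = 0.322817 ms, "
    "percentile(95%) = 0.323357 ms, percentile(99%) = 0.324524 ms"
)
print(mean_latency_ms(sample))
```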

Additional Findings

During the aforementioned benchmarks, the VDD_IN input current hits its limit (reaches maximum draw).

However, the combined power draw of VDD_CPU_GPU_CV (11.7W) and VDD_SOC (5.8W) totals 17.5W, leaving a 6.9W discrepancy (24.4W - 17.5W) compared to the VDD_IN power.

Is this power distribution expected behavior? Could this discrepancy be related to the suboptimal performance observed?
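The arithmetic behind the 6.9 W figure, written out as a small sketch. It uses only the readings quoted above. Whether the two monitored rails are supposed to account for everything drawn from VDD_IN is exactly the open question, so the code makes no assumption about where the rest goes.

```python
# Power readings quoted in this post, in watts.
vdd_in_w = 24.4
monitored_rails_w = {"VDD_CPU_GPU_CV": 11.7, "VDD_SOC": 5.8}

monitored_total_w = sum(monitored_rails_w.values())
unaccounted_w = vdd_in_w - monitored_total_w

print(f"monitored rails: {monitored_total_w:.1f} W")
print(f"unaccounted:     {unaccounted_w:.1f} W "
      f"({100 * unaccounted_w / vdd_in_w:.0f}% of VDD_IN)")
```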

*** Please note that this reply is generated by LLM automatically ***

Based on the provided information, the issue with the Jetson Orin Nano module achieving only approximately 62% of the theoretical peak performance of 17 TFLOPS in Super mode (MAXN_SUPER) can be attributed to memory bandwidth constraints.

The Jetson Orin Nano module is designed to deliver high performance, but it is limited by the memory bandwidth. The module’s memory bandwidth is not sufficient to sustain the peak performance of 17 TFLOPS for computationally intensive workloads.

The benchmark results show that the performance plateaus at around 62% of the specified peak, even for large computationally bound workloads. This suggests that the memory bandwidth is the bottleneck, limiting the module’s ability to achieve the theoretical peak performance.

To approach the rated 17 TFLOPS, specific TensorRT optimization flags may be required. However, the provided information does not specify the exact flags or optimizations needed.

Regarding the power distribution, the discrepancy between the combined power draw of 11.7W and 5.8W, totaling 17.5W, and the VDD_IN power of 24.4W is not expected behavior. This discrepancy could be related to the suboptimal performance observed, as it may indicate that the module is not operating within its optimal power range.

To further investigate this issue, it is recommended to:

  1. Check the memory bandwidth utilization during the benchmarks to confirm that it is the bottleneck.
  2. Experiment with different TensorRT optimization flags to see if they can improve performance.
  3. Verify that the module is operating within its optimal power range and that the power distribution is correct.
  4. Consult the Jetson Orin Nano datasheet and technical documentation for more information on optimizing performance and power consumption.
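As a first pass at item 1, a rough roofline-style estimate can indicate whether a GEMM of a given size should be limited by memory bandwidth at all. This is a sketch under strong assumptions: ideal data reuse (A and B read once, C written once), FP16 operands, the 17 TFLOPS peak from the original post, and the 68 GB/s bandwidth figure quoted later in this reply. Real kernels move more data than this, so it is an optimistic lower bound on traffic and does not replace profiling.

```python
def gemm_arithmetic_intensity(n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an N x N @ N x N GEMM with ideal reuse."""
    flops = 2 * n**3
    bytes_moved = 3 * n * n * bytes_per_elem  # read A, read B, write C
    return flops / bytes_moved


PEAK_FLOPS = 17e12  # FP16 peak from the original post
MEM_BW = 68e9       # bytes/s, figure quoted in this reply (assumption)

# FLOPs the GPU can execute per byte of DRAM traffic at peak.
machine_balance = PEAK_FLOPS / MEM_BW

for n in (1024, 2048, 4096, 8192):
    ai = gemm_arithmetic_intensity(n)
    bound = "compute" if ai > machine_balance else "memory"
    print(f"N={n:5d}  intensity={ai:7.1f} FLOP/B  "
          f"balance={machine_balance:.0f} FLOP/B  -> {bound}-bound (ideal)")
```

Under these assumptions every size in the benchmark table sits above the machine balance, so measuring actual memory traffic is still needed to confirm or rule out a bandwidth bottleneck.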

Comparison with other modules:

The Jetson Orin NX series and Jetson Orin Nano series have different performance and power consumption characteristics. The Jetson Orin NX series is designed for more demanding applications and has a higher peak performance of up to 21 TFLOPS. In contrast, the Jetson Orin Nano series is designed for more power-efficient applications and has a lower peak performance of up to 17 TFLOPS.

In terms of memory bandwidth, the Jetson Orin NX series has a higher memory bandwidth of up to 136 GB/s, while the Jetson Orin Nano series has a lower memory bandwidth of up to 68 GB/s. This difference in memory bandwidth may contribute to the performance difference between the two modules.

Overall, the Jetson Orin Nano module’s performance is expected to be lower than the Jetson Orin NX series due to its lower memory bandwidth and peak performance. However, the module’s power efficiency and compact size make it suitable for applications where power consumption is a critical factor.


Hi,

Could you try our CUTLASS library below?

The TFLOPS depends on the data size. For example, below is a similar experiment for INT8:

Thanks.

@AastaLLL
Thank you for your reply. I ran a new benchmark using CUTLASS 2.11 (tuned for approximately 5 hours) and reached a new FP16 peak of 13.1 TFLOPS with FP32 accumulation, which is 77% of the theoretical maximum. However, there is still a noticeable gap to the theoretical peak.

Could you advise on how to get better performance, and how to apply these optimal configurations to TensorRT or PyTorch?

The specific operator configurations found during the search are attached.

perf_8192_final.gemm.csv.txt (72.9 KB)

Steps to Reproduce

git clone -b v2.11.0  https://github.com/NVIDIA/cutlass.git
cd cutlass/
mkdir build && cd build
cmake ..   -DCUTLASS_NVCC_ARCHS=87   -DCUTLASS_LIBRARY_KERNELS="gemm"   -DCUTLASS_UNITY_BUILD=ON   -DCUTLASS_ENABLE_TESTS=OFF   -DCUTLASS_ENABLE_CUBLAS=OFF   -DCMAKE_BUILD_TYPE=Release   -DCMAKE_CUDA_FLAGS="-O3 -Xptxas -v"
make -j6 cutlass_profiler
cd tools/profiler
./cutlass_profiler --operation=Gemm   --m=8192 --n=8192 --k=8192   --A=f16:row --B=f16:row   --beta=0 --output=perf_8192_final.csv

@AastaLLL

New Investigation: Official Carrier Board Power Design May Be Limiting Nano to 25W, Blocking Full MAXN_SUPER Performance

Investigation 1

According to the design guide (jetson-orin-nx-series-nano-series-design-guide), achieving 40W MAXN_SUPER performance on the Orin NX requires a VDD_IN of at least 8V.

We are not sure whether the Nano has a similar requirement (no documentation states this explicitly), but the Nano is currently supplied with only ~5V.

Attachment: Screenshot from page 18 of the above document

Investigation 2

According to P3768_A04_Concept_schematics.pdf in jetson_orin_nano_devkit_carrier_board_reference_design_files_a04_20230320, the official carrier board for Nano currently limits the power supply to 5V only.

With a 5V supply and a 5.1A maximum current, the board can indeed deliver only about 25W (5V × 5.1A = 25.5W).

Attachment 1: Screenshot from page 8 of the above document

Attachment 2: Current maximum power: under 25W @ 4.8V
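The power ceiling implied by these numbers can be worked out directly. This sketch assumes the 5.1A limit also applies at the measured 4.8V, which the schematic excerpt does not state explicitly.

```python
def input_power_w(volts: float, amps: float) -> float:
    """Electrical input power P = V * I, in watts."""
    return volts * amps


MAX_CURRENT_A = 5.1  # current limit quoted above

nominal_w = input_power_w(5.0, MAX_CURRENT_A)   # nominal supply voltage
measured_w = input_power_w(4.8, MAX_CURRENT_A)  # voltage observed under load

print(f"nominal ceiling:  {nominal_w:.1f} W")
print(f"measured ceiling: {measured_w:.1f} W")
```

The value at 4.8V comes out close to the 24.4W VDD_IN reading reported earlier in the thread, which fits the observation that the input current is at its limit during the benchmark.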

Investigation 3

According to the official blog, for Nano Super, 25W and MAXN_SUPER are two distinct performance configurations.

We therefore suspect that the module itself no longer enforces the power limit, but that the carrier board power supply described above restricts the module's input power, so it cannot reach the expected MAXN_SUPER performance.

Attachment: Table 1 of the above document

Hi,

The maximum GPU clocks for 25W and MAXN SUPER are different.
So you can check the maximum clock in your system to double-confirm.

Here is another related post for your reference.
In general, we expect to reach 60-70% of SOL (speed of light, i.e. the theoretical peak performance).

Thanks.
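To put the 60-70% guideline next to the numbers reported in this thread, here is a quick sketch. It uses only the 17 TFLOPS peak and the two N=8192 results posted above; the names are illustrative.

```python
PEAK_TFLOPS = 17.0        # datasheet FP16 peak used throughout the thread
SOL_RANGE = (0.60, 0.70)  # expected fraction of peak quoted above

# N=8192 results reported earlier in the thread, in TFLOPS.
results = {"TensorRT FP16": 10.5, "CUTLASS 2.11": 13.1}


def sol_fraction(tflops: float, peak: float = PEAK_TFLOPS) -> float:
    """Fraction of the theoretical peak that was achieved."""
    return tflops / peak


low, high = (f * PEAK_TFLOPS for f in SOL_RANGE)
print(f"expected range: {low:.1f} to {high:.1f} TFLOPS")

for name, tflops in results.items():
    frac = sol_fraction(tflops)
    if frac < SOL_RANGE[0]:
        verdict = "below"
    elif frac > SOL_RANGE[1]:
        verdict = "above"
    else:
        verdict = "within"
    print(f"{name}: {100 * frac:.0f}% of peak ({verdict} the expected range)")
```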

@AastaLLL
Thanks for the tip! I’ve confirmed MAXN SUPER reaches 1020 MHz on my end.
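As a sanity check on how the clock relates to the rated peak, here is a back-of-the-envelope sketch. Two numbers in it do not come from this thread and are assumptions on my part: 32 Tensor Cores on the Orin Nano 8GB GPU, and 256 dense FP16 FMAs per Tensor Core per clock. Please verify both against the datasheet before relying on them.

```python
TENSOR_CORES = 32         # assumption: Orin Nano 8GB GPU, not stated in thread
FMA_PER_CLK_PER_TC = 256  # assumption: dense FP16 FMAs per Tensor Core per clock
FLOPS_PER_FMA = 2         # one multiply plus one add


def peak_fp16_tflops(clock_mhz: float) -> float:
    """Theoretical dense FP16 Tensor Core peak at a given GPU clock."""
    flops_per_clk = TENSOR_CORES * FMA_PER_CLK_PER_TC * FLOPS_PER_FMA
    return flops_per_clk * clock_mhz * 1e6 / 1e12


print(f"{peak_fp16_tflops(1020):.1f} TFLOPS at 1020 MHz")
```

Under these assumptions the peak scales linearly with the GPU clock, and 1020 MHz gives roughly the rated 17 TFLOPS, so a lower maximum clock would lower the achievable ceiling in proportion.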
