Issue
According to the Jetson Orin Nano datasheet, the module delivers 17 TFLOPS FP16 performance in Super mode (MAXN_SUPER).
However, our TensorRT benchmarks consistently measure about 10 TFLOPS, roughly 60% of the theoretical peak.
Could you confirm whether this is the expected efficiency due to memory bandwidth constraints, or if there are specific TensorRT optimization flags required to approach the rated 17 TFLOPS?
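To frame the memory-bandwidth question, here is a rough roofline estimate. It is only a sketch: it assumes each of the three matrices crosses DRAM exactly once, and the 102 GB/s bandwidth value is my assumption (it is not stated in this post), so please substitute the datasheet figure for your module.

```python
def gemm_arithmetic_intensity(n: int, bytes_per_element: int = 2) -> float:
    """FLOPs per DRAM byte for an N x N @ N x N GEMM.

    Idealized model: A, B and C each cross DRAM exactly once.
    """
    flops = 2 * n**3
    bytes_moved = 3 * n * n * bytes_per_element
    return flops / bytes_moved


def roofline_tflops(n: int, peak_tflops: float, mem_bw_gbs: float,
                    bytes_per_element: int = 2) -> float:
    """Attainable TFLOPS under the simple roofline model."""
    intensity = gemm_arithmetic_intensity(n, bytes_per_element)
    memory_bound_tflops = intensity * mem_bw_gbs * 1e9 / 1e12
    return min(peak_tflops, memory_bound_tflops)


PEAK_TFLOPS = 17.0  # datasheet figure quoted in this post
MEM_BW_GBS = 102.0  # ASSUMPTION: not from this post; replace with the datasheet value

for n in (1024, 2048, 4096, 8192):
    intensity = gemm_arithmetic_intensity(n)
    bound = roofline_tflops(n, PEAK_TFLOPS, MEM_BW_GBS)
    print(f"N={n:5d}  intensity={intensity:7.1f} FLOP/byte  roofline={bound:5.1f} TFLOPS")
```

Under these assumptions every tested size comes out compute-bound in the model, which is why a memory-bandwidth explanation alone does not obviously account for the gap.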
Hardware
Jetson Orin Nano Developer Kit (official)
Software
JetPack: 6.2 (L4T 36.4.3)
CUDA (in JetPack): 12.6.68
TensorRT (in JetPack): 10.3.0
Benchmark Results
We export a minimal Linear/GEMM layer (N×N @ N×N) from PyTorch to ONNX, build the TensorRT engine with `trtexec --onnx=model.onnx --fp16`, and benchmark it with `trtexec --loadEngine=model.engine`.
We ran `sudo jetson_clocks` before benchmarking to lock the GPU and CPU at their maximum clock frequencies.
Performance Results (computational cost: 2N³ FLOPs)
| N | Latency (ms) | TFLOPS | % of Peak |
|-------|-------------:|-------:|---------:|
| 1024 | 0.32 | 6.7 | 39% |
| 2048 | 2.11 | 8.1 | 48% |
| 4096 | 14.39 | 9.6 | 56% |
| 8192 | 104.79 | 10.5 | 62% |
The measured performance plateaus at approximately 10.5 TFLOPS (62% of the specified 17 TFLOPS peak) even for large compute-bound workloads (N=8192), significantly below the theoretical maximum.
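For reference, the TFLOPS and % of peak columns follow directly from the mean latency and the 2N³ FLOP count. A minimal sketch of that conversion (the function name is just illustrative):

```python
def gemm_tflops(n: int, latency_ms: float) -> float:
    """Achieved TFLOPS for an N x N @ N x N GEMM, counting 2*N^3 FLOPs."""
    flops = 2 * n**3
    seconds = latency_ms * 1e-3
    return flops / seconds / 1e12


PEAK_TFLOPS = 17.0  # datasheet figure quoted in this post

# Mean latencies (ms) taken from the table in this post
measurements = {1024: 0.32, 2048: 2.11, 4096: 14.39, 8192: 104.79}

for n, latency_ms in measurements.items():
    tflops = gemm_tflops(n, latency_ms)
    pct = 100 * tflops / PEAK_TFLOPS
    print(f"N={n:5d}  {latency_ms:8.2f} ms  {tflops:5.1f} TFLOPS  {pct:3.0f}% of peak")
```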
Steps to Reproduce (using N=8192 as an example)
1. Generate .onnx file
import torch
import torch.nn as nn

N = 8192


class GemmModel(nn.Module):
    """Single bias-free Linear layer, i.e. one N x N @ N x N GEMM."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(N, N, bias=False)

    def forward(self, A):
        return self.linear(A)


def main():
    model = GemmModel().eval().cuda()
    # Dummy input; only its shape and dtype matter for the export.
    A = torch.empty((N, N), dtype=torch.float32, device="cuda")
    # The model is exported in FP32; FP16 is enabled later via trtexec --fp16.
    onnx_path = f"gemm_{N}_fp32.onnx"
    print(f"Exporting to {onnx_path} ...")
    torch.onnx.export(
        model,
        A,
        onnx_path,
        input_names=["A"],
        output_names=["C"],
        opset_version=17,
        do_constant_folding=False,
        dynamic_axes=None,  # static shapes
    )
    print("Done.")


if __name__ == "__main__":
    main()
2. Generate TensorRT .engine file
/usr/src/tensorrt/bin/trtexec \
--onnx=gemm_8192_fp32.onnx \
--saveEngine=gemm_8192_fp16.engine \
--fp16 \
--verbose
3. Benchmark with TensorRT
sudo jetson_clocks
/usr/src/tensorrt/bin/trtexec \
--loadEngine=gemm_8192_fp16.engine \
--iterations=1000 \
--warmUp=10 \
--duration=0 \
--avgRuns=1000 \
--useSpinWait \
--noDataTransfers \
--verbose
Then read the mean latency from a log line like the one below (this sample shows about 0.32 ms, which matches the N=1024 row of the table):
[02/04/2026-16:08:45] [I] Latency: min = 0.317444 ms, max = 0.325504 ms, mean = 0.32109 ms, median = 0.32106 ms, percentile(90%) = 0.322817 ms, percentile(95%) = 0.323357 ms, percentile(99%) = 0.324524 ms
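If it helps anyone reproduce the numbers, the mean can be pulled out of the log with a small script. This is only a sketch that assumes the line format shown above; the function name is illustrative, and the pattern is anchored on `[I] Latency:` so that other lines ending in "Latency:" are not picked up.

```python
import re

# Matches the "[I] Latency: ... mean = X ms" summary line shown in this post.
MEAN_LATENCY_RE = re.compile(r"\[I\] Latency: .*?mean = ([0-9.]+) ms")


def mean_latency_ms(log_text: str) -> float:
    """Return the mean latency in ms from trtexec log text."""
    match = MEAN_LATENCY_RE.search(log_text)
    if match is None:
        raise ValueError("no '[I] Latency: ... mean = ... ms' line found")
    return float(match.group(1))


# Sample line copied (and shortened) from this post
sample = (
    "[02/04/2026-16:08:45] [I] Latency: min = 0.317444 ms, max = 0.325504 ms, "
    "mean = 0.32109 ms, median = 0.32106 ms"
)
print(mean_latency_ms(sample))
```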
Additional Findings
During the benchmarks above, the VDD_IN input current reaches its limit.
However, VDD_CPU_GPU_CV (11.7 W) and VDD_SOC (5.8 W) together draw only 17.5 W, which leaves 6.9 W of the 24.4 W measured at VDD_IN unaccounted for.
Is this power distribution expected? Could the discrepancy be related to the lower-than-rated performance?
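The arithmetic behind that gap, as a small sketch using only the rail readings quoted in this post (the function name is illustrative). It simply reports what the two monitored rails do not cover and makes no claim about where the remainder goes.

```python
def unaccounted_power_w(vdd_in_w: float, rails_w: dict) -> float:
    """Power at VDD_IN that is not covered by the listed rails, in watts."""
    return vdd_in_w - sum(rails_w.values())


# Readings quoted in this post
VDD_IN_W = 24.4
RAILS_W = {"VDD_CPU_GPU_CV": 11.7, "VDD_SOC": 5.8}

accounted = sum(RAILS_W.values())
gap = unaccounted_power_w(VDD_IN_W, RAILS_W)
print(f"accounted: {accounted:.1f} W, unaccounted: {gap:.1f} W "
      f"({100 * gap / VDD_IN_W:.0f}% of VDD_IN)")
```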




