Unexpected performance for INT8 ResNet50 on Jetson AGX Orin MAXN

Hi,

I’m trying to reproduce the expected INT8 performance for ResNet50-v1-12 on Jetson AGX Orin (power mode MAXN) following the guide here:
https://docs.nvidia.com/deeplearning/tensorrt/latest/performance/best-practices.html

After following all steps, the measured performance is much lower than expected, ~13 TOPS instead of 135 TOPS.

Below are the steps and commands I used:

1. Pulled the image nvcr.io/nvidia/tensorrt:26.01-py3-igpu and created a container.

2. Downloaded the ResNet50-v1-12 ONNX model.

3. Installed the required ModelOpt package:

pip3 install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-modelopt

4. Quantized the model to INT8 using ModelOpt:

python3 -m modelopt.onnx.quantization --onnx_path resnet50-v1-12.onnx --quantize_mode int8 --output_path resnet50-v1-12-quantized.onnx

5. Ran inference using TensorRT:

trtexec --onnx=resnet50-v1-12-quantized.onnx --shapes=data:4x3x224x224 --stronglyTyped --noDataTransfers --useCudaGraph --useSpinWait

Observed result: ~13 TOPS

I would appreciate any guidance on what could be causing this discrepancy, or if there are additional steps I might be missing to reach the expected performance.

Thanks in advance for any help.

Hi,

First, please try to maximize the device performance with the following command:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Then could you try to run it with a larger batch size? Like 128?

Thanks.

Hi,

Thanks for the suggestion.

I ran the following commands to maximize the device performance:

sudo nvpmodel -m 0
sudo jetson_clocks

However, this did not change the observed performance.

I also reran the test with a batch size of 128, and the measured performance increased slightly but is still lower than expected, reaching approximately 18 TOPS.

Please let me know if there are any additional steps or checks I should try.

Thanks,
Hadas

Hi,

@AastaLLL I’m from Developer Relations; I met with @hadas3 since her company is in Inception. I replicated the results and examined them thoroughly.

Short Answer

Your ~13–18 TOPS measurement does not indicate a hardware defect. With the right settings, we pushed ResNet50 INT8 to 42.4 TOPS (5,489 images/sec), a ~3x improvement. However, the 275 TOPS figure on the datasheet is a theoretical hardware peak that no real model can reach. Here’s why, and how to get the most out of your Orin.


How We Achieved 42.4 TOPS (vs your ~13–18 TOPS)

1. Prerequisites on the device (must be run after every reboot)

sudo nvpmodel -m 0      # MAXN power mode
sudo jetson_clocks       # lock all clocks to maximum

Verify with:

sudo jetson_clocks --show   # GPU should show 1300500000 Hz (1.3 GHz)
nvpmodel -q                 # should show MAXN

2. Build the engine with aggressive optimization

trtexec --onnx=resnet50-v1-12.onnx \
    --saveEngine=resnet50_int8_b128.engine \
    --int8 --fp16 \
    --sparsity=force \
    --shapes=data:128x3x224x224 \
    --builderOptimizationLevel=5 \
    --avgTiming=8 \
    --precisionConstraints=prefer \
    --timingCacheFile=timing.cache \
    --duration=0 --iterations=1

Key differences from your original command:

  • --shapes=data:128x3x224x224 — batch 128 instead of 4. Small batches severely underutilize tensor cores.
  • --sparsity=force — enables 2:4 structured sparsity on tensor cores (2x throughput for eligible layers).
  • --builderOptimizationLevel=5 — deepest tactic search.
  • --precisionConstraints=prefer — prefer INT8 where possible.
  • --timingCacheFile — caches tactic profiling so subsequent builds are fast.
  • Removed --stronglyTyped — this can prevent TensorRT from choosing optimal mixed-precision tactics.

3. Benchmark with pipelined inference

trtexec --loadEngine=resnet50_int8_b128.engine \
    --shapes=data:128x3x224x224 \
    --noDataTransfers \
    --useCudaGraph \
    --useSpinWait \
    --warmUp=500 \
    --duration=30 \
    --infStreams=4

Key flags:

  • --infStreams=4 — runs 4 inference streams in parallel to keep the GPU pipeline full.
  • --noDataTransfers — measures pure compute (excludes host-to-device copies).
  • --useCudaGraph — reduces kernel launch overhead to near zero.
  • --useSpinWait — avoids OS scheduler latency between inferences.
  • --warmUp=500 — 500ms warmup before measurement.
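One note on reading the trtexec output, since it often trips people up: the reported Throughput is in qps, where one query is one full engine execution, i.e. one batch. Images/sec is therefore qps times batch size. A minimal conversion sketch (the 42.9 qps figure below is illustrative, not a measurement):

```python
# trtexec prints "Throughput: N qps"; one query = one engine execution,
# i.e. one full batch. Convert to images/sec by multiplying by batch size.
def images_per_sec(qps: float, batch_size: int) -> float:
    return qps * batch_size

# Illustrative: ~42.9 qps at batch 128 is roughly 5,490 images/sec.
print(images_per_sec(42.9, 128))
```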

Our result

Setting                                           Images/sec    TOPS
Your original (batch=4, stronglyTyped)            ~1,700        ~13
Your retest (batch=128)                           ~2,300        ~18
Our optimized (batch=128, sparsity, 4 streams)     5,489        42.4
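For transparency, here is how the TOPS column is derived from images/sec. This is a sketch; the ~7.7 GOPs-per-image figure for ResNet50-v1 is an approximation (about 3.9 GMACs per 224x224 image, counting each multiply-accumulate as 2 ops), chosen because it back-solves consistently against the rows above:

```python
# Effective TOPS = throughput (images/sec) * operations per image.
# ResNet50-v1 performs roughly 7.7e9 ops per 224x224 image (approximation).
OPS_PER_IMAGE = 7.7e9

def tops(images_per_sec: float) -> float:
    """Effective tera-operations per second at a given throughput."""
    return images_per_sec * OPS_PER_IMAGE / 1e12

print(f"{tops(1700):.1f}")  # ~13
print(f"{tops(2300):.1f}")  # ~18
print(f"{tops(5489):.1f}")  # ~42
```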

Why 275 TOPS Is Not Achievable with Any Real Model

The 275 TOPS figure is a theoretical hardware peak calculated from architecture specifications. It is not a benchmark result and has never been demonstrated with any workload.

The exact calculation

275 TOPS = GPU Tensor Cores (170 TOPS) + 2x DLA engines (105 TOPS)

GPU: 64 Tensor Cores x 1,024 INT8 ops/clock x 1.3 GHz x 2 (sparsity) = 170 TOPS
DLA: 2x NVDLA 2.0 at 1.6 GHz = ~105 TOPS

This assumes:

  • 100% tensor core utilization every clock cycle
  • Perfect 2:4 structured sparsity across all weights
  • Zero time spent on anything other than INT8 multiply-accumulate
  • GPU and both DLAs running simultaneously at peak
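The peak figures above can be verified with a few lines of arithmetic (a sketch using only the clock and per-core throughput numbers quoted in the breakdown):

```python
# Recompute the theoretical INT8 peaks quoted above.
tensor_cores = 64
int8_ops_per_clock = 1024        # dense ops per tensor core per cycle
gpu_clock_hz = 1.3e9
sparsity_factor = 2              # 2:4 structured sparsity doubles throughput

gpu_dense_tops = tensor_cores * int8_ops_per_clock * gpu_clock_hz / 1e12
gpu_sparse_tops = gpu_dense_tops * sparsity_factor
dla_tops = 105                   # 2x NVDLA 2.0, per the breakdown above
total_tops = gpu_sparse_tops + dla_tops

print(f"GPU dense:  {gpu_dense_tops:.0f} TOPS")   # ~85
print(f"GPU sparse: {gpu_sparse_tops:.0f} TOPS")  # ~170
print(f"Total:      {total_tops:.0f} TOPS")       # ~275
```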

Why real models fall short

  • Memory bandwidth — Orin has 204.8 GB/s. Tensor cores can consume data faster than memory can supply it; ResNet50 is memory-bandwidth-bound.
  • Non-tensor-core operations — pooling, batch norm, element-wise adds, softmax: these don’t use tensor cores.
  • Layer transitions — data reformats between layers consume time.
  • Sparsity assumption — the 2x sparsity multiplier assumes all weights follow the 2:4 pattern. Dense models (like the original ResNet50 ONNX) get at most 85 TOPS (half of 170).
  • DLA-to-GPU fallback — the DLA cannot run all layers natively. Layers that fall back to the GPU create contention, actually reducing total throughput versus GPU-only.
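To put the memory-bandwidth point in numbers, a quick roofline-style estimate (a sketch using only the figures already quoted above) shows how much work each byte must carry before the tensor cores, rather than DRAM, become the bottleneck:

```python
# Roofline-style check with the figures quoted above:
# 85 TOPS dense INT8 peak and 204.8 GB/s memory bandwidth.
peak_ops_per_sec = 85e12
bytes_per_sec = 204.8e9

# Arithmetic intensity (ops per byte) needed to saturate the tensor cores.
ridge_point = peak_ops_per_sec / bytes_per_sec
print(f"{ridge_point:.0f} INT8 ops per byte")  # ~415

# Any layer doing fewer ops per byte of weights/activations it touches is
# DRAM-bound -- which is the case for much of ResNet50, especially its
# pooling and element-wise layers.
```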

What the best synthetic benchmarks achieve

Even the most favorable workload (a CUTLASS sparse INT8 GEMM — a pure matrix multiply, not a real model) achieves only ~99 TOPS, or 58% of the 170 GPU peak. NVIDIA’s own internal target for these synthetic kernels is 60–70% of peak.

Realistic expectations for ResNet50

Metric                                  Value
GPU theoretical sparse peak             170 TOPS
GPU theoretical dense peak              85 TOPS
Best synthetic GEMM                     ~99 TOPS (58%)
ResNet50 INT8 optimized                 42.4 TOPS (50% of dense peak)
ResNet50 INT8 unoptimized (batch=4)     ~13 TOPS

This 42.4 TOPS result represents 50% of the dense GPU peak (85 TOPS), which is good utilization for a real CNN workload.


Summary

  1. Your hardware is fine. The Orin is performing as expected.
  2. Batch size matters most. Batch=4 leaves the tensor cores mostly idle. Use batch=128+.
  3. Use all the trtexec optimizations listed above to go from ~13 to ~42 TOPS.
  4. 275 TOPS is a calculated hardware spec, like a car’s top speed — useful for comparison, but not achievable on the road.
  5. 42 TOPS on ResNet50 is solid — it’s 50% of the dense GPU peak (85 TOPS), ~25% of the GPU’s theoretical sparse peak (170 TOPS), and ~15% of the full 275 TOPS including the DLAs.

Hope this helps clarify. Happy to answer follow-up questions.

Hi,

Thank you so much for the detailed explanation.

I followed the same steps you described, including maximizing performance (nvpmodel -m 0, jetson_clocks), using a larger batch size, and all the other optimization flags. However, I am seeing a GPU compute time of ~23 ms.

When I tried running with 4 inference streams, the total GPU compute time increased to approximately 98 ms, which is still around 24 ms per image. The throughput remains around 40 images/sec.

Based on this runtime, my own calculation results in roughly ~20 TOPS.

Could you please clarify how you derived the images/sec value, or suggest what else could be affecting the performance?

Thanks again for your help,
Hadas