Unexpected performance for INT8 ResNet50 on Jetson AGX Orin MAXN

Hi,

I’m trying to reproduce the expected INT8 performance for ResNet50-v1-12 on Jetson AGX Orin (power mode MAXN) following the guide here:
https://docs.nvidia.com/deeplearning/tensorrt/latest/performance/best-practices.html

After following all steps, the measured performance is much lower than expected, ~13 TOPS instead of 135 TOPS.

Below are the steps and commands I used:

1. Pulled the image nvcr.io/nvidia/tensorrt:26.01-py3-igpu and created a container.

2. Downloaded the ResNet50-v1-12 ONNX model.

3. Installed the required ModelOpt package:

pip3 install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-modelopt

4. Quantized the model to INT8 using ModelOpt:

python3 -m modelopt.onnx.quantization --onnx_path resnet50-v1-12.onnx --quantize_mode int8 --output_path resnet50-v1-12-quantized.onnx

5. Ran inference using TensorRT:

trtexec --onnx=resnet50-v1-12-quantized.onnx --shapes=data:4x3x224x224 --stronglyTyped --noDataTransfers --useCudaGraph --useSpinWait

Observed result: ~13 TOPS

I would appreciate any guidance on what could be causing this discrepancy, or if there are additional steps I might be missing to reach the expected performance.

Thanks in advance for any help.

Hi,

First, please try to maximize the device performance with the following command:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Then could you try to run it with a larger batch size? Like 128?

Thanks.

Hi,

Thanks for the suggestion.

I ran the following commands to maximize the device performance:

sudo nvpmodel -m 0
sudo jetson_clocks

However, this did not change the observed performance.

I also reran the test with a batch size of 128, and the measured performance increased slightly but is still lower than expected, reaching approximately 18 TOPS.

Please let me know if there are any additional steps or checks I should try.

Thanks,
Hadas

Hi,

@AastaLLL I’m from Developer Relations; I met with @hadas3 since her company is in Inception. I replicated the results and examined them thoroughly.

Short Answer

Your ~13–18 TOPS measurement does not indicate a hardware defect. With the right settings, we pushed ResNet50 INT8 to 42.4 TOPS (5,489 images/sec), a ~3x improvement. However, the 275 TOPS figure on the datasheet is a theoretical hardware peak that no real model can reach. Here’s why, and how to get the most out of your Orin.


How We Achieved 42.4 TOPS (vs your ~13–18 TOPS)

1. Prerequisites on the device (must be run after every reboot)

sudo nvpmodel -m 0      # MAXN power mode
sudo jetson_clocks       # lock all clocks to maximum

Verify with:

sudo jetson_clocks --show   # GPU should show 1300500000 Hz (1.3 GHz)
nvpmodel -q                 # should show MAXN

2. Build the engine with aggressive optimization

trtexec --onnx=resnet50-v1-12.onnx \
    --saveEngine=resnet50_int8_b128.engine \
    --int8 --fp16 \
    --sparsity=force \
    --shapes=data:128x3x224x224 \
    --builderOptimizationLevel=5 \
    --avgTiming=8 \
    --precisionConstraints=prefer \
    --timingCacheFile=timing.cache \
    --duration=0 --iterations=1

Key differences from your original command:

  • --shapes=data:128x3x224x224 — batch 128 instead of 4. Small batches severely underutilize tensor cores.
  • --sparsity=force — enables 2:4 structured sparsity on tensor cores (2x throughput for eligible layers).
  • --builderOptimizationLevel=5 — deepest tactic search.
  • --precisionConstraints=prefer — prefer INT8 where possible.
  • --timingCacheFile — caches tactic profiling so subsequent builds are fast.
  • Removed --stronglyTyped — this can prevent TensorRT from choosing optimal mixed-precision tactics.

3. Benchmark with pipelined inference

trtexec --loadEngine=resnet50_int8_b128.engine \
    --shapes=data:128x3x224x224 \
    --noDataTransfers \
    --useCudaGraph \
    --useSpinWait \
    --warmUp=500 \
    --duration=30 \
    --infStreams=4

Key flags:

  • --infStreams=4 — runs 4 inference streams in parallel to keep the GPU pipeline full.
  • --noDataTransfers — measures pure compute (excludes host-to-device copies).
  • --useCudaGraph — reduces kernel launch overhead to near zero.
  • --useSpinWait — avoids OS scheduler latency between inferences.
  • --warmUp=500 — 500ms warmup before measurement.
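One note on reading the trtexec output, since it often trips people up: the reported Throughput is in qps, where one query is one full engine execution, i.e. one batch. Images/sec is therefore qps times batch size. A minimal conversion sketch (the 42.9 qps figure below is illustrative, not a measurement):

```python
# trtexec prints "Throughput: N qps"; one query = one engine execution,
# i.e. one full batch. Convert to images/sec by multiplying by batch size.
def images_per_sec(qps: float, batch_size: int) -> float:
    return qps * batch_size

# Illustrative: ~42.9 qps at batch 128 is roughly 5,490 images/sec.
print(images_per_sec(42.9, 128))
```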

Our result

Setting                                           Images/sec    TOPS
Your original (batch=4, stronglyTyped)            ~1,700        ~13
Your retest (batch=128)                           ~2,300        ~18
Our optimized (batch=128, sparsity, 4 streams)     5,489        42.4
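For transparency, here is how the TOPS column is derived from images/sec. This is a sketch; the ~7.7 GOPs-per-image figure for ResNet50-v1 is an approximation (about 3.9 GMACs per 224x224 image, counting each multiply-accumulate as 2 ops), chosen because it back-solves consistently against the rows above:

```python
# Effective TOPS = throughput (images/sec) * operations per image.
# ResNet50-v1 performs roughly 7.7e9 ops per 224x224 image (approximation).
OPS_PER_IMAGE = 7.7e9

def tops(images_per_sec: float) -> float:
    """Effective tera-operations per second at a given throughput."""
    return images_per_sec * OPS_PER_IMAGE / 1e12

print(f"{tops(1700):.1f}")  # ~13
print(f"{tops(2300):.1f}")  # ~18
print(f"{tops(5489):.1f}")  # ~42
```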

Why 275 TOPS Is Not Achievable with Any Real Model

The 275 TOPS figure is a theoretical hardware peak calculated from architecture specifications. It is not a benchmark result and has never been demonstrated with any workload.

The exact calculation

275 TOPS = GPU Tensor Cores (170 TOPS) + 2x DLA engines (105 TOPS)

GPU: 64 Tensor Cores x 1,024 INT8 ops/clock x 1.3 GHz x 2 (sparsity) = 170 TOPS
DLA: 2x NVDLA 2.0 at 1.6 GHz = ~105 TOPS

This assumes:

  • 100% tensor core utilization every clock cycle
  • Perfect 2:4 structured sparsity across all weights
  • Zero time spent on anything other than INT8 multiply-accumulate
  • GPU and both DLAs running simultaneously at peak
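The peak figures above can be verified with a few lines of arithmetic (a sketch using only the clock and per-core throughput numbers quoted in the breakdown):

```python
# Recompute the theoretical INT8 peaks quoted above.
tensor_cores = 64
int8_ops_per_clock = 1024        # dense ops per tensor core per cycle
gpu_clock_hz = 1.3e9
sparsity_factor = 2              # 2:4 structured sparsity doubles throughput

gpu_dense_tops = tensor_cores * int8_ops_per_clock * gpu_clock_hz / 1e12
gpu_sparse_tops = gpu_dense_tops * sparsity_factor
dla_tops = 105                   # 2x NVDLA 2.0, per the breakdown above
total_tops = gpu_sparse_tops + dla_tops

print(f"GPU dense:  {gpu_dense_tops:.0f} TOPS")   # ~85
print(f"GPU sparse: {gpu_sparse_tops:.0f} TOPS")  # ~170
print(f"Total:      {total_tops:.0f} TOPS")       # ~275
```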

Why real models fall short

  • Memory bandwidth — Orin has 204.8 GB/s. Tensor cores can consume data faster than memory can supply it; ResNet50 is memory-bandwidth-bound.
  • Non-tensor-core operations — pooling, batch norm, element-wise adds, softmax: these don’t use tensor cores.
  • Layer transitions — data reformats between layers consume time.
  • Sparsity assumption — the 2x sparsity multiplier assumes all weights follow the 2:4 pattern. Dense models (like the original ResNet50 ONNX) get at most 85 TOPS (half of 170).
  • DLA-to-GPU fallback — the DLA cannot run all layers natively. Layers that fall back to the GPU create contention, actually reducing total throughput versus GPU-only.
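To put the memory-bandwidth point in numbers, a quick roofline-style estimate (a sketch using only the figures already quoted above) shows how much work each byte must carry before the tensor cores, rather than DRAM, become the bottleneck:

```python
# Roofline-style check with the figures quoted above:
# 85 TOPS dense INT8 peak and 204.8 GB/s memory bandwidth.
peak_ops_per_sec = 85e12
bytes_per_sec = 204.8e9

# Arithmetic intensity (ops per byte) needed to saturate the tensor cores.
ridge_point = peak_ops_per_sec / bytes_per_sec
print(f"{ridge_point:.0f} INT8 ops per byte")  # ~415

# Any layer doing fewer ops per byte of weights/activations it touches is
# DRAM-bound -- which is the case for much of ResNet50, especially its
# pooling and element-wise layers.
```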

What the best synthetic benchmarks achieve

Even the most favorable workload (a CUTLASS sparse INT8 GEMM — a pure matrix multiply, not a real model) achieves only ~99 TOPS, or 58% of the 170 GPU peak. NVIDIA’s own internal target for these synthetic kernels is 60–70% of peak.

Realistic expectations for ResNet50

Metric                                  Value
GPU theoretical sparse peak             170 TOPS
GPU theoretical dense peak              85 TOPS
Best synthetic GEMM                     ~99 TOPS (58%)
ResNet50 INT8 optimized                 42.4 TOPS (50% of dense peak)
ResNet50 INT8 unoptimized (batch=4)     ~13 TOPS

This 42.4 TOPS result represents 50% of the dense GPU peak (85 TOPS), which is good utilization for a real CNN workload.


Summary

  1. Your hardware is fine. The Orin is performing as expected.
  2. Batch size matters most. Batch=4 leaves the tensor cores mostly idle. Use batch=128+.
  3. Use all the trtexec optimizations listed above to go from ~13 to ~42 TOPS.
  4. 275 TOPS is a calculated hardware spec, like a car’s top speed — useful for comparison, but not achievable on the road.
  5. 42 TOPS on ResNet50 is solid — it’s 50% of the dense GPU peak (85 TOPS), ~25% of the GPU’s theoretical sparse peak (170 TOPS), and ~15% of the full 275 TOPS including the DLAs.

Hope this helps clarify. Happy to answer follow-up questions.

Hi,

Thank you so much for the detailed explanation.

I followed the same steps you described, including maximizing performance (nvpmodel -m 0, jetson_clocks), using a larger batch size, and all the other optimization flags. However, I am seeing a GPU compute time of ~23 ms.

When I tried running with 4 inference streams, the total GPU compute time increased to approximately 98 ms, which is still around 24 ms per image. The throughput remains around 40 images/sec.

Based on this runtime, my own calculation results in roughly ~20 TOPS.

Could you please clarify how you derived the images/sec value, or suggest what else could be affecting the performance?

Thanks again for your help,
Hadas