Hi
@AastaLLL I am from Developer Relations; I met with @hadas3 since her company is in the Inception program. I replicated your results and examined them thoroughly:
Short Answer
Your ~13–18 TOPS measurement does not indicate a hardware defect. With the right settings, we pushed ResNet50 INT8 to 42.4 TOPS (5,489 images/sec) — a ~3x improvement. But the 275 TOPS figure on the datasheet is a theoretical hardware peak that no real model can reach. Here’s why, and how to get the most out of your Orin.
How We Achieved 42.4 TOPS (vs your ~13–18 TOPS)
1. Prerequisites on the host (must run after every reboot)
sudo nvpmodel -m 0 # MAXN power mode
sudo jetson_clocks # lock all clocks to maximum
Verify with:
sudo jetson_clocks --show # GPU should show 1300500000 Hz
nvpmodel -q # should show MAXN
2. Build the engine with aggressive optimization
trtexec --onnx=resnet50-v1-12.onnx \
--saveEngine=resnet50_int8_b128.engine \
--int8 --fp16 \
--sparsity=force \
--shapes=data:128x3x224x224 \
--builderOptimizationLevel=5 \
--avgTiming=8 \
--precisionConstraints=prefer \
--timingCacheFile=timing.cache \
--duration=0 --iterations=1
Key differences from your original command:
- --shapes=data:128x3x224x224 — batch 128 instead of 4. Small batches severely underutilize the tensor cores.
- --sparsity=force — enables 2:4 structured sparsity on the tensor cores (2x throughput for eligible layers).
- --builderOptimizationLevel=5 — deepest tactic search.
- --precisionConstraints=prefer — prefer INT8 where possible.
- --timingCacheFile — caches tactic profiling so subsequent builds are fast.
- Removed --stronglyTyped — this flag can prevent TensorRT from choosing optimal mixed-precision tactics.
3. Benchmark with pipelined inference
trtexec --loadEngine=resnet50_int8_b128.engine \
--shapes=data:128x3x224x224 \
--noDataTransfers \
--useCudaGraph \
--useSpinWait \
--warmUp=500 \
--duration=30 \
--infStreams=4
Key flags:
- --infStreams=4 — runs 4 inference streams in parallel to keep the GPU pipeline full.
- --noDataTransfers — measures pure compute (excludes host-to-device copies).
- --useCudaGraph — reduces kernel launch overhead to near zero.
- --useSpinWait — avoids OS scheduler latency between inferences.
- --warmUp=500 — 500 ms warmup before measurement.
Our result
| Setting | Images/sec | TOPS |
| --- | --- | --- |
| Your original (batch=4, stronglyTyped) | ~1,700 | ~13 |
| Your retest (batch=128) | ~2,300 | ~18 |
| Our optimized (batch=128, sparsity, 4 streams) | 5,489 | 42.4 |
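As a sanity check, the TOPS numbers above follow directly from throughput. A minimal sketch, assuming ResNet50 v1 at roughly 3.86 GMACs (~7.7 billion INT8 ops) per 224x224 image — the exact op count depends on the ONNX variant, so treat it as an estimate:

```python
# Convert measured throughput (images/sec) to effective TOPS.
# Assumption: ~3.86 GMACs per image, counted as 2 ops (multiply + add) per MAC.
OPS_PER_IMAGE = 2 * 3.86e9  # ~7.72e9 INT8 ops per inference

def effective_tops(images_per_sec: float) -> float:
    """Effective tera-operations per second at a given throughput."""
    return images_per_sec * OPS_PER_IMAGE / 1e12

for label, ips in [("batch=4 original", 1700),
                   ("batch=128 retest", 2300),
                   ("optimized", 5489)]:
    print(f"{label}: {effective_tops(ips):.1f} TOPS")
```

At 5,489 images/sec this works out to ~42.4 TOPS, consistent with the table.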
Why 275 TOPS Is Not Achievable with Any Real Model
The 275 TOPS is a theoretical hardware peak calculated from architecture specifications. It is not a benchmark result and has never been demonstrated with any workload.
The exact calculation
275 TOPS = GPU Tensor Cores (170 TOPS) + 2x DLA engines (105 TOPS)
GPU: 64 Tensor Cores x 1,024 INT8 ops/clock x 1.3 GHz x 2 (sparsity) = 170 TOPS
DLA: 2x NVDLA 2.0 at 1.6 GHz = ~105 TOPS
This assumes:
- 100% tensor core utilization every clock cycle
- Perfect 2:4 structured sparsity across all weights
- Zero time spent on anything other than INT8 multiply-accumulate
- GPU and both DLAs running simultaneously at peak
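The arithmetic behind the datasheet number is easy to reproduce from the breakdown above (the per-unit figures are taken directly from that breakdown, not independently measured):

```python
# Theoretical INT8 peak of Jetson AGX Orin, reproduced from the
# breakdown above: tensor cores * ops/clock * clock * sparsity, plus DLAs.
tensor_cores  = 64
ops_per_clock = 1024     # INT8 ops per tensor core per clock
gpu_clock_hz  = 1.3e9    # MAXN GPU clock
sparsity_gain = 2        # 2:4 structured sparsity doubles throughput

gpu_peak_tops = tensor_cores * ops_per_clock * gpu_clock_hz * sparsity_gain / 1e12
dla_peak_tops = 105      # both NVDLA 2.0 engines combined (approximate)

print(f"GPU sparse peak: {gpu_peak_tops:.1f} TOPS")
print(f"Total with DLA:  {gpu_peak_tops + dla_peak_tops:.1f} TOPS")
```

This lands on ~170 TOPS for the GPU and ~275 TOPS total — but only under the four assumptions listed above, none of which hold for a real model.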
Why real models fall short
| Factor | Impact |
| --- | --- |
| Memory bandwidth | Orin has 204.8 GB/s. The tensor cores can consume data faster than memory can supply it; ResNet50 is memory-bandwidth-bound. |
| Non-tensor-core operations | Pooling, batch norm, element-wise adds, softmax — these don’t use tensor cores. |
| Layer transitions | Data reformats between layers consume time. |
| Sparsity assumption | The 2x sparsity multiplier assumes all weights follow the 2:4 pattern. Dense models (like the original ResNet50 ONNX) get at most 85 TOPS (half of 170). |
| DLA-to-GPU fallback | The DLA cannot run all layers natively. Layers that fall back to the GPU create contention, actually reducing total throughput vs GPU-only. |
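The memory-bandwidth factor can be made concrete with a back-of-the-envelope roofline: how many INT8 ops per byte of DRAM traffic would a kernel need to keep the tensor cores fed at peak? A rough sketch (the 100 ops/byte workload at the end is a hypothetical illustration, not a measurement):

```python
# Roofline sketch: at what arithmetic intensity (INT8 ops per byte of
# DRAM traffic) does Orin's memory system stop being the bottleneck?
PEAK_TOPS     = 170.0   # GPU sparse INT8 peak
BANDWIDTH_GBS = 204.8   # Orin DRAM bandwidth

# Minimum ops/byte needed to sustain the compute peak:
break_even = PEAK_TOPS * 1e12 / (BANDWIDTH_GBS * 1e9)
print(f"break-even intensity: {break_even:.0f} ops/byte")

def bandwidth_bound_tops(ops_per_byte: float) -> float:
    """Attainable TOPS if a workload is limited purely by DRAM bandwidth."""
    return min(PEAK_TOPS, BANDWIDTH_GBS * 1e9 * ops_per_byte / 1e12)

# Hypothetical workload at 100 ops/byte (illustrative only):
print(f"ceiling at 100 ops/byte: {bandwidth_bound_tops(100):.1f} TOPS")
```

The break-even point is roughly 830 ops per byte; any layer whose data reuse falls below that is capped by the 204.8 GB/s memory system, not by the tensor cores.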
What the best synthetic benchmarks achieve
Even the most favorable workload (a CUTLASS sparse INT8 GEMM — a pure matrix multiply, not a real model) achieves only ~99 TOPS, or about 58% of the 170 TOPS GPU sparse peak. NVIDIA’s own internal target for these synthetic kernels is 60–70% of peak.
Realistic expectations for ResNet50
| Metric | Value |
| --- | --- |
| GPU theoretical sparse peak | 170 TOPS |
| GPU theoretical dense peak | 85 TOPS |
| Best synthetic GEMM | ~99 TOPS (58% of sparse peak) |
| ResNet50 INT8 optimized | 42.4 TOPS (50% of dense peak) |
| ResNet50 INT8 unoptimized (batch=4) | ~13 TOPS |
The 42.4 TOPS result represents ~50% of the dense GPU peak (85 TOPS), which is good utilization for a real CNN workload.
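Those utilization percentages are straightforward to verify against the peaks in the table:

```python
# Utilization of the measured result against the theoretical GPU peaks.
MEASURED    = 42.4    # optimized ResNet50 INT8 result, TOPS
DENSE_PEAK  = 85.0    # GPU dense INT8 peak, TOPS
SPARSE_PEAK = 170.0   # GPU sparse INT8 peak, TOPS

print(f"vs dense GPU peak:  {MEASURED / DENSE_PEAK:.0%}")
print(f"vs sparse GPU peak: {MEASURED / SPARSE_PEAK:.0%}")
```

This prints roughly 50% against the dense peak and 25% against the sparse peak.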
Summary
- Your hardware is fine. The Orin is performing as expected.
- Batch size matters most. Batch=4 leaves the tensor cores mostly idle. Use batch=128+.
- Use all the trtexec optimizations listed above to go from ~13 to ~42 TOPS.
- 275 TOPS is a calculated hardware spec, like a car’s top speed — useful for comparison, but not achievable on the road.
- 42 TOPS on ResNet50 is solid — it’s ~50% of the dense GPU peak, ~25% of the GPU’s theoretical sparse peak, and ~15% of the 275 TOPS figure that includes the DLAs.
Hope this helps clarify. Happy to answer follow-up questions.