What is the peak FP32 performance of Orin AGX?

Hi all,
What is the peak FP32 performance of Orin AGX?

In the technical brief I see a figure of 5.3 TFLOPS: https://www.nvidia.com/content/dam/en-zz/Solutions/gtcf21/jetson-orin/nvidia-jetson-agx-orin-technical-brief.pdf

But in my tests I measured about 15 TFLOPS for FP32.
I ran sudo nvpmodel -m 0 and sudo jetson_clocks before benchmarking.
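For reference, the 5.3 TFLOPS figure lines up with a back-of-envelope estimate, assuming the commonly quoted Orin AGX GPU specs (2048 CUDA cores at a ~1.3 GHz max clock, one FMA = 2 FLOPs per core per cycle):

```python
# Rough CUDA-core FP32 peak for Orin AGX (assumed specs: 2048 cores, ~1.3 GHz).
cuda_cores = 2048
clock_hz = 1.3e9
flops_per_cycle = 2  # one fused multiply-add per core per cycle
peak_tflops = cuda_cores * clock_hz * flops_per_cycle / 1e12
print(f"{peak_tflops:.2f} TFLOPS")  # ~5.32
```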

I used two different scripts to test this:

import torch
from torch.utils import benchmark

typ = torch.float16
n = 1024 * 4
a = torch.randn(n, n).type(typ).cuda()
b = torch.randn(n, n).type(typ).cuda()

t = benchmark.Timer(
      stmt='a @ b',
      globals={'a': a, 'b': b})

x = t.timeit(50)
print('float16 ,',2*n**3 / x.median /1e12)


typ = torch.float32
n = 1024 * 4
a = torch.randn(n, n).type(typ).cuda()
b = torch.randn(n, n).type(typ).cuda()

t = benchmark.Timer(
      stmt='a @ b',
      globals={'a': a, 'b': b})

x = t.timeit(50)
print('float32 ,',2*n**3 / x.median /1e12)

I got:
root@veagxorin-desktop:~# python peaktorch.py
float16 , 22.82726626095605
float32 , 14.871669421596716
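One thing I suspect (an assumption on my part, not confirmed): on Ampere-class GPUs like Orin's, PyTorch can execute float32 matmuls with TF32 tensor-core math, which is much faster than pure FP32 CUDA-core math and would inflate this number. A quick sketch to check the setting and disable it before re-timing:

```python
import torch

# Whether float32 matmuls may use TF32 tensor cores
# (the default depends on the PyTorch version).
print("allow_tf32 (matmul):", torch.backends.cuda.matmul.allow_tf32)
print("allow_tf32 (cudnn): ", torch.backends.cudnn.allow_tf32)

# Force pure FP32 math before re-running the benchmark above.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```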

I also tried this code

import torch
import time

def measure_tflops(device, dtype, size=4096, iterations=50):
    # Create random matrices
    a = torch.rand(size, size, device=device, dtype=dtype)
    b = torch.rand(size, size, device=device, dtype=dtype)

    # Warm up
    for _ in range(10):
        torch.mm(a, b)

    torch.cuda.synchronize(device)

    # Benchmark
    start_time = time.time()
    for _ in range(iterations):
        torch.mm(a, b)
    torch.cuda.synchronize(device)
    end_time = time.time()

    # Calculate TFLOPS
    elapsed_time = end_time - start_time
    operations_per_matrix_multiplication = 2 * size ** 3  # 2 * n^3 FLOPs for matrix multiplication
    total_operations = operations_per_matrix_multiplication * iterations
    tflops = (total_operations / elapsed_time) / 1e12

    return tflops

# Set the device to GPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dtype = torch.float32  # note: PyTorch may run float32 matmuls in TF32 on Ampere GPUs

# Measure TFLOPS
tflops = measure_tflops(device, dtype)
print(f"Peak TFLOPS: {tflops}")

And I got

root@veagxorin-desktop:~# python peak.py
Peak TFLOPS: 12.962248670484703
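As an aside, time.time() plus synchronize works, but CUDA events measure elapsed time on the device side, which avoids host-side jitter. A sketch of that variant (the helper name time_matmul_ms is my own, not from any library):

```python
import torch

def time_matmul_ms(a, b, iterations=50):
    """Time `iterations` matmuls using CUDA events; returns milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iterations):
        torch.mm(a, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

if torch.cuda.is_available():
    n, iters = 4096, 50
    a = torch.rand(n, n, device="cuda", dtype=torch.float32)
    b = torch.rand(n, n, device="cuda", dtype=torch.float32)
    ms = time_matmul_ms(a, b, iters)
    print(f"{2 * n**3 * iters / (ms / 1e3) / 1e12:.2f} TFLOPS")
else:
    print("CUDA not available; skipping event-based timing.")
```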

@Jetson AGX Orin TOPs / CUDA Cores Explained - #8 by xuyi19920502
At the end of that post, they also measured FP16 at 20+ TFLOPS, well above the theoretical peak.
Is there any possible explanation for this? Thanks!

(This post is a repost of Why I get much higher TFLOPS in Orin AGX than what claimed in the document - #4 by df23r23r; it seems I originally posted in the wrong category. Thanks!)
