Why do I get much higher TFLOPS on Orin AGX than claimed in the documentation?

Dear all,

What is the peak FP32 performance of Orin AGX?

In the document I saw 5.3 TFLOPS https://www.nvidia.com/content/dam/en-zz/Solutions/gtcf21/jetson-orin/nvidia-jetson-agx-orin-technical-brief.pdf
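(As a rough check of where that 5.3 figure comes from, assuming the 2048 CUDA cores and ~1.3 GHz maximum GPU clock listed for AGX Orin:)

# Back-of-the-envelope FP32 peak from CUDA cores alone
# (assumed specs: 2048 CUDA cores, ~1.3 GHz max GPU clock, 2 FLOPs per FMA)
cuda_cores = 2048
gpu_clock_hz = 1.3e9
print(cuda_cores * 2 * gpu_clock_hz / 1e12)  # ~5.3 TFLOPS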

But in my test, I got about 15 TFLOPS.
I used sudo nvpmodel -m 0 before testing.

This is the code I used to test:

import torch
from torch.utils import benchmark

# FP16 GEMM: 4096 x 4096 matrices on the GPU
typ = torch.float16
n = 1024 * 4
a = torch.randn(n, n).type(typ).cuda()
b = torch.randn(n, n).type(typ).cuda()

t = benchmark.Timer(
    stmt='a @ b',
    globals={'a': a, 'b': b})

# A matmul of two n x n matrices is 2*n^3 FLOPs; x.median is seconds per run,
# so this prints the achieved TFLOPS.
x = t.timeit(50)
print('float16 ,', 2 * n**3 / x.median / 1e12)


# FP32 GEMM: same size
typ = torch.float32
n = 1024 * 4
a = torch.randn(n, n).type(typ).cuda()
b = torch.randn(n, n).type(typ).cuda()

t = benchmark.Timer(
    stmt='a @ b',
    globals={'a': a, 'b': b})

x = t.timeit(50)
print('float32 ,', 2 * n**3 / x.median / 1e12)

I got:
root@veagxorin-desktop:~# python peaktorch.py
float16 , 22.82726626095605
float32 , 14.871669421596716

Hi,
Here are some suggestions for common issues:

1. Performance

Please run the commands below before benchmarking a deep learning use case:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
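
If needed, the currently active power mode can be double-checked with nvpmodel's query option:

$ sudo nvpmodel -q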

2. Installation

Installation guides for deep learning frameworks on Jetson:

3. Tutorial

Deep learning getting-started tutorial:

4. Report issue

If these suggestions don’t help and you want to report an issue to us, please share the model, the commands/steps, and any customized app so that we can reproduce the issue locally.

Thanks!


Hi @carolyuu,
Thanks, I already ran sudo nvpmodel -m 0 and sudo jetson_clocks.
The problem is that the performance I measured is higher than the peak performance claimed in the document.

@Jetson AGX Orin TOPs / CUDA Cores Explained - #8 by xuyi19920502
At the end of this post, they say they also measured FP16 at 20+ TFLOPS.
Is there any possible explanation for that? Thanks!

I also tried this code:

import torch
import time

def measure_tflops(device, dtype, size=4096, iterations=50):
    # Create random matrices
    a = torch.rand(size, size, device=device, dtype=dtype)
    b = torch.rand(size, size, device=device, dtype=dtype)

    # Warm up
    for _ in range(10):
        torch.mm(a, b)

    torch.cuda.synchronize(device)

    # Benchmark
    start_time = time.time()
    for _ in range(iterations):
        torch.mm(a, b)
    torch.cuda.synchronize(device)
    end_time = time.time()

    # Calculate TFLOPS
    elapsed_time = end_time - start_time
    operations_per_matrix_multiplication = 2 * size ** 3  # 2 * n^3 FLOPs for matrix multiplication
    total_operations = operations_per_matrix_multiplication * iterations
    tflops = (total_operations / elapsed_time) / 1e12

    return tflops

# Set the device to GPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dtype = torch.float32  # note: TF32 may be used in place of float32 for matmuls on compatible GPUs

# Measure TFLOPS
tflops = measure_tflops(device, dtype)
print(f"Peak TFLOPS: {tflops}")

And I got:

root@veagxorin-desktop:~# python peak.py
Peak TFLOPS: 12.962248670484703

Hi @carolyuu, this is still way above the 5.3 TFLOPS claimed by the doc.
Thanks for any help!

Hi,

It’s recommended to profile it with our cutlass toolkit:
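
For example, a single-precision GEMM run along the following lines (the path and flags here follow the CUTLASS profiler's typical usage and may differ on your build; see cutlass_profiler --help) prints a summary that includes the achieved Math: GFLOP/s:

$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=4096 --n=4096 --k=4096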

Thanks.


Hi @AastaLLL, thank you very much! With CUTLASS I measured Math: 3708.05 GFLOP/s, which aligns with the documentation. However, this number cannot explain why PyTorch runs faster than the documented peak.

To use PyTorch on Orin, we installed a special build from (Installing PyTorch for Jetson Platform - NVIDIA Docs). Could the high performance I observed with the code I shared above be because this PyTorch build not only utilizes the GPU but also calls the DLAs, resulting in a much higher peak? Or does PyTorch for Jetson use a different precision, even though I set it to FP32? How can I check and verify this?
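One thing I can try, assuming the TF32 backend flags are available in this PyTorch build, is to check whether FP32 matmuls are silently being run with TF32 tensor-core math:

import torch

# Inspect whether PyTorch is allowed to use TF32 tensor-core math for
# FP32 matmuls (defaults vary between PyTorch versions).
print(torch.__version__)
print('matmul allow_tf32:', torch.backends.cuda.matmul.allow_tf32)
print('cudnn allow_tf32:', torch.backends.cudnn.allow_tf32)

# Disable TF32 and re-run the FP32 benchmark to see whether the number
# drops toward the documented CUDA-core FP32 peak.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False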
Thanks for your quick reply; I am looking forward to hearing from you.

Hi,

As it is a PyTorch implementation, please check with the PyTorch team to see how they implement the function.
For the system part, you can run tegrastats to check whether the GPU/DLA is being used.

$ sudo tegrastats

Thanks.