What is the peak FP32 performance of Orin AGX?

Hi all,
What is the peak FP32 performance of Orin AGX?

In the technical brief I see a figure of 5.3 TFLOPS: https://www.nvidia.com/content/dam/en-zz/Solutions/gtcf21/jetson-orin/nvidia-jetson-agx-orin-technical-brief.pdf

But in my tests I measured about 15 TFLOPS for FP32.
I ran sudo nvpmodel -m 0 and sudo jetson_clocks before benchmarking.
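For reference, the 5.3 TFLOPS figure lines up with a back-of-envelope estimate, assuming the commonly quoted Orin AGX GPU specs (2048 CUDA cores at a ~1.3 GHz max clock, one FMA = 2 FLOPs per core per cycle):

```python
# Rough CUDA-core FP32 peak for Orin AGX (assumed specs: 2048 cores, ~1.3 GHz).
cuda_cores = 2048
clock_hz = 1.3e9
flops_per_cycle = 2  # one fused multiply-add per core per cycle
peak_tflops = cuda_cores * clock_hz * flops_per_cycle / 1e12
print(f"{peak_tflops:.2f} TFLOPS")  # ~5.32
```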

I used two different scripts to test this:

import torch
from torch.utils import benchmark

typ = torch.float16
n = 1024 * 4
a = torch.randn(n, n).type(typ).cuda()
b = torch.randn(n, n).type(typ).cuda()

t = benchmark.Timer(
      stmt='a @ b',
      globals={'a': a, 'b': b})

x = t.timeit(50)
print('float16 ,',2*n**3 / x.median /1e12)


typ = torch.float32
n = 1024 * 4
a = torch.randn(n, n).type(typ).cuda()
b = torch.randn(n, n).type(typ).cuda()

t = benchmark.Timer(
      stmt='a @ b',
      globals={'a': a, 'b': b})

x = t.timeit(50)
print('float32 ,',2*n**3 / x.median /1e12)

I got:
root@veagxorin-desktop:~# python peaktorch.py
float16 , 22.82726626095605
float32 , 14.871669421596716
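One thing I suspect (an assumption on my part, not confirmed): on Ampere-class GPUs like Orin's, PyTorch can execute float32 matmuls with TF32 tensor-core math, which is much faster than pure FP32 CUDA-core math and would inflate this number. A quick sketch to check the setting and disable it before re-timing:

```python
import torch

# Whether float32 matmuls may use TF32 tensor cores
# (the default depends on the PyTorch version).
print("allow_tf32 (matmul):", torch.backends.cuda.matmul.allow_tf32)
print("allow_tf32 (cudnn): ", torch.backends.cudnn.allow_tf32)

# Force pure FP32 math before re-running the benchmark above.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```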

I also tried this code

import torch
import time

def measure_tflops(device, dtype, size=4096, iterations=50):
    # Create random matrices
    a = torch.rand(size, size, device=device, dtype=dtype)
    b = torch.rand(size, size, device=device, dtype=dtype)

    # Warm up
    for _ in range(10):
        torch.mm(a, b)

    torch.cuda.synchronize(device)

    # Benchmark
    start_time = time.time()
    for _ in range(iterations):
        torch.mm(a, b)
    torch.cuda.synchronize(device)
    end_time = time.time()

    # Calculate TFLOPS
    elapsed_time = end_time - start_time
    operations_per_matrix_multiplication = 2 * size ** 3  # 2 * n^3 FLOPs for matrix multiplication
    total_operations = operations_per_matrix_multiplication * iterations
    tflops = (total_operations / elapsed_time) / 1e12

    return tflops

# Set the device to GPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dtype = torch.float32  # note: PyTorch may run float32 matmuls in TF32 on Ampere GPUs

# Measure TFLOPS
tflops = measure_tflops(device, dtype)
print(f"Peak TFLOPS: {tflops}")

And I got

root@veagxorin-desktop:~# python peak.py
Peak TFLOPS: 12.962248670484703
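As an aside, time.time() plus synchronize works, but CUDA events measure elapsed time on the device side, which avoids host-side jitter. A sketch of that variant (the helper name time_matmul_ms is my own, not from any library):

```python
import torch

def time_matmul_ms(a, b, iterations=50):
    """Time `iterations` matmuls using CUDA events; returns milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iterations):
        torch.mm(a, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

if torch.cuda.is_available():
    n, iters = 4096, 50
    a = torch.rand(n, n, device="cuda", dtype=torch.float32)
    b = torch.rand(n, n, device="cuda", dtype=torch.float32)
    ms = time_matmul_ms(a, b, iters)
    print(f"{2 * n**3 * iters / (ms / 1e3) / 1e12:.2f} TFLOPS")
else:
    print("CUDA not available; skipping event-based timing.")
```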

@Jetson AGX Orin TOPs / CUDA Cores Explained - #8 by xuyi19920502
At the end of that post, they also measured FP16 at 20+ TFLOPS, well above the theoretical peak.
Is there any possible explanation for this? Thanks!

(This post is a repost of Why I get much higher TFLOPS in Orin AGX than what claimed in the document - #4 by df23r23r; it seems I originally posted in the wrong category. Thanks!)
