Hi all,
What is the peak FP32 performance of Orin AGX?
The technical brief lists 5.3 TFLOPS: https://www.nvidia.com/content/dam/en-zz/Solutions/gtcf21/jetson-orin/nvidia-jetson-agx-orin-technical-brief.pdf
But in my own tests I measured about 15 TFLOPS.
Before testing, I ran sudo nvpmodel -m 0 and sudo jetson_clocks.
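For context, here is a rough sketch of how a datasheet figure like 5.3 TFLOPS is usually derived: CUDA cores times clock times 2 FLOPs per cycle (one fused multiply-add). The core count (2048) and maximum GPU clock (1.3 GHz) are my assumptions based on the public AGX Orin 64GB specification, so please check them against the brief for your module.

```python
# Back-of-the-envelope FP32 peak for CUDA cores only (no tensor cores).
# ASSUMPTIONS (not stated in this post): 2048 CUDA cores and a 1.3 GHz
# maximum GPU clock, as published for the AGX Orin 64GB module.
CUDA_CORES = 2048
MAX_GPU_CLOCK_HZ = 1.3e9
FLOPS_PER_CORE_PER_CYCLE = 2  # one fused multiply-add = 1 multiply + 1 add


def peak_fp32_tflops(cores, clock_hz, flops_per_cycle=FLOPS_PER_CORE_PER_CYCLE):
    """Theoretical peak in TFLOPS: cores * clock * FLOPs per core per cycle."""
    return cores * clock_hz * flops_per_cycle / 1e12


peak = peak_fp32_tflops(CUDA_CORES, MAX_GPU_CLOCK_HZ)
print(f"Theoretical FP32 peak (CUDA cores only): {peak:.2f} TFLOPS")
```

Under these assumptions the result rounds to the 5.3 TFLOPS in the brief, which suggests that figure counts CUDA cores only.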
I tested with two different scripts. The first one:
import torch
from torch.utils import benchmark

# FP16 matmul of two 4096 x 4096 matrices
typ = torch.float16
n = 1024 * 4
a = torch.randn(n, n).type(typ).cuda()
b = torch.randn(n, n).type(typ).cuda()
t = benchmark.Timer(
    stmt='a @ b',
    globals={'a': a, 'b': b})
x = t.timeit(50)
# A matmul of two n x n matrices costs 2 * n^3 FLOPs
print('float16 ,', 2 * n**3 / x.median / 1e12)

# Same measurement in FP32
typ = torch.float32
n = 1024 * 4
a = torch.randn(n, n).type(typ).cuda()
b = torch.randn(n, n).type(typ).cuda()
t = benchmark.Timer(
    stmt='a @ b',
    globals={'a': a, 'b': b})
x = t.timeit(50)
print('float32 ,', 2 * n**3 / x.median / 1e12)
I got:
root@veagxorin-desktop:~# python peaktorch.py
float16 , 22.82726626095605
float32 , 14.871669421596716
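As a sanity check on these numbers, the small sketch below inverts the 2 * n^3 formula to get the time per matmul implied by each result, and compares the float32 result with the 5.3 TFLOPS from the brief. The helper names are only for illustration.

```python
# Invert TFLOPS = 2 * n^3 / seconds / 1e12 to see what the results imply.
N = 4096
FLOPS_PER_MATMUL = 2 * N**3  # multiply-adds counted as 2 FLOPs each
DATASHEET_FP32_TFLOPS = 5.3  # figure quoted from the technical brief


def seconds_per_matmul(tflops):
    """Time for one N x N matmul implied by a measured TFLOPS value."""
    return FLOPS_PER_MATMUL / (tflops * 1e12)


def ratio_to_datasheet(tflops, datasheet=DATASHEET_FP32_TFLOPS):
    """How many times higher the measurement is than the datasheet figure."""
    return tflops / datasheet


measured = {"float16": 22.82726626095605, "float32": 14.871669421596716}
for label, tflops in measured.items():
    print(f"{label}: {seconds_per_matmul(tflops) * 1e3:.2f} ms per matmul")

fp32_ratio = ratio_to_datasheet(measured["float32"])
print(f"float32 measurement is {fp32_ratio:.2f}x the datasheet FP32 figure")
```

So the float32 result is roughly 2.8 times the documented peak, which is too large to be measurement noise.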
I also tried this second script:
import torch
import time

def measure_tflops(device, dtype, size=4096, iterations=50):
    # Create random matrices
    a = torch.rand(size, size, device=device, dtype=dtype)
    b = torch.rand(size, size, device=device, dtype=dtype)
    # Warm up
    for _ in range(10):
        torch.mm(a, b)
    torch.cuda.synchronize(device)
    # Benchmark: wait for all queued kernels to finish before stopping the clock
    start_time = time.time()
    for _ in range(iterations):
        torch.mm(a, b)
    torch.cuda.synchronize(device)
    end_time = time.time()
    # Calculate TFLOPS
    elapsed_time = end_time - start_time
    operations_per_matrix_multiplication = 2 * size ** 3  # 2 * n^3 FLOPs for matrix multiplication
    total_operations = operations_per_matrix_multiplication * iterations
    tflops = (total_operations / elapsed_time) / 1e12
    return tflops

# Set the device to GPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dtype = torch.float32  # FP32 tensors; on GPUs that support it, PyTorch may run FP32 matmuls as TF32
# Measure TFLOPS
tflops = measure_tflops(device, dtype)
print(f"Peak TFLOPS: {tflops}")
And I got:
root@veagxorin-desktop:~# python peak.py
Peak TFLOPS: 12.962248670484703
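The TF32 comment in the script above points at one thing that may be worth ruling out: whether the float32 matmul is actually running as TF32 on the tensor cores. Below is a minimal sketch that times the same matmul with torch.backends.cuda.matmul.allow_tf32 off and on. I have not confirmed that this explains the numbers. The default for this flag depends on the PyTorch version and build, and the helper names are made up for illustration.

```python
import time


def tflops(n, seconds):
    """TFLOPS for one n x n matmul (2 * n^3 FLOPs) that took `seconds`."""
    return 2 * n**3 / seconds / 1e12


def time_fp32_matmul(allow_tf32, n=4096, iterations=50):
    """Average seconds per FP32 matmul with the TF32 flag set as requested."""
    import torch  # imported lazily so tflops() works without PyTorch

    previous = torch.backends.cuda.matmul.allow_tf32
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32
    try:
        a = torch.randn(n, n, device="cuda", dtype=torch.float32)
        b = torch.randn(n, n, device="cuda", dtype=torch.float32)
        for _ in range(10):  # warm-up
            torch.mm(a, b)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iterations):
            torch.mm(a, b)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iterations
    finally:
        torch.backends.cuda.matmul.allow_tf32 = previous  # restore the setting


def main():
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; nothing to measure.")
        return
    if not torch.cuda.is_available():
        print("No CUDA device available; nothing to measure.")
        return
    for flag in (False, True):
        seconds = time_fp32_matmul(flag)
        print(f"allow_tf32={flag}: {tflops(4096, seconds):.2f} TFLOPS")


if __name__ == "__main__":
    main()
```

If the allow_tf32=False result drops to around the documented 5.3 TFLOPS, that would suggest the higher numbers come from the tensor cores and not from plain FP32 on the CUDA cores.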
At the end of the thread "Jetson AGX Orin TOPs / CUDA Cores Explained" (#8 by xuyi19920502), the measured FP16 throughput was also reported as 20+ TFLOPS, again well above the theoretical peak.
Is there a possible explanation for this? Thanks!
(This is a repost of "Why I get much higher TFLOPS in Orin AGX than what claimed in the document" (#4 by df23r23r), since I seem to have posted it in the wrong category. Thanks!)