But in my test, I got about 15 TFLOPS.
I set the power mode with `sudo nvpmodel -m 0`.
This is the code I used to test:
```python
import torch
from torch.utils import benchmark

# float16 GEMM benchmark
typ = torch.float16
n = 1024 * 4
a = torch.randn(n, n).type(typ).cuda()
b = torch.randn(n, n).type(typ).cuda()
t = benchmark.Timer(
    stmt='a @ b',
    globals={'a': a, 'b': b})
x = t.timeit(50)
# 2*n^3 floating-point ops per GEMM / median time -> TFLOPS
print('float16,', 2 * n ** 3 / x.median / 1e12)

# float32 GEMM benchmark
typ = torch.float32
n = 1024 * 4
a = torch.randn(n, n).type(typ).cuda()
b = torch.randn(n, n).type(typ).cuda()
t = benchmark.Timer(
    stmt='a @ b',
    globals={'a': a, 'b': b})
x = t.timeit(50)
print('float32,', 2 * n ** 3 / x.median / 1e12)
```
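For reference, the TFLOPS figure printed above follows from the standard GEMM operation count: multiplying two n×n matrices performs 2n³ floating-point operations (n³ multiplies plus n³ adds), so throughput is 2n³ divided by the median runtime. A quick sanity check of the arithmetic, using a hypothetical measured median of 9 ms:

```python
# Sanity-check the TFLOPS arithmetic used in the benchmark above.
n = 1024 * 4
flops = 2 * n ** 3        # multiplies + adds in one n x n GEMM
median_s = 0.009          # hypothetical measured median, in seconds
tflops = flops / median_s / 1e12
print(f"{tflops:.2f} TFLOPS")  # prints 15.27 TFLOPS for this timing
```

So a median of roughly 9 ms per 4096×4096 matmul corresponds to the ~15 TFLOPS reported above.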
If these suggestions don't help and you want to report an issue to us, please attach the model, the commands/steps, and the customized app (if any) so we can reproduce it locally.
Hi @carolyuu,
Thanks, I have already run `sudo nvpmodel -m 0` and `sudo jetson_clocks`.
The problem right now is that the performance I measured is higher than the peak performance claimed in the documentation.
Hi @AastaLLL, thank you very much! I ran the test and measured Math: 3708.05 GFLOP/s, which aligns with the documentation. However, this number cannot explain why PyTorch runs faster than the peak performance.

To use PyTorch on Orin, we installed a special build from (Installing PyTorch for Jetson Platform - NVIDIA Docs). Could the high performance I observed with the code I shared above be because this PyTorch build not only utilizes the GPU but also offloads work to the DLAs, resulting in a much higher effective peak? Or does PyTorch for Jetson use a different precision internally, even though I set it to FP32? How can I check and verify this?
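One thing worth checking regarding precision: Orin's GPU is Ampere-class, and on Ampere PyTorch can execute float32 matmuls in TF32 on the tensor cores, which is considerably faster than true FP32 on the CUDA cores and could make a float32 GEMM appear to exceed the documented FP32 peak. Whether TF32 is enabled by default depends on the PyTorch version, but the flag below (a standard PyTorch API) lets you inspect and disable it as a sketch of how to verify:

```python
import torch

# On Ampere-class GPUs, PyTorch may run float32 matmuls as TF32 on
# tensor cores; the default for this flag varies by PyTorch version.
print('TF32 for matmul allowed:', torch.backends.cuda.matmul.allow_tf32)

# Force true FP32 GEMMs, then re-run the float32 benchmark to compare.
torch.backends.cuda.matmul.allow_tf32 = False
print('TF32 for matmul allowed:', torch.backends.cuda.matmul.allow_tf32)
```

If the float32 TFLOPS number drops sharply after disabling the flag, TF32 (not the DLAs) is the likely explanation for the above-peak result.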
Thanks for your quick reply; I am looking forward to hearing from you.
As this is part of the PyTorch implementation, please check with the PyTorch team to see how they implement the function.
For the system part, you can run `tegrastats` to check whether the GPU/DLA is being used.