I benchmarked my Jetson Orin Nano devkit with CUTLASS in f16, and the fastest MMA kernel reached about 9124 GFLOP/s. This seems really high. The A100 does 256 flop/cycle per tensor core, so 1024 flop/cycle/SM. Assuming the Orin's SMs do the same and my Jetson is running at 625 MHz, total performance should be around (625e6 Hz) * (8 SMs) * (1024 flop/cycle) = 5.12e12 flop/s. This kind of calculation has worked perfectly for the A100, 3090, and 4090 from what I've seen personally. Is there something I'm missing?
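Sorry, it's 1024 FMA/cycle per SM, and each FMA counts as 2 flops, so that's 2 * 1024 flop/cycle. The peak then comes out to about 10.24e12 flop/s, and the 9124 GFLOP/s I measured makes sense.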
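For anyone who wants to redo the arithmetic, here is a rough Python sketch of the corrected estimate. The function name `peak_flops` and its parameters are made up for illustration; the inputs (625 MHz, 8 SMs, 1024 FMA/cycle/SM, 9124 GFLOP/s measured) are the numbers from this thread, and it assumes the clock stays fixed at that frequency for the whole run.

```python
def peak_flops(clock_hz, num_sms, fma_per_cycle_per_sm):
    """Theoretical peak in flop/s, counting each FMA as 2 flops (multiply + add)."""
    return clock_hz * num_sms * fma_per_cycle_per_sm * 2


# Numbers taken from the posts above.
peak = peak_flops(clock_hz=625e6, num_sms=8, fma_per_cycle_per_sm=1024)
measured = 9124e9  # 9124 GFLOP/s reported by the CUTLASS benchmark

print(f"peak:            {peak / 1e12:.2f} TFLOP/s")
print(f"measured / peak: {measured / peak:.1%}")
```

With these inputs the measured figure lands at roughly 89% of the theoretical peak, which is a plausible efficiency for a tuned kernel rather than something above the hardware limit.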