Why does the Compute Throughput value differ from the actual Performance / Peak Performance?

Thank you for pointing me to the Roofline Model on NVIDIA GPUs lab. That was very helpful. But I am curious about the "512" factor in the 512 x sm__inst_executed_pipe_tensor.sum FLOP calculation. I think this is specific to the V100, since its h884 instruction performs 512 FLOPs per instruction. For the A100, should the factor instead be 4096 for h16816? Is this correct?
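As a sanity check of my own arithmetic (not from the lab material): each m x n x k MMA instruction does m*n*k multiply-adds, i.e. 2*m*n*k FLOPs, assuming sm__inst_executed_pipe_tensor counts one warp-wide MMA per increment.

```python
def flops_per_mma(m, n, k):
    # An m x n x k matrix-multiply-accumulate performs m*n*k
    # fused multiply-adds = 2*m*n*k floating-point operations.
    return 2 * m * n * k

print(flops_per_mma(8, 8, 16))   # A100 h884-shaped tile for comparison
print(flops_per_mma(8, 8, 4))    # V100 h884  -> 512
print(flops_per_mma(16, 8, 16))  # A100 h16816 -> 4096
```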

Another issue I have is that when I collect metrics for one kernel in cutlass_profiler as shown below, the reported DRAM read is less than the memory needed to store A, B, and C (with m = n = k = 224 in half precision, that is 224^2 elements x 3 matrices x 2 bytes ≈ 294 KB). The output shows only 225 KB. For larger problems, the read is always larger than the matrix storage. So is this a measurement error, an ncu version mismatch, or am I not reading the output correctly (output attached)? Thank you for your help!
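For reference, here is how I computed the 294 KB figure (my own back-of-the-envelope calculation, assuming 2 bytes per FP16 element and no workspace or padding):

```python
def gemm_footprint_bytes(m, n, k, elem_bytes=2):
    # Minimum bytes to hold A (m x k), B (k x n), and C (m x n)
    # for an FP16 GEMM, ignoring any workspace, padding, or re-reads.
    return (m * k + k * n + m * n) * elem_bytes

total = gemm_footprint_bytes(224, 224, 224)
print(total, total / 1024)  # 301056 bytes, 294.0 KiB
```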

I am using NCU Version 2021.2.2.0 (build 30282580) (public-release)
NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7
Using NVIDIA A100-SXM4-80GB GPU

allout.txt (25.1 KB)

sudo /usr/local/cuda/bin/ncu --target-processes all --metrics "sm__cycles_elapsed.avg,sm__cycles_elapsed.avg.per_second,sm__sass_thread_inst_executed_op_ffma_pred_on.sum,sm__sass_thread_inst_executed_op_hfma_pred_on.sum,sm__inst_executed_pipe_tensor.sum,dram__bytes.sum" cutlass_profiler --profiling-iterations=1 --verification-enabled=False --kernels=cutlass_tensorop_h16816gemm_256x128_32x3_nn --m=224 --n=224 --k=224 > allout.txt 2>&1