I am profiling a single A100 GPU.
Nsight Compute GUI version 2023.2.2.0 shows a Floating Point Roofline for Double, Single, and Half Precision (I assume these are for the CUDA cores).
It also shows a 'Floating Point Operations Roofline (Tensor Core)'.
Since its peak compute throughput is reported as 116 TFLOPS, this roofline appears to measure TF32 performance (the spec sheet lists 156 TFLOPS).
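For context on the 116 vs. 156 TFLOPS gap: if the roofline ceiling were derived from the GPU's clock at profiling time rather than the boost clock used on the spec sheet, the numbers would roughly line up. A quick sanity check (this is my own sketch from the A100 whitepaper figures of 156 TF32 TFLOPS dense at a 1410 MHz boost clock with 108 SMs; the per-SM-per-clock throughput is derived from those, not a value confirmed by Nsight Compute):

```python
# Sketch: check whether a lower clock could explain a 116 TFLOPS TF32 ceiling.
# Assumes the A100 whitepaper figures: 156 TF32 / 312 FP16 dense TFLOPS
# at a 1410 MHz boost clock, 108 SMs. Per-clock throughputs below are
# back-calculated from those numbers (an assumption, not an NVIDIA metric).

SMS = 108                    # SM count on a full A100
BOOST_CLOCK_HZ = 1.41e9      # boost clock behind the spec-sheet peaks

TF32_FLOP_PER_SM_CLK = 1024  # 156e12 / (108 * 1.41e9), dense
FP16_FLOP_PER_SM_CLK = 2048  # 312e12 / (108 * 1.41e9), dense

def tensor_peak_tflops(flop_per_sm_clk: int, clock_hz: float) -> float:
    """Theoretical Tensor Core peak in TFLOPS at a given clock."""
    return SMS * flop_per_sm_clk * clock_hz / 1e12

# Spec-sheet TF32 peak reproduced at boost clock:
print(tensor_peak_tflops(TF32_FLOP_PER_SM_CLK, BOOST_CLOCK_HZ))  # ~155.9

# Clock that would yield the 116 TFLOPS ceiling Nsight reports:
implied_clock_mhz = 116e12 / (SMS * TF32_FLOP_PER_SM_CLK) / 1e6
print(implied_clock_mhz)  # ~1049 MHz
```

So a ceiling of 116 TFLOPS is consistent with a TF32 roofline evaluated at roughly a 1.05 GHz clock, which also shows the FP16 dense ceiling would sit at exactly twice that value at the same clock.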
Looking around the forum, I found two related threads:
- "Roofline Tensor Core should be half but not float?" (Developer Tools / Nsight Compute - NVIDIA Developer Forums) suggests that the Tensor Core roofline metric supports only FP16 on GV100.
- "Question about Roofline of TensorCore GEMM" (Developer Tools / Nsight Compute - NVIDIA Developer Forums) suggests that the Tensor Core roofline supports only the GV100 architecture.
Then my questions are:
- Is Nsight Compute showing the accurate roofline for A100 Tensor Cores?
- If so, is it displaying the roofline for TF32? In that case, should I double the FLOPS ceiling when my kernels use the FP16 dtype?