Tensorcore roofline

I am profiling a single A100

Nsight Compute GUI version 2023.2.2.0 shows Floating Point Roofline charts for Double, Single, and Half Precision (I assume these are for the CUDA cores).

Then it also shows a ‘Floating Point Operations Roofline (Tensor Core)’.
Since its peak compute throughput shows 116 TFLOPS, this roofline seems to measure TF32 performance (156 TFLOPS on the spec sheet).

Looking around the forum:

‘Roofline Tensor Core should be half but not float?’ (Developer Tools / Nsight Compute, NVIDIA Developer Forums) suggests that the Tensor Core roofline metric supports only FP16 on GV100.

‘Question about Roofline of TensorCore GEMM’ (Developer Tools / Nsight Compute, NVIDIA Developer Forums) suggests that the Tensor Core roofline only supports the GV100 architecture.

Then my questions are:

  1. Is Nsight Compute showing the accurate roofline for A100 Tensor Cores?
  2. If so, is it displaying the roofline in TF32? If that is the case, should I double the reported FLOPS when my kernel uses the FP16 dtype?
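For context on question 2, here is a minimal sketch of the classic roofline model, assuming the A100 spec-sheet peaks (dense, no sparsity): 156 TFLOPS for TF32 and 312 TFLOPS for FP16 Tensor Core math, with ~1.555 TB/s HBM2 bandwidth. The variable names and the example arithmetic intensity are my own, not anything Nsight Compute reports; the point is only that a 2x peak matters when the kernel is compute-bound, not when it is memory-bound.

```python
# Roofline model sketch (not Nsight Compute's internal metric).
# Assumed A100 spec-sheet peaks, dense (no sparsity):
PEAK_TF32 = 156e12    # TF32 Tensor Core peak, FLOP/s
PEAK_FP16 = 312e12    # FP16 Tensor Core peak, FLOP/s (2x TF32)
HBM_BW    = 1.555e12  # HBM2 bandwidth, bytes/s

def attainable(peak_flops, bandwidth, arithmetic_intensity):
    """Classic roofline: attainable FLOP/s = min(peak, BW * AI)."""
    return min(peak_flops, bandwidth * arithmetic_intensity)

# Hypothetical compute-bound kernel, AI = 150 FLOP/byte:
ai = 150.0
print(attainable(PEAK_TF32, HBM_BW, ai) / 1e12)  # capped at the TF32 peak, 156 TFLOPS
print(attainable(PEAK_FP16, HBM_BW, ai) / 1e12)  # still below the FP16 peak

# Hypothetical memory-bound kernel, AI = 50 FLOP/byte:
# both precisions hit the same bandwidth ceiling, so the 2x peak is irrelevant.
print(attainable(PEAK_TF32, HBM_BW, 50.0) == attainable(PEAK_FP16, HBM_BW, 50.0))
```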

Hi, @dhjoo982

We recently released 2024.3.0 with more Tensor Core roofline support for all chips.
Please get the latest build to check.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.