I’m trying to use ncu to benchmark some applications with respect to their Tensor Core usage (the devices I’m using are a 3080 and an A100).

For simple scenarios where I’m performing matrix multiplication with known values for M, N, and K, I can calculate the number of FLOPs from these values, and with the execution time I can calculate the performance.

But in other scenarios, such as benchmarking a more complex application that may contain several different matmuls and other TC operations, I’d like to use a more general query through ncu. In the same way one can use sm__inst_executed_pipe_fma to obtain the number of FMA instructions and then multiply by two to obtain the executed FLOPs, I’d like to know whether there is an analogous strategy for Tensor Cores.
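As a small sketch of the FMA-based arithmetic described above (this assumes the collected metric yields a thread-level FMA count, as the post implies; the sample values are illustrative, not measured):

```python
# FLOPs from an FMA count: each fused multiply-add is one multiply plus
# one add, i.e. 2 FLOPs. Counter value and runtime below are placeholders.
def fma_to_flops(fma_count: int) -> int:
    return 2 * fma_count

def flops_per_second(flops: int, elapsed_s: float) -> float:
    return flops / elapsed_s

total_flops = fma_to_flops(50_000_000_000)  # hypothetical counter value
print(flops_per_second(total_flops, 0.01))  # hypothetical 10 ms run
```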

I’ve seen that there is the sm__inst_executed_pipe_tensor query, but since different tensor instructions produce different FLOP counts (different shapes, e.g. m16n8k4 and m16n8k8 for TF32), it does not seem to be a viable way to properly calculate the FLOPs executed on the TCs without knowing which shape is being executed.

I’d love if someone could please give some advice on this matter. Thank you.

I guess to get a first-order approximation, I would look at the percent utilization of the relevant pipe (in “Compute Throughput Breakdown”) as reported in the GPU SOL section from ncu, and then multiply that by the device’s peak theoretical performance for that pipe.
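Spelled out, that first-order estimate is just utilization times peak (the numbers below are placeholders, not measurements):

```python
# First-order performance estimate: pipe utilization (percent of peak, from
# the SOL "Compute Throughput Breakdown") times that pipe's peak throughput.
def estimate_flops_per_s(pct_of_peak: float, peak_flops_per_s: float) -> float:
    return (pct_of_peak / 100.0) * peak_flops_per_s

# Placeholder inputs, just to show the arithmetic (result in TFLOP/s):
print(estimate_flops_per_s(49.91, 29.8e12) / 1e12)
```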

Thank you for your contribution!
I’ve tried your suggestion and compared it with my well-known values and it seems to differ quite significantly.

In my tests, when I execute the code under ncu, I obtain a performance of 25 TFLOP/s, which is roughly 85% of the peak of my 3080. However, sm__pipe_tensor_cycles_elapsed.avg.pct_of_peak_sustained_elapsed reports a utilization of 49.91%, which is much lower than the performance I’m actually achieving. Any idea why this might be?

Do you have the 3080 variant with 68 or with 70 SMs? What is your clock speed in MHz (base/boost)? Sparse or dense matrices? Do you count additions and multiplications separately (which doubles the FLOP count)? For the 3080 (unlike the A100), when using FP16 A and B input matrices the performance also depends on whether you accumulate in FP32 or FP16 precision.

TF32 m16n8k8!
I arrived at the 25 TF by computing the FLOPs from the dimensions of the matrices in the matmul (M, N, K) and dividing by the execution time.

In reality, it’s a bit more complicated, because every thread is running the same PTX instruction mma.m16n8k8 for a number of ITERATIONS, so in the end fma = (uint64_t)M * N * K * ITERATIONS * (THREADS_PER_BLOCK / 32) * NUM_BLOCKS.
And then flops = 2 * fma.
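That count can be written out as a small helper (the launch parameters below are hypothetical, purely to exercise the formula):

```python
# Total tensor-core FLOPs for a kernel where every warp issues the same
# mma.mMnNkK instruction ITERATIONS times, following the formula above.
def tensor_flops(m, n, k, iterations, threads_per_block, num_blocks):
    warps = (threads_per_block // 32) * num_blocks
    fma = m * n * k * iterations * warps  # one warp-wide mma = m*n*k FMAs
    return 2 * fma                        # each FMA = 2 FLOPs

# Hypothetical launch configuration:
print(tensor_flops(16, 8, 8, iterations=100_000,
                   threads_per_block=256, num_blocks=1024))
```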

I’ve laid out my FLOP calculations in my previous answer, and I’m confident they are correct. I do have the 68-SM 3080, and thank you for raising an interesting point that I completely missed.

When executing under ncu, it locks the clock at 1440 MHz, which is how I obtained the 25 TFLOP/s; that matches the expected maximum attainable value at that speed. In all other testing I do (without ncu), I use 1710 MHz and achieve 29.8 TF!
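A quick sanity check of the clock dependence (peak tensor throughput scales linearly with SM clock, so scaling the 1710 MHz peak to ncu’s locked 1440 MHz should land near the 25 TF observed under the profiler):

```python
# Scale the boost-clock TF32 peak from the post down to ncu's locked clock.
peak_at_1710 = 29.8e12             # FLOP/s at 1710 MHz (figure from the post)
scaled = peak_at_1710 * 1440 / 1710
print(round(scaled / 1e12, 1))     # -> 25.1 (TFLOP/s)
```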

So in the case I pointed out, I’m actually pretty much fully utilizing the GPU and still the pipe only indicated 49.9% usage.

Can you confirm that you used the dense variant? m16n8k8 exists as both dense and sparse.

For FP32 summation (which is used for TF32), the consumer compute capability 8.6 Tensor Cores have half the speed of the professional compute capability 8.6 Tensor Cores.

Based on that, here is just a guess at the reason for your results:

Perhaps the shown utilization does not account for this fact. Or the (input) pipe runs at full speed, but a later unit is slower, so the (input) pipe is only utilized 50%? Perhaps Robert knows?

Can you try FP16 with FP16 accumulation (which has no difference between consumer and professional GPUs) or try the A100 with TF32?

Can you paste in a picture from the Nsight Compute GUI showing the compute throughput breakdown in the SOL section?

Before taking the screenshot, hover your mouse over the title of the top item in the Pareto list, so we can see how that pipe’s throughput is computed, as well as the reported throughput.

I hope this is what you meant! This time I ran slightly fewer iterations just so that the profiling would be quicker, but it should still be extremely close to the top expected performance.

@Curefab yes, dense! I have the same code in FP16 with FP16 accumulation, and ran that, and you might be on to something, because now the tensor pipe indicated 97.86%! The performance achieved when running at 1710 MHz was 116.51 TF, which is 97.9% of the maximum throughput of 119 TF at this clock!
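For reference, the percentage quoted here checks out against the figures in the post:

```python
# Achieved vs. peak FP16 (FP16-accumulate) throughput at 1710 MHz,
# using the numbers reported above.
achieved = 116.51e12
peak = 119e12
print(round(100 * achieved / peak, 1))  # -> 97.9
```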

The following is Turing, which has no difference between consumer and professional Tensor Core performance, but it also reports pipe utilization that is too low by a factor of 2 or 4 (50% and 25%), which could be related: Why Low Tensor Pipe Utilization

It certainly seems like it’s based on Nsight Compute’s reporting. The link Curefab posted may be the answer, or you could post your question on the Nsight Compute forum.

There was an issue (at one point?) with the stored rooflines for Tensor Cores on different GPU architectures (basically the rooflines were only correct for GV100 Volta GPUs):

or

Those detailed counters could help you calculate exact FLOPs (but only if you know some details of your instructions; it is not enough to deduce the FLOPs of fully unknown code):
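If the instruction shape is known, the conversion those counters enable looks something like this (assuming the counter reports warp-level mma instructions; the actual metric names are in the links above):

```python
# FLOPs from a per-shape tensor instruction count: each warp-level
# mma.mMnNkK instruction performs m*n*k FMAs, i.e. 2*m*n*k FLOPs.
def mma_flops(m: int, n: int, k: int, warp_inst_count: int) -> int:
    return 2 * m * n * k * warp_inst_count

print(mma_flops(16, 8, 8, 1))  # one TF32 m16n8k8 instruction -> 2048 FLOPs
```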

Amazing content, thank you so much! It didn’t occur to me that it would be more appropriate to search in the ncu forums (I’m fairly new to these forums), but thank you and @Robert_Crovella for the help you have provided.

The bottom three links you provided seem to be the answer to my problem. I don’t have those metrics available, but I’m updating my version of ncu to see whether it makes them available (I have the 2022.4 version).