I’m a bit confused about the theoretical peak performance reported for sparse and dense Tensor Core operations. A lot of literature states that sparse Tensor Core (FP16) performance is double the dense Tensor Core performance. As an example, consider Table 6 in https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf, which lists 512 FP16 FMA operations per SM per clock in the dense case vs. 1024 in the sparse case. I’m trying to understand this doubling and what it actually means.
Assume I’m doing an M16N16K8 FP16 MMA operation on the Tensor Cores. In the dense case this requires calculating 16 × 16 × 8 = 2048 FMAs. Given the theoretical peak above, it would take me 2048 / 512 = 4 cycles to execute the MMA.
If A is now pruned with the 2:4 sparsity pattern, half of A’s entries are 0, so I only need to calculate 1024 FMA operations to get the final result. According to blog posts like Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT | NVIDIA Technical Blog, the latency is halved. So in the end I would still end up with a rate of 1024 / (4/2) = 512 FMA/cycle/SM, the same as in the dense case and different from the reported 1024 FMA/cycle/SM.
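To make my confusion concrete, here is the arithmetic I have in mind, written out as a small sketch (the rates are just the whitepaper numbers quoted above; the variable names are mine):

```python
# The arithmetic behind the example above (rates taken from the whitepaper table).
m, n, k = 16, 16, 8
dense_fmas = m * n * k                       # 2048 FMAs actually computed
dense_rate = 512                             # FMA/cycle/SM, dense FP16
dense_cycles = dense_fmas / dense_rate       # 4 cycles

sparse_fmas = dense_fmas // 2                # 1024: 2:4 sparsity in A skips half the products
sparse_cycles = dense_cycles / 2             # the blog post says the latency is halved
executed_rate = sparse_fmas / sparse_cycles  # 512 FMA/cycle/SM actually executed
nominal_rate = dense_fmas / sparse_cycles    # 1024 only if the skipped zero products are counted

print(dense_cycles, executed_rate, nominal_rate)   # 4.0 512.0 1024.0
```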
So my questions are:
How come the reported theoretical peak performance in the sparse case is 1024 FMA/cycle/SM?
Is it possible to physically execute 1024 FMA operations per SM per cycle on my RTX3090, or does this number include the 512 FMA operations that we save because of the sparsity in A and which are never actually executed?
What is the maximum number of FP16 FMA operations that I can physically execute on my RTX3090 GPU per SM per cycle, 512 or 1024?
Does the sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active metric in Nsight Compute report the achieved sustained rate with respect to the sparse FP16 MMA peak performance? I have not managed to get beyond 50% using only dense FP16 MMA.
Because it is counting the multiply-by-zero operations, even though they are not actually performed in hardware.
The 1024 number includes the 512 ops we save because of the sparsity in A and which are never actually executed.
Using the thought process we have now established, it is 512 actual multiplications by non-zero elements.
Profiler-specific questions might be better asked on the Nsight Compute forum; however, this topic has come up a few times and I believe you can find descriptions in various posts with a bit of searching.
Each MMA instruction would (probably) have the same latency in the dense and sparse cases.
The overall computation would need half the time, as the multiplications with 0 can be saved.
Nothing special or surprising about latency: the low-level (per-instruction) latency is the same, while the high-level latency improves, as the computation is done at double the speed.
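As a minimal sketch of what I mean by pipelining, with purely illustrative numbers (the issue interval and the number of warps in flight are assumptions, not measured values):

```python
# Why per-instruction latency does not determine sustained throughput.
# All numbers below are illustrative assumptions, not measured values.
latency_cycles = 24            # cycles until one MMA instruction's result is ready
issue_interval = 16            # assumed cycles between back-to-back MMAs from one warp
fmas_per_inst = 16 * 8 * 16    # dense m16n8k16 -> 2048 FMAs per instruction
warps_in_flight = 4            # e.g. one warp per SM sub-partition, all pipelined

# Sustained throughput is set by the issue interval, not by the latency:
sustained_fma_per_cycle_per_sm = warps_in_flight * fmas_per_inst / issue_interval
print(sustained_fma_per_cycle_per_sm)   # 512.0
```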
Agreed. I thought about trying to dissect this statement also:
but it's kind of nitpicky given that the thrust of the question doesn't seem to be wrapped up in that statement. However, I don't think I would agree that you can infer the latency cycles of a low-level op that way. Latency cannot necessarily be inferred from throughput statements/specs.
Let me try:
Here are some third-party (academic) measured throughput and latency numbers for Ampere:
fp16.fp16.fp16 dense m16n8k16 on A100 and RTX3070Ti: Latency 24.4 and 24.0 cycles
fp16.fp16.fp16 sparse m16n8k32 on A100 and RTX3070Ti: Latency 24.3 and 24.3 cycles
To compare the same effort for the Tensor Cores, the k dimension has to be doubled in the sparse case.
The latency is the same between dense and sparse (but much higher than 4 cycles, as the computations are pipelined).
The FMA throughput numbers approximately double between dense and sparse (in the paper they also count the multiplications with 0, plus the addition of the result, as FMAs).
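To make the counting explicit, here is a small sketch of how the per-instruction FMA counts line up with those measurements (the sparse count includes the multiplications by zero, as in the paper):

```python
# Per-instruction FMA counts for the shapes measured above.
dense_fmas_per_inst = 16 * 8 * 16    # m16n8k16 -> 2048 FMAs, all actually executed
sparse_fmas_per_inst = 16 * 8 * 32   # m16n8k32 -> 4096 nominal FMAs, ~2048 executed

# Latency is roughly the same for both (about 24 cycles in the measurements above),
# but the sparse instruction covers twice the k extent, so the nominal per-instruction
# work, and with it the nominal throughput, doubles:
print(sparse_fmas_per_inst / dense_fmas_per_inst)   # 2.0
```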
Thank you for pointing this out. Yes, I agree that you cannot infer the latency from the throughput numbers in that way. My main interest was indeed to know whether the reported sparse peak performance includes the saved multiply-by-zero operations which are never actually executed in hardware.
To be fair, there are lots of cases with at least 50% sparsity (e.g. convolution with a sliding window, pruned neural networks, and many others), and it is possible to reorder the sparse dimension to distribute the 0 coefficients.
So the marketing people from NVIDIA should be forgiven for overstating the actual number of operations.
PS
Also be careful, when reading performance numbers, with FLOPS vs. FMA: there you get another factor of 2x. And with /SM vs. /Tensor Core: another 4x.
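As a sketch of those conversion factors (assuming a GA10x part such as the RTX 3090, which has 4 Tensor Cores per SM):

```python
# Converting between the units that show up in different documents.
fma_per_clk_per_sm = 512                                 # dense FP16, per SM (whitepaper number)
flops_per_clk_per_sm = fma_per_clk_per_sm * 2            # 1 FMA = 2 FLOPs -> 1024
fma_per_clk_per_tensor_core = fma_per_clk_per_sm // 4    # 4 Tensor Cores per SM -> 128
print(flops_per_clk_per_sm, fma_per_clk_per_tensor_core) # 1024 128
```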