I’m a bit confused about the theoretical peak performance reported for sparse and dense Tensor Core operations. A lot of literature states that sparse Tensor Core (FP16) performance is double the dense Tensor Core performance. As an example, consider Table 6 in https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf, which lists 512 FP16 FMA operations per SM per clock in the dense case vs. 1024 in the sparse case. I’m trying to understand this doubling and what it actually means.
Assume I’m doing an M16N16K8 FP16 MMA operation on the Tensor Cores. In the dense case this requires calculating 16 × 16 × 8 = 2048 FMAs. Given the theoretical peak above, it would take me 2048 / 512 = 4 cycles to execute the MMA.
If A is now pruned with the 2:4 sparsity pattern, half of A’s entries are 0, so I only need to calculate 1024 FMA operations to get the final result. According to blog posts like Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT | NVIDIA Technical Blog, the latency is halved. So in the end I would still end up with a rate of 1024 / (4/2) = 512 FMA/cycle/SM, the same as in the dense case and different from the reported 1024 FMA/cycle/SM.
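To make my confusion concrete, here is the arithmetic I have in mind, written out as a small sketch (the rates are just the whitepaper numbers quoted above; the variable names are mine):

```python
# The arithmetic behind the example above (rates taken from the whitepaper table).
m, n, k = 16, 16, 8
dense_fmas = m * n * k                       # 2048 FMAs actually computed
dense_rate = 512                             # FMA/cycle/SM, dense FP16
dense_cycles = dense_fmas / dense_rate       # 4 cycles

sparse_fmas = dense_fmas // 2                # 1024: 2:4 sparsity in A skips half the products
sparse_cycles = dense_cycles / 2             # the blog post says the latency is halved
executed_rate = sparse_fmas / sparse_cycles  # 512 FMA/cycle/SM actually executed
nominal_rate = dense_fmas / sparse_cycles    # 1024 only if the skipped zero products are counted

print(dense_cycles, executed_rate, nominal_rate)   # 4.0 512.0 1024.0
```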
So my questions are:
How come the reported theoretical peak performance in the sparse case is 1024 FMA/cycle/SM?
Is it possible to physically execute 1024 FMA operations per SM per cycle on my RTX3090, or does this number include the 512 FMA operations that we save because of the sparsity in A and which are never actually executed?
What is the maximum number of FP16 FMA operations that I can physically execute on my RTX3090 GPU per SM per cycle, 512 or 1024?
Does the sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active metric in Nsight Compute report the achieved sustained rate with respect to the sparse FP16 MMA peak performance? I have not managed to get beyond 50% using only dense FP16 MMA.
Because it is counting the multiply-by-zero operations, even though they are not actually performed in hardware.
The 1024 number includes the 512 ops we save because of the sparsity in A and which are never actually executed.
Using the thought process we have now established, it is 512 actual multiplications by non-zero elements.
Profiler-specific questions might be better asked on the Nsight Compute forum; however, this topic has come up a few times and I believe you can find descriptions in various posts with a bit of searching.
Each MMA instruction would (probably) have the same latency in the dense and sparse cases.
The overall computation would need half the time, as the multiplications with 0 can be saved.
Nothing special or surprising about latency: the low-level (per-instruction) latency is the same, while the high-level latency improves, as the computation is done at double the speed.
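As a minimal sketch of what I mean by pipelining, with purely illustrative numbers (the issue interval and the number of warps in flight are assumptions, not measured values):

```python
# Why per-instruction latency does not determine sustained throughput.
# All numbers below are illustrative assumptions, not measured values.
latency_cycles = 24            # cycles until one MMA instruction's result is ready
issue_interval = 16            # assumed cycles between back-to-back MMAs from one warp
fmas_per_inst = 16 * 8 * 16    # dense m16n8k16 -> 2048 FMAs per instruction
warps_in_flight = 4            # e.g. one warp per SM sub-partition, all pipelined

# Sustained throughput is set by the issue interval, not by the latency:
sustained_fma_per_cycle_per_sm = warps_in_flight * fmas_per_inst / issue_interval
print(sustained_fma_per_cycle_per_sm)   # 512.0
```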
Agreed. I thought about trying to dissect this statement also:
but it's kind of nitpicky given that the thrust of the question doesn't seem to be wrapped up in that statement. However, I don't think I would agree that you can infer the latency cycles of a low-level op that way. Latency cannot necessarily be inferred from throughput statements/specs.
Let me try:
Here are some third-party (academic) measured throughput and latency numbers for Ampere:
fp16.fp16.fp16 dense m16n8k16 on A100 and RTX3070Ti: Latency 24.4 and 24.0 cycles
fp16.fp16.fp16 sparse m16n8k32 on A100 and RTX3070Ti: Latency 24.3 and 24.3 cycles
To compare the same effort for the Tensor Cores, the k dimension has to be doubled in the sparse case.
The latency is the same between dense and sparse (but much higher than 4 cycles, as the computations are pipelined).
The FMA throughput numbers approximately double between dense and sparse (in the paper they also count the multiplications with 0, plus the addition of the result, as FMAs).
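To make the counting explicit, here is a small sketch of how the per-instruction FMA counts line up with those measurements (the sparse count includes the multiplications by zero, as in the paper):

```python
# Per-instruction FMA counts for the shapes measured above.
dense_fmas_per_inst = 16 * 8 * 16    # m16n8k16 -> 2048 FMAs, all actually executed
sparse_fmas_per_inst = 16 * 8 * 32   # m16n8k32 -> 4096 nominal FMAs, ~2048 executed

# Latency is roughly the same for both (about 24 cycles in the measurements above),
# but the sparse instruction covers twice the k extent, so the nominal per-instruction
# work, and with it the nominal throughput, doubles:
print(sparse_fmas_per_inst / dense_fmas_per_inst)   # 2.0
```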
Thank you for pointing this out. Yes, I agree that you cannot infer the latency from the throughput numbers in that way. My main interest was indeed to know whether the reported sparse peak performance includes the saved multiply-by-zero operations which are never actually executed in hardware.
To be fair, there are lots of cases with at least 50% sparsity (e.g. convolution with a sliding window, pruned neural networks, and many others), and it is possible to reorder the sparse dimension to distribute the 0 coefficients.
So the marketing people from NVIDIA should be forgiven for overstating the actual number of operations.
PS
Also be careful, when reading performance numbers, with FLOPS vs. FMA: there you get another factor of 2x. And with /SM vs. /Tensor Core: another 4x.
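As a sketch of those conversion factors (assuming a GA10x part such as the RTX 3090, which has 4 Tensor Cores per SM):

```python
# Converting between the units that show up in different documents.
fma_per_clk_per_sm = 512                                 # dense FP16, per SM (whitepaper number)
flops_per_clk_per_sm = fma_per_clk_per_sm * 2            # 1 FMA = 2 FLOPs -> 1024
fma_per_clk_per_tensor_core = fma_per_clk_per_sm // 4    # 4 Tensor Cores per SM -> 128
print(flops_per_clk_per_sm, fma_per_clk_per_tensor_core) # 1024 128
```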