Performant FMHA kernels in cuDNN

Hello,
In the GTC24 talk S62457, the slide below describes the FMHA kernels implemented through the cuDNN backend API as more performant than flash-attention-based implementations in some cases. Could you elaborate on the HW/SW setup details, and what is your plan to further improve this part of cuDNN? We are evaluating whether we need to switch to cuDNN for FMHA in the future. Thanks!

This may also be of interest.

Hi Rob,
After profiling the cuDNN FMHA kernels on Hopper, I found that they already use HGMMA SASS instructions. Can we say that for Ampere, open-source FlashAttention-2 is already good enough, but for Hopper we can currently try cuDNN for better performance?

Thanks
Gino

> Can we say that for Ampere, open-source FlashAttention-2 is already good enough, but for Hopper we can currently try cuDNN for better performance?

https://docs.nvidia.com/deeplearning/cudnn/latest/release-notes.html#cudnn-9-0-0

  • FP16 and BF16 fused flash attention engine performance has been significantly improved for NVIDIA GPUs:
    • Speed-up of up to 50% over cuDNN 8.9.7 on Hopper GPUs.
    • Speed-up of up to 100% over cuDNN 8.9.7 on Ampere GPUs.

At this time, cuDNN should be as fast as or faster than FlashAttention-2 on Ampere, and faster on Hopper.
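
If you want to check this on your own hardware, below is a minimal sketch (not from the talk or the release notes) of how one might time the two paths through PyTorch's SDPA dispatcher instead of calling the cuDNN backend API directly. It assumes a recent PyTorch build that exposes SDPBackend.CUDNN_ATTENTION, plus a cuDNN 9.x install; the shapes and the timing loop are illustrative only.

```python
# Hedged sketch: timing the cuDNN vs. FlashAttention SDPA backends in PyTorch.
# Assumes a recent PyTorch with SDPBackend.CUDNN_ATTENTION available and
# cuDNN 9.x installed; shapes are illustrative, not taken from the GTC talk.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

b, h, s, d = 4, 16, 4096, 128  # batch, heads, sequence length, head dim
q, k, v = (torch.randn(b, h, s, d, device="cuda", dtype=torch.float16)
           for _ in range(3))

def time_backend(backend, iters=10):
    # Restrict SDPA to a single backend so we know which kernel actually runs.
    with sdpa_kernel(backend):
        F.scaled_dot_product_attention(q, k, v, is_causal=True)  # warm-up
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters  # average ms per call

print("cuDNN attention: %.3f ms" % time_backend(SDPBackend.CUDNN_ATTENTION))
print("FlashAttention:  %.3f ms" % time_backend(SDPBackend.FLASH_ATTENTION))
```

Running the same script under Nsight Systems or Nsight Compute also makes it easy to confirm which kernels are actually launched (for example, whether the Hopper kernels use HGMMA SASS, as Gino observed).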