Performant FMHA kernels in cuDNN

Hello,
In the GTC24 talk S62457, the slide below describes the FMHA kernels implemented through the cuDNN backend API as more performant than flash-attention-based implementations in some cases. Could you elaborate on the HW/SW setup details, and what is your plan to further improve this part of cuDNN? We are evaluating whether we need to switch to cuDNN for FMHA in the future. Thanks!

This may also be of interest.

Hi Rob,
After profiling the cuDNN FMHA kernels on Hopper, I found that they already use HGMMA SASS instructions. Can we say that for Ampere, open-source FlashAttention-2 is already good enough, but for Hopper we can currently try cuDNN for better performance?

Thanks
Gino

> Can we say that for Ampere, open-source FlashAttention-2 is already good enough, but for Hopper we can currently try cuDNN for better performance?

https://docs.nvidia.com/deeplearning/cudnn/latest/release-notes.html#cudnn-9-0-0

  • FP16 and BF16 fused flash attention engine performance has been significantly improved for NVIDIA GPUs:
    • Speed-up of up to 50% over cuDNN 8.9.7 on Hopper GPUs.
    • Speed-up of up to 100% over cuDNN 8.9.7 on Ampere GPUs.

At this time, cuDNN should be as fast as or faster than FlashAttention-2 on Ampere, and faster on Hopper.
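
If you want to check this on your own hardware, below is a minimal sketch (not from the talk or the release notes) of how one might time the two paths through PyTorch's SDPA dispatcher instead of calling the cuDNN backend API directly. It assumes a recent PyTorch build that exposes SDPBackend.CUDNN_ATTENTION, plus a cuDNN 9.x install; the shapes and the timing loop are illustrative only.

```python
# Hedged sketch: timing the cuDNN vs. FlashAttention SDPA backends in PyTorch.
# Assumes a recent PyTorch with SDPBackend.CUDNN_ATTENTION available and
# cuDNN 9.x installed; shapes are illustrative, not taken from the GTC talk.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

b, h, s, d = 4, 16, 4096, 128  # batch, heads, sequence length, head dim
q, k, v = (torch.randn(b, h, s, d, device="cuda", dtype=torch.float16)
           for _ in range(3))

def time_backend(backend, iters=10):
    # Restrict SDPA to a single backend so we know which kernel actually runs.
    with sdpa_kernel(backend):
        F.scaled_dot_product_attention(q, k, v, is_causal=True)  # warm-up
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters  # average ms per call

print("cuDNN attention: %.3f ms" % time_backend(SDPBackend.CUDNN_ATTENTION))
print("FlashAttention:  %.3f ms" % time_backend(SDPBackend.FLASH_ATTENTION))
```

Running the same script under Nsight Systems or Nsight Compute also makes it easy to confirm which kernels are actually launched (for example, whether the Hopper kernels use HGMMA SASS, as Gino observed).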