Multi-head attention performance

According to this paper (*), “cuDNN’s performance is orders of magnitude worse” compared to e.g. PyTorch or TensorFlow. This statement refers to version 7.6.5 and is hopefully outdated, as major improvements were announced with cuDNN 8.3.0. Are there any current benchmarks or comparisons of multi-head attention performance with version 8.3 or later?

(*) http://www.unixer.de/publications/img/data_movement_is_all_you_need.pdf

Hi,

I don't think we have recent benchmarks for multi-head attention. We will check and get back to you if any are available.
For the latest changes in cuDNN, please refer to the release notes.

Thank you.