According to this paper (*), “cuDNN’s performance is orders of magnitude worse” compared to, e.g., PyTorch or TensorFlow. That statement refers to cuDNN 7.6.5 and is hopefully outdated, since major improvements were announced with cuDNN 8.3.0. Are there any current benchmarks or comparisons of multi-head attention performance with cuDNN 8.3 or later?
(*) http://www.unixer.de/publications/img/data_movement_is_all_you_need.pdf