Hi,
cuDNN also supports MHA and delivers excellent performance. However, getting the MHA engine through the C API
requires constructing a large operation graph, which I find somewhat complex and error-prone. The front-end API may be more convenient, but building a large graph may still cost significant CPU time. Why not just add an MHA Operation Descriptor?
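
To illustrate what I mean by a big graph, here is a rough NumPy sketch (not cuDNN code) of the core attention computation written as a chain of separate ops. The function name and the exact op split are my own assumptions for illustration; a real cuDNN graph may split things differently and would likely also need masking, dropout, and projection ops.

```python
import numpy as np

def attention_as_separate_ops(q, k, v):
    """Scaled dot-product attention as a chain of individual ops.

    Hypothetical illustration only: each step stands in for one node
    that would have to be described and wired up by hand in a graph.
    Shapes: q, k, v are (batch, heads, seq_len, head_dim).
    """
    d = q.shape[-1]
    s = q @ np.swapaxes(k, -1, -2)           # op 1: batched matmul Q x K^T
    s = s * (1.0 / np.sqrt(d))               # op 2: pointwise scale
    s = s - s.max(axis=-1, keepdims=True)    # op 3: row max + subtract (numerical stability)
    e = np.exp(s)                            # op 4: pointwise exp
    p = e / e.sum(axis=-1, keepdims=True)    # op 5: row sum + divide (softmax)
    return p @ v                             # op 6: batched matmul P x V

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4, 8, 16))
k = rng.standard_normal((2, 4, 8, 16))
v = rng.standard_normal((2, 4, 8, 16))
out = attention_as_separate_ops(q, k, v)
print(out.shape)
```

Even this simplified version has about six nodes plus the intermediate tensors between them, which is why a single MHA Operation Descriptor seems attractive to me.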