SeqData and Multi-head Attention

Hi,

I have two questions about cudnnSetAttnDescriptor() and the forward, backward-weight, and backward-data attention APIs.

(1) What Q, K, and V input data layouts are required in global memory when the corresponding Q, K, and V projection sizes are set to zero (i.e. the inputs are already the outputs of the corresponding linear layers)? Specifically, does the vector dimension have size [numHeads*headDim], with headDim contiguous in global memory?
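For concreteness, this is roughly how I describe the Q tensor today, assuming the answer to (1) is that the vector dimension is numHeads*headDim with headDim innermost; batchSize, seqLen, numHeads, and headDim are placeholder names of my own:

```cpp
// Sketch: describe a (batch, sequence, numHeads*headDim) tensor as cuDNN SeqData,
// assuming headDim is contiguous (innermost) in global memory.
#include <cudnn.h>
#include <vector>

cudnnSeqDataDescriptor_t makeSeqDataDesc(int batchSize, int seqLen,
                                         int numHeads, int headDim)
{
    cudnnSeqDataDescriptor_t desc;
    cudnnCreateSeqDataDescriptor(&desc);

    int dimA[CUDNN_SEQDATA_DIM_COUNT];
    dimA[CUDNN_SEQDATA_BEAM_DIM]  = 1;
    dimA[CUDNN_SEQDATA_BATCH_DIM] = batchSize;
    dimA[CUDNN_SEQDATA_TIME_DIM]  = seqLen;
    dimA[CUDNN_SEQDATA_VECT_DIM]  = numHeads * headDim;  // the layout I am asking about

    // axes[] is ordered from outermost to innermost; VECT_DIM must be innermost.
    cudnnSeqDataAxis_t axes[CUDNN_SEQDATA_DIM_COUNT] = {
        CUDNN_SEQDATA_BEAM_DIM,
        CUDNN_SEQDATA_BATCH_DIM,
        CUDNN_SEQDATA_TIME_DIM,
        CUDNN_SEQDATA_VECT_DIM,
    };

    // One sequence length per (batch, beam) entry; all full-length here.
    std::vector<int> seqLengths(batchSize, seqLen);

    cudnnSetSeqDataDescriptor(desc, CUDNN_DATA_FLOAT, CUDNN_SEQDATA_DIM_COUNT,
                              dimA, axes,
                              seqLengths.size(), seqLengths.data(),
                              nullptr /* paddingFill: leave padding untouched */);
    return desc;
}
```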

(2) Why does oSize equal numHeads*vSize when neither the V nor the O linear layer is included? This requirement seems to consume numHeads times more global memory for the output than standard attention does.
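And this is the descriptor configuration behind question (2), with all four projections disabled so that cuDNN should act as multi-head SDPA on already-projected Q/K/V; the attnMode flags, the softmax scaler, and the placeholder names are my own choices:

```cpp
// Sketch: attention descriptor with all four projections disabled
// (qProjSize = kProjSize = vProjSize = oProjSize = 0).
#include <cudnn.h>
#include <cmath>

void setupAttnDesc(cudnnAttnDescriptor_t attnDesc,
                   cudnnDropoutDescriptor_t attnDropout,
                   cudnnDropoutDescriptor_t postDropout,
                   int numHeads, int headDim, int maxSeqLen, int maxBatch)
{
    const int vecSize = numHeads * headDim;  // length of each Q/K/V vector in memory

    cudnnSetAttnDescriptor(
        attnDesc,
        CUDNN_ATTN_QUERYMAP_ALL_TO_ONE | CUDNN_ATTN_DISABLE_PROJ_BIASES,
        numHeads,
        1.0 / std::sqrt(static_cast<double>(headDim)),  // smScaler
        CUDNN_DATA_FLOAT,    // data type
        CUDNN_DATA_FLOAT,    // compute precision
        CUDNN_DEFAULT_MATH,
        attnDropout,
        postDropout,
        vecSize,             // qSize
        vecSize,             // kSize
        vecSize,             // vSize
        0,                   // qProjSize = 0: no Q linear layer
        0,                   // kProjSize = 0: no K linear layer
        0,                   // vProjSize = 0: no V linear layer
        0,                   // oProjSize = 0: no output linear layer
        maxSeqLen,           // qoMaxSeqLength
        maxSeqLen,           // kvMaxSeqLength
        maxBatch,            // maxBatchSize
        1);                  // maxBeamSize
}
```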

Hi @jundaf3,
The link below should help:
https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnSetAttnDescriptor

Thanks

Could you provide a more detailed explanation? I do not see an answer to the first question in the current documentation. I am trying to build an attention implementation where Q, K, and V are already projected into separate heads, i.e. Q, K, and V have shape (batch, sequence, num heads, head dim) or (sequence, batch, num heads, head dim).

I am currently unable to make the cudnnMultiHeadAttnForward call act as a plain SDPA call over multiple heads that also projects the output. An answer to the first question would be very helpful in accomplishing that. My SDPA implementation works when there is one head, but I am not sure how to tell cuDNN that the input is already projected into separate heads; the nHeads parameter in cudnnSetAttnDescriptor does not appear to do that. For reference, the forward call I am attempting looks roughly like the sketch below.
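This is a minimal sketch of that inference-mode call; the descriptors, device buffers, and the weight/workspace sizes (obtained via cudnnGetMultiHeadAttnBuffers) are assumed to be set up elsewhere, and the window arrays simply open attention over the full K/V sequence:

```cpp
// Sketch: inference-mode forward call with the full K/V window open for every
// query position. All descriptors and buffers are assumed to be prepared elsewhere.
#include <cudnn.h>
#include <vector>

void runAttnForward(cudnnHandle_t handle, cudnnAttnDescriptor_t attnDesc,
                    cudnnSeqDataDescriptor_t qDesc, const void* devQ,
                    cudnnSeqDataDescriptor_t kDesc, const void* devK,
                    cudnnSeqDataDescriptor_t vDesc, const void* devV,
                    cudnnSeqDataDescriptor_t oDesc, void* devO,
                    const int* devSeqLenQO, const int* devSeqLenKV,
                    int qoMaxSeqLen, int kvMaxSeqLen,
                    size_t weightBytes, const void* devWeights,
                    size_t workBytes, void* devWorkspace)
{
    // Attention window per Q time step: [loWin[i], hiWin[i]) over the K/V sequence.
    std::vector<int> loWin(qoMaxSeqLen, 0);
    std::vector<int> hiWin(qoMaxSeqLen, kvMaxSeqLen);

    cudnnMultiHeadAttnForward(
        handle, attnDesc,
        -1,                          // currIdx < 0: process all Q time steps
        loWin.data(), hiWin.data(),
        devSeqLenQO, devSeqLenKV,    // per-sequence lengths, in device memory
        qDesc, devQ,
        nullptr,                     // residuals: not used here
        kDesc, devK,
        vDesc, devV,
        oDesc, devO,
        weightBytes, devWeights,     // sized/allocated per cudnnGetMultiHeadAttnBuffers
        workBytes, devWorkspace,
        0, nullptr);                 // no reserve space: inference mode
}
```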