Hi,
I have two questions about cudnnSetAttnDescriptor() and the forward, backward-weight, and backward-data attention APIs.
(1) What data layout is required in global memory for the Q, K, and V inputs when the corresponding projection sizes (qProjSize, kProjSize, vProjSize) are set to zero (i.e., the inputs are already the outputs of the corresponding linear layers)? Specifically, is vec_dim expected to be [numHeads*headDim], with headDim contiguous in global memory?
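To make question (1) concrete, here is a minimal sketch of the layout I am assuming, written with NumPy for illustration only. The sizes (num_heads, head_dim, seq_len) are made up, and this is my guess at the layout, not something confirmed by the cuDNN documentation:

```python
import numpy as np

# Hypothetical sizes for illustration only (not from the cuDNN docs).
num_heads, head_dim, seq_len = 4, 8, 3
vec_dim = num_heads * head_dim  # assumed per-token vector dimension

# Assumed layout: one token's feature vector is [numHeads*headDim],
# with head 0's headDim values first, then head 1's, etc.
# (headDim is the fastest-varying, contiguous dimension.)
q = np.arange(seq_len * vec_dim, dtype=np.float32).reshape(seq_len, vec_dim)

# Under this assumption, splitting heads is a plain reshape, no transpose:
q_heads = q.reshape(seq_len, num_heads, head_dim)

# Head h of token t would occupy q[t, h*head_dim:(h+1)*head_dim].
assert np.array_equal(q_heads[0, 1], q[0, head_dim:2 * head_dim])
```

Is this interleaved-heads layout what the attention APIs expect, or is some other ordering (e.g. all of head 0's tokens first) required?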
(2) Why must oSize equal numHeads*vSize when the linear layers for both V and O are disabled? This requirement seems to consume more global memory than standard attention does (numHeads times more, to be precise).
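The arithmetic behind my concern in (2), with made-up sizes; this reflects my reading of the requirement, not documented cuDNN behavior:

```python
# Hypothetical sizes for illustration only.
num_heads, v_size = 8, 64

# With the V and O projections disabled, each head appears to emit a full
# vSize-wide context vector, so the concatenated output per token is:
o_size_required = num_heads * v_size  # 512 values per token

# In a standard multi-head attention the per-head width is vSize/numHeads,
# so the concatenated output per token is just vSize:
o_size_standard = v_size  # 64 values per token

ratio = o_size_required // o_size_standard
print(ratio)  # numHeads times more output memory per token
```

Is there a way to avoid this blow-up, or is it inherent to running the API without the projections?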