How concat weights for cudnnMultiHeadAttnForward dw

  1. How can I concat weights for cudnnMultiHeadAttnForward dw?
    I get weights by tensorflow tf.get_variable(h, w)
    Can I do like the below?
    concat (wq.flaten, wk.flaten, wv.flaten, wo.flaten)

  2. Why " The cudnnMultiHeadAttnBackwardData() function must be invoked after
    cudnnMultiHeadAttnForward()" ? Forward has long latency.

mlperf1.0 bert did not use cudnnMultiHeadAttnForward /backward ? Do you have plan to optimize it? training_results_v1.0/NVIDIA/benchmarks/bert/implementations/pytorch/mhalib at master · mlcommons/training_results_v1.0 · GitHub

Hi @ding9801 ,
Please allow us sometime to get back to you on this.
Thank you for your patience.

Thanks.
Look forward to your update.

Can you give the update?
No user uses this API ?

@ding9801: Please download cuDNN samples. The “multiHeadAttention” sample shows how to access MHA weights. See the saveAllParams() function. The weight layout is the same in fwd and bwd calls.

The “multiHeadAttention” sample also shows how to invoke cudnnMultiHeadAttnForward(), cudnnMultiHeadAttnBackwardData(), and cudnnMultiHeadAttnBackwardWeights() API-s. The sample code has also a simple reference model in “attn_ref.py”.

Yes, we have plans to further optimize this API implementation.

Can you give the weights layout more details? I found wq, wk, wv is the same, but wo is different.
wq, wk, wv: seq1-head1, seq1-head2, seq1-head3, … seq2-head1, seq2-head2, seq2-head3…
wo: seq1-head1, seq2-head1, seq3-head1, …, seq1-head2, seq2-head2, seq3-head2…
Is it right?

Can you answer my second question? ----Why " The cudnnMultiHeadAttnBackwardData() function must be invoked after cudnnMultiHeadAttnForward()" ?

The multi-head attention API was designed in such a way that the weight layout is not fixed. The user needs to invoke cudnnGetMultiHeadAttnWeights() and obtain the layout of each group of weights. That way, the cuDNN library can select a weight layout that is optimal for the given GPU or math type. In the multi-head attention API we decided to be more flexible in comparison to the RNN API and report tensor dimensions with strides. That way the weight layout can change.

The cuDNN RNN API calls may change the weight layout internally as a GEMM (GEneral Matrix Multiply) speed optimization. In the multi-head attention API we wanted to eliminate this step (that could be performed in every API invocation) and instead communicate to the user that a particular weight layout would be used. So any on-the-fly weight transpose/padding could be avoided.

It is correct that the layout of output projection weights is different from other weights. Currently, output projection weights have the column-major layout and are concatenated horizontally as, for example, in the horzcat() function of MATLAB. Other weights have the column-major layout and are concatenated vertically as in the vertcat() function. Please remember that this layout choice is not fixed.

There could be small, unused memory gaps between groups of weights to guarantee better data alignment.

We had requests to support other special layouts, such as concatenating wq, wk, wv (with no gaps) whenever it makes sense. This is not implemented but we may support it via the attnMode argument of cudnnSetAttnDescriptor().

Why " The cudnnMultiHeadAttnBackwardData() function must be invoked after cudnnMultiHeadAttnForward()" ?

All cuDNN API-s follow the same design pattern. First, you need to invoke the “forward” function. Next, you need to call “backward” functions. In the “inference” mode, the “backward” API-s are not used. In the “training” mode, the “forward” call may save some intermediate results in the “reserve” buffer. Those results are consumed by the two “backward” calls: “backward data” and “backward weights”. So the sequence of calls is: (1) “forward” API, (2) “backward data”, (3) “backward weights”.

What does the “reserve” buffer save?
In tensorflow ops.RegisterGradient, backward op can get gradient, op forward input/output, attribute. What else does cudnnMultiHeadAttnBackward need?
I think cudnnMultiHeadAttnBackward is designed for training. Is there suggestion for tensorflow op or pytorch layer development?

The cuDNN library offers a lower level API that does not perform memory management. The caller needs to allocate and supply all input and output buffers. Some API-s need temporary storage just for the duration of one call. In the cuDNN nomenclature we call this storage a work-space buffer. Moreover, there could be a need for one more buffer type to exchange data between “forward”, “backward data”, and “backward weights” calls. We call this buffer, the reserve-space buffer. The type of information stored in the reserve-space buffer is dictated by back-propagation math of a particular DL model.

Yes, cudnnMultiHeadAttnBackwardWeights() is designed for training to compute exact, first order, partial derivatives of the error function with respect to all trainable model parameters. It is possible to automatically differentiate “forward” code based on the forward operation graph. cuDNN does not use this solution. The 'backward data” and “backward weights” routines are hand coded and optimized.

Thanks for your reply.

I find the performance is bad.

I add the timer as the below:

For IS_FORWARD == 1

double start = seconds();
cudnnMultiHeadAttnForward
cudaDeviceSynchronize();
double stop = seconds();

For backward, IS_FORWARD == 0

double start = seconds();
cudnnMultiHeadAttnForward
cudnnMultiHeadAttnBackwardData
cudnnMultiHeadAttnBackwardWeights
cudaDeviceSynchronize();
double stop = seconds();

duration = stop - start;

The result:

IS_FORWARD 1, Elapsed time = 0.000691891 sec
IS_FORWARD 0, Elapsed time = 0.262995 sec
IS_FORWARD 1, Elapsed time = 0.000668049 sec
IS_FORWARD 0, Elapsed time = 0.262635 sec
IS_FORWARD 1, Elapsed time = 0.000663996 sec
IS_FORWARD 0, Elapsed time = 0.263343 sec
IS_FORWARD 1, Elapsed time = 0.000674963 sec
IS_FORWARD 0, Elapsed time = 0.263066 sec
IS_FORWARD 1, Elapsed time = 0.000671148 sec
IS_FORWARD 0, Elapsed time = 0.262862 sec
IS_FORWARD 1, Elapsed time = 0.000673056 sec
IS_FORWARD 0, Elapsed time = 0.262764 sec
IS_FORWARD 1, Elapsed time = 0.000664949 sec
IS_FORWARD 0, Elapsed time = 0.262851 sec
IS_FORWARD 1, Elapsed time = 0.000669003 sec
IS_FORWARD 0, Elapsed time = 0.262641 sec
IS_FORWARD 1, Elapsed time = 0.000679016 sec
IS_FORWARD 0, Elapsed time = 0.262704 sec
IS_FORWARD 1, Elapsed time = 0.000658989 sec
IS_FORWARD 0, Elapsed time = 0.262794 sec
IS_FORWARD 1, Elapsed time = 0.000673056 sec
IS_FORWARD 0, Elapsed time = 0.263157 sec
IS_FORWARD 1, Elapsed time = 0.000689983 sec
IS_FORWARD 0, Elapsed time = 0.262993 sec
IS_FORWARD 1, Elapsed time = 0.000663996 sec
IS_FORWARD 0, Elapsed time = 0.262549 sec

The duration of backward is beyond 260ms.

Config:

####attnDataType    = 0 (FP32)
#### attnNumHeads    = 16
#### attnBatchSize   = 1
#### attnBeamSize    = 1
#### attnSmScaler    = 1.0000e+00
#### attnDropoutRate = 0.0000
#### attnQsize       = 1024
#### attnKsize       = 1024
#### attnVsize       = 1024
#### attnProjQsize   = 64
#### attnProjKsize   = 64
#### attnProjVsize   = 64
#### attnProjOsize   = 1024
#### attnSeqLenQ     = 384
#### attnSeqLenK     = 384
#### attnDataLayout  = 0 (T,N,B,V)
#### attnResLink     = 0
#### attnSweep       = 0
#### attnRandGeom    = 0
#### attnRandSeed    = 1234
#### attnFileDump    = 0

Any suggestion to improve the performance?

You can reproduce the performance issue by cudnn multiHeadAttention sample. The below 3 APIs need 262ms. Is it reasonable?

cudnnMultiHeadAttnForward
cudnnMultiHeadAttnBackwardData
cudnnMultiHeadAttnBackwardWeights

I have implemented tensorflow custom op with this cudnn API and replaced Bert multi-head-attention. The default Bert training throughput is 13 samples/sec, after replacing, the throughput is only 2.7 samples/sec.

Hi @ding9801 thanks for your interest in cuDNN MHA!

Yes, the backward pass is not well optimized yet, as previously we were focusing on the forward inference use cases. We have engineers starting to work on the back prop right now, and we hope to deliver a much better optimized backward pass in the next few public releases.