How to concatenate weights for cudnnMultiHeadAttnForward dw

ding9801 · July 5, 2021, 2:46am

How can I concatenate the weights for cudnnMultiHeadAttnForward dw?
I get the weights from TensorFlow with tf.get_variable(h, w).
Can I do it like below?
concat(wq.flatten, wk.flatten, wv.flatten, wo.flatten)
Why "The cudnnMultiHeadAttnBackwardData() function must be invoked after cudnnMultiHeadAttnForward()"? Forward has long latency.

The MLPerf 1.0 BERT submission does not seem to use cudnnMultiHeadAttnForward/Backward. Do you have plans to optimize it? See: training_results_v1.0/NVIDIA/benchmarks/bert/implementations/pytorch/mhalib at master · mlcommons/training_results_v1.0 · GitHub

AakankshaS · July 5, 2021, 5:22pm

Hi @ding9801 ,
Please allow us some time to get back to you on this.
Thank you for your patience.

ding9801 · July 6, 2021, 1:33pm

Thanks.
Look forward to your update.

ding9801 · July 8, 2021, 1:45am

Can you give an update?
Does no one use this API?

sstepniewski · July 8, 2021, 6:55pm

@ding9801: Please download cuDNN samples. The “multiHeadAttention” sample shows how to access MHA weights. See the saveAllParams() function. The weight layout is the same in fwd and bwd calls.

sstepniewski · July 8, 2021, 7:01pm

The “multiHeadAttention” sample also shows how to invoke the cudnnMultiHeadAttnForward(), cudnnMultiHeadAttnBackwardData(), and cudnnMultiHeadAttnBackwardWeights() APIs. The sample code also has a simple reference model in “attn_ref.py”.

sstepniewski · July 8, 2021, 7:03pm

Yes, we have plans to further optimize this API implementation.

ding9801 · July 9, 2021, 1:52am

Can you give more details on the weight layout? I found that wq, wk, and wv share the same layout, but wo is different.
wq, wk, wv: seq1-head1, seq1-head2, seq1-head3, … seq2-head1, seq2-head2, seq2-head3…
wo: seq1-head1, seq2-head1, seq3-head1, …, seq1-head2, seq2-head2, seq3-head2…
Is that right?

Can you also answer my second question? Why "The cudnnMultiHeadAttnBackwardData() function must be invoked after cudnnMultiHeadAttnForward()"?

sstepniewski · July 9, 2021, 5:53pm

The multi-head attention API was designed in such a way that the weight layout is not fixed. The user needs to invoke cudnnGetMultiHeadAttnWeights() and obtain the layout of each group of weights. That way, the cuDNN library can select a weight layout that is optimal for the given GPU or math type. In the multi-head attention API we decided to be more flexible in comparison to the RNN API and report tensor dimensions with strides. That way the weight layout can change.
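
To make the "dimensions with strides" idea concrete, here is a minimal Python sketch of how a caller might place dense framework weights into a strided buffer. The helper names `strided_offset` and `scatter_to_layout`, and the example dims and strides, are hypothetical. In real code the dims and strides would be read (for example with cudnnGetTensorNdDescriptor()) from the descriptor that cudnnGetMultiHeadAttnWeights() fills in.

```python
from itertools import product


def strided_offset(index, strides):
    """Linear offset of a multi-dimensional index under the given strides."""
    return sum(i * s for i, s in zip(index, strides))


def scatter_to_layout(values, dims, strides, buffer, base=0):
    """Copy a nested-list tensor into a flat buffer using reported strides.

    dims and strides stand in for what a tensor descriptor would report;
    base is the element offset of this weight group inside the buffer.
    """
    for index in product(*(range(d) for d in dims)):
        element = values
        for i in index:
            element = element[i]
        buffer[base + strided_offset(index, strides)] = element
    return buffer


# Hypothetical example: a 2x3 matrix stored column-major (strides 1 and 2).
dense = [[1, 2, 3],
         [4, 5, 6]]
flat = scatter_to_layout(dense, dims=(2, 3), strides=(1, 2), buffer=[0] * 6)
print(flat)  # [1, 4, 2, 5, 3, 6]
```

Because the offsets come from the reported strides, a copy routine written this way should keep working if the library selects a different layout, which a hard-coded concat of flattened arrays would not.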

The cuDNN RNN API calls may change the weight layout internally as a GEMM (GEneral Matrix Multiply) speed optimization. In the multi-head attention API we wanted to eliminate this step (that could be performed in every API invocation) and instead communicate to the user that a particular weight layout would be used. So any on-the-fly weight transpose/padding could be avoided.

It is correct that the layout of output projection weights is different from other weights. Currently, output projection weights have the column-major layout and are concatenated horizontally as, for example, in the horzcat() function of MATLAB. Other weights have the column-major layout and are concatenated vertically as in the vertcat() function. Please remember that this layout choice is not fixed.
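
The difference between the two concatenation styles can be seen on toy data. The following is a minimal Python sketch, not cuDNN code: the helper names and the 2x2 per-head shapes are made up for illustration, and each entry is a label such as `h1r0c1` (head 1, row 0, column 1) so the resulting memory order is easy to read.

```python
def make_head(h, rows, cols):
    """Build a labeled rows x cols matrix for head h (toy data)."""
    return [[f"h{h}r{r}c{c}" for c in range(cols)] for r in range(rows)]


def vertcat(blocks):
    """Stack equally wide matrices on top of each other."""
    return [list(row) for block in blocks for row in block]


def horzcat(blocks):
    """Place equally tall matrices side by side."""
    return [[x for block in blocks for x in block[r]]
            for r in range(len(blocks[0]))]


def flatten_column_major(matrix):
    """Walk a matrix column by column, as a column-major buffer stores it."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(cols) for r in range(rows)]


heads = [make_head(h, rows=2, cols=2) for h in range(2)]
vert_flat = flatten_column_major(vertcat(heads))
horz_flat = flatten_column_major(horzcat(heads))
print(vert_flat)
print(horz_flat)
```

In the vertical case the heads interleave within every column, while in the horizontal case each head occupies one contiguous run. That is consistent with the difference between wq/wk/wv and wo observed earlier in the thread.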

There could be small, unused memory gaps between groups of weights to guarantee better data alignment.

We had requests to support other special layouts, such as concatenating wq, wk, wv (with no gaps) whenever it makes sense. This is not implemented but we may support it via the attnMode argument of cudnnSetAttnDescriptor().

sstepniewski · July 9, 2021, 6:00pm

Why " The cudnnMultiHeadAttnBackwardData() function must be invoked after cudnnMultiHeadAttnForward()" ?

All cuDNN APIs follow the same design pattern. First, you need to invoke the “forward” function. Next, you need to call the “backward” functions. In the “inference” mode, the “backward” APIs are not used. In the “training” mode, the “forward” call may save some intermediate results in the “reserve” buffer. Those results are consumed by the two “backward” calls: “backward data” and “backward weights”. So the sequence of calls is: (1) “forward” API, (2) “backward data”, (3) “backward weights”.

ding9801 · July 12, 2021, 3:17am

What does the “reserve” buffer save?
In TensorFlow's ops.RegisterGradient, the backward op can access the gradient, the forward op's inputs and outputs, and its attributes. What else does cudnnMultiHeadAttnBackward need?
I think cudnnMultiHeadAttnBackward is designed for training. Do you have any suggestions for developing a TensorFlow op or PyTorch layer?

sstepniewski · July 15, 2021, 4:42pm

The cuDNN library offers a lower level API that does not perform memory management. The caller needs to allocate and supply all input and output buffers. Some APIs need temporary storage just for the duration of one call. In the cuDNN nomenclature we call this storage a work-space buffer. Moreover, there could be a need for one more buffer type to exchange data between “forward”, “backward data”, and “backward weights” calls. We call this buffer the reserve-space buffer. The type of information stored in the reserve-space buffer is dictated by back-propagation math of a particular DL model.

Yes, cudnnMultiHeadAttnBackwardWeights() is designed for training to compute exact, first order, partial derivatives of the error function with respect to all trainable model parameters. It is possible to automatically differentiate “forward” code based on the forward operation graph. cuDNN does not use this solution. The “backward data” and “backward weights” routines are hand coded and optimized.
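
The reserve-space idea can be illustrated with a toy example. This is a minimal Python sketch, not cuDNN code: it uses a plain softmax (one ingredient of attention) and a dict as a stand-in for the reserve buffer. What cuDNN actually stores in its reserve space is not specified in this thread; the function names here are hypothetical.

```python
import math


def softmax_forward(scores, reserve, training=True):
    """Forward pass; in training mode it stashes the probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    if training:
        reserve["probs"] = probs  # intermediate result kept for backward
    return probs


def softmax_backward_data(grad_out, reserve):
    """Hand-written gradient w.r.t. scores, using only the saved result."""
    if "probs" not in reserve:
        raise RuntimeError("forward must run (in training mode) first")
    probs = reserve["probs"]
    dot = sum(g * p for g, p in zip(grad_out, probs))
    return [p * (g - dot) for g, p in zip(grad_out, probs)]


reserve = {}
softmax_forward([0.0, 0.0], reserve)
grad_in = softmax_backward_data([1.0, 0.0], reserve)
print(grad_in)  # [0.25, -0.25]
```

The backward routine never sees the original scores; it only needs the probabilities that forward saved, which is why calling it before forward cannot work in this sketch.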

ding9801 · July 20, 2021, 9:40am

Thanks for your reply.

I find that the performance is bad.

I added timers as shown below:

For IS_FORWARD == 1

double start = seconds();
cudnnMultiHeadAttnForward
cudaDeviceSynchronize();
double stop = seconds();

For backward, IS_FORWARD == 0

double start = seconds();
cudnnMultiHeadAttnForward
cudnnMultiHeadAttnBackwardData
cudnnMultiHeadAttnBackwardWeights
cudaDeviceSynchronize();
double stop = seconds();

duration = stop - start;

The result:

IS_FORWARD 1, Elapsed time = 0.000691891 sec
IS_FORWARD 0, Elapsed time = 0.262995 sec
IS_FORWARD 1, Elapsed time = 0.000668049 sec
IS_FORWARD 0, Elapsed time = 0.262635 sec
IS_FORWARD 1, Elapsed time = 0.000663996 sec
IS_FORWARD 0, Elapsed time = 0.263343 sec
IS_FORWARD 1, Elapsed time = 0.000674963 sec
IS_FORWARD 0, Elapsed time = 0.263066 sec
IS_FORWARD 1, Elapsed time = 0.000671148 sec
IS_FORWARD 0, Elapsed time = 0.262862 sec
IS_FORWARD 1, Elapsed time = 0.000673056 sec
IS_FORWARD 0, Elapsed time = 0.262764 sec
IS_FORWARD 1, Elapsed time = 0.000664949 sec
IS_FORWARD 0, Elapsed time = 0.262851 sec
IS_FORWARD 1, Elapsed time = 0.000669003 sec
IS_FORWARD 0, Elapsed time = 0.262641 sec
IS_FORWARD 1, Elapsed time = 0.000679016 sec
IS_FORWARD 0, Elapsed time = 0.262704 sec
IS_FORWARD 1, Elapsed time = 0.000658989 sec
IS_FORWARD 0, Elapsed time = 0.262794 sec
IS_FORWARD 1, Elapsed time = 0.000673056 sec
IS_FORWARD 0, Elapsed time = 0.263157 sec
IS_FORWARD 1, Elapsed time = 0.000689983 sec
IS_FORWARD 0, Elapsed time = 0.262993 sec
IS_FORWARD 1, Elapsed time = 0.000663996 sec
IS_FORWARD 0, Elapsed time = 0.262549 sec

The backward run (forward + backward data + backward weights) takes more than 260 ms.
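
As a rough check on the log above, the following Python sketch averages the reported times and estimates how much of the IS_FORWARD 0 measurement is attributable to the two backward calls. It only does arithmetic on the numbers already listed; subtracting the forward-only mean is an approximation, since the two runs were timed separately.

```python
from statistics import mean

# Elapsed times in seconds, copied from the log (IS_FORWARD 1).
forward_only = [
    0.000691891, 0.000668049, 0.000663996, 0.000674963, 0.000671148,
    0.000673056, 0.000664949, 0.000669003, 0.000679016, 0.000658989,
    0.000673056, 0.000689983, 0.000663996,
]
# Elapsed times in seconds, copied from the log (IS_FORWARD 0).
forward_plus_backward = [
    0.262995, 0.262635, 0.263343, 0.263066, 0.262862, 0.262764, 0.262851,
    0.262641, 0.262704, 0.262794, 0.263157, 0.262993, 0.262549,
]

mean_forward = mean(forward_only)
mean_total = mean(forward_plus_backward)
mean_backward = mean_total - mean_forward  # approximate backward-only time
ratio = mean_backward / mean_forward

print(f"forward only:       {mean_forward * 1e3:.3f} ms")
print(f"forward + backward: {mean_total * 1e3:.3f} ms")
print(f"backward / forward: {ratio:.0f}x")
```

On these numbers the two backward calls account for essentially all of the measured time, a few hundred times the cost of the forward call.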

Config:

#### attnDataType    = 0 (FP32)
#### attnNumHeads    = 16
#### attnBatchSize   = 1
#### attnBeamSize    = 1
#### attnSmScaler    = 1.0000e+00
#### attnDropoutRate = 0.0000
#### attnQsize       = 1024
#### attnKsize       = 1024
#### attnVsize       = 1024
#### attnProjQsize   = 64
#### attnProjKsize   = 64
#### attnProjVsize   = 64
#### attnProjOsize   = 1024
#### attnSeqLenQ     = 384
#### attnSeqLenK     = 384
#### attnDataLayout  = 0 (T,N,B,V)
#### attnResLink     = 0
#### attnSweep       = 0
#### attnRandGeom    = 0
#### attnRandSeed    = 1234
#### attnFileDump    = 0
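
For scale, here is a small Python sketch of how many projection weights this configuration implies. It assumes the standard multi-head attention shapes (one input-to-projection matrix per head for Q, K, and V, and an output projection from the concatenated heads), with no biases and no alignment gaps. The actual buffer size should be queried with cudnnGetMultiHeadAttnBuffers() and may be larger.

```python
num_heads = 16
q_size = k_size = v_size = 1024
proj_q = proj_k = proj_v = 64
proj_o = 1024
bytes_per_element = 4  # FP32

# Element counts per weight group under the assumptions stated above.
per_group = {
    "wq": num_heads * proj_q * q_size,
    "wk": num_heads * proj_k * k_size,
    "wv": num_heads * proj_v * v_size,
    "wo": proj_o * num_heads * proj_v,
}
total_elements = sum(per_group.values())
total_bytes = total_elements * bytes_per_element

print(per_group)
print(total_elements, "elements,", total_bytes / 2**20, "MiB")
```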

Any suggestion to improve the performance?

ding9801 · July 21, 2021, 1:24am

You can reproduce the performance issue with the cuDNN multiHeadAttention sample. The three APIs below take 262 ms in total. Is that reasonable?

cudnnMultiHeadAttnForward
cudnnMultiHeadAttnBackwardData
cudnnMultiHeadAttnBackwardWeights

I have implemented a TensorFlow custom op with this cuDNN API and used it to replace BERT's multi-head attention. The default BERT training throughput is 13 samples/sec; after the replacement, the throughput is only 2.7 samples/sec.

yanxu · July 22, 2021, 7:28am

Hi @ding9801 thanks for your interest in cuDNN MHA!

Yes, the backward pass is not well optimized yet, as previously we were focusing on the forward inference use cases. We have engineers starting to work on the back prop right now, and we hope to deliver a much better optimized backward pass in the next few public releases.

ding9801 · December 13, 2021, 3:18am

I found “Significant performance improvement …” in the 8.3.0 release notes. Thanks.

https://docs.nvidia.com/deeplearning/cudnn/release-notes/rel_8.html#rel-830
“Significant performance improvement out of the box (no changes are required from users) for both multi-head attention forward and backward paths.”

yanxu · December 15, 2021, 6:51pm

Hi @ding9801 yes, we have made major improvements in cuDNN 8.3.0, and additional small fixes and improvements have been made in the following releases (8.3.1, and the upcoming 8.3.2). I would recommend you try out the latest release. Thanks!

system · December 29, 2021, 6:52pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
