Originally published at: Pretraining BERT with Layer-wise Adaptive Learning Rates | NVIDIA Technical Blog
Training with larger batches is a straightforward way to scale deep neural network training across more accelerators and reduce training time. However, as the batch size increases, numerical instability can appear in the training process. The purpose of this post is to provide an overview of one class of solutions to…
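As context for the technique named in the title, here is a minimal sketch of a layer-wise adaptive learning rate update in the style of LARS. It is an illustration of the general idea only, not NVIDIA's implementation; the function name `lars_step` and its hyperparameters are assumptions for this example.

```python
import torch

def lars_step(params, base_lr=0.01, weight_decay=1e-4, eps=1e-8):
    """One SGD update with a per-layer trust ratio (LARS-style sketch).

    Each parameter tensor ("layer") gets its own effective learning rate:
        local_lr = base_lr * ||w|| / (||g + wd * w|| + eps)
    so layers whose gradients are large relative to their weights are not
    destabilized when the global batch size (and hence base_lr) grows.
    """
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            g = p.grad + weight_decay * p          # L2-regularized gradient
            w_norm = torch.norm(p)
            g_norm = torch.norm(g)
            # Trust ratio rescales the step per layer; fall back to 1.0
            # for uninitialized (zero-norm) weights.
            trust_ratio = w_norm / (g_norm + eps) if w_norm > 0 else 1.0
            p.add_(g, alpha=-base_lr * float(trust_ratio))
```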