Pretraining BERT with Layer-wise Adaptive Learning Rates

Originally published at: https://developer.nvidia.com/blog/pretraining-bert-with-layer-wise-adaptive-learning-rates/

Training with larger batches is a straightforward way to scale deep neural network training across more accelerators and reduce training time. However, as the batch size grows, numerical instability can appear during training. The purpose of this post is to provide an overview of one class of solutions to…
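
The class of solutions referred to here is layer-wise adaptive learning-rate methods such as LARS and LAMB, which the title alludes to. As a rough, hedged illustration of the core idea only (not the post's exact implementation), the sketch below scales each layer's update by a trust ratio ||w|| / ||update||, so no single layer takes a step that is too large relative to its weight norm; the function name `layerwise_sgd_step` and the PyTorch framing are assumptions for illustration.

```python
import torch

@torch.no_grad()
def layerwise_sgd_step(params, lr, weight_decay=0.01, eps=1e-6):
    """One plain-SGD step where each parameter tensor ("layer") gets its own
    learning-rate scale, a LARS/LAMB-style trust ratio ||w|| / ||g + wd * w||.
    Illustrative sketch only, not the optimizer used in the original post."""
    for p in params:
        if p.grad is None:
            continue
        update = p.grad + weight_decay * p          # gradient plus decoupled weight decay
        w_norm = p.norm().item()
        u_norm = update.norm().item()
        # Layers whose update is large relative to their weights are damped;
        # layers with small relative updates can take proportionally larger steps.
        trust_ratio = w_norm / (u_norm + eps) if w_norm > 0 and u_norm > 0 else 1.0
        p.add_(update, alpha=-lr * trust_ratio)

# Minimal usage example on a toy model (hypothetical, for illustration):
model = torch.nn.Linear(10, 10)
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
layerwise_sgd_step(model.parameters(), lr=1e-3)
```

In practice, LAMB applies the same trust-ratio idea on top of Adam-style moment estimates rather than raw gradients, which is what makes very large batch sizes viable for BERT pretraining.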