Originally published at: Pretraining BERT with Layer-wise Adaptive Learning Rates | NVIDIA Technical Blog
Training with larger batches is a straightforward way to scale deep neural network training across more accelerators and reduce training time. However, as the batch size increases, numerical instability can appear in the training process. The purpose of this post is to provide an overview of one class of solutions to…
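As context for the technique named in the title, here is a minimal sketch of a layer-wise adaptive learning rate update in the style of LARS. It is an illustration of the general idea only, not NVIDIA's implementation; the function name `lars_step` and its hyperparameters are assumptions for this example.

```python
import torch

def lars_step(params, base_lr=0.01, weight_decay=1e-4, eps=1e-8):
    """One SGD update with a per-layer trust ratio (LARS-style sketch).

    Each parameter tensor ("layer") gets its own effective learning rate:
        local_lr = base_lr * ||w|| / (||g + wd * w|| + eps)
    so layers whose gradients are large relative to their weights are not
    destabilized when the global batch size (and hence base_lr) grows.
    """
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            g = p.grad + weight_decay * p          # L2-regularized gradient
            w_norm = torch.norm(p)
            g_norm = torch.norm(g)
            # Trust ratio rescales the step per layer; fall back to 1.0
            # for uninitialized (zero-norm) weights.
            trust_ratio = w_norm / (g_norm + eps) if w_norm > 0 else 1.0
            p.add_(g, alpha=-base_lr * float(trust_ratio))
```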