Scaling Language Model Training to a Trillion Parameters Using Megatron

Originally published at:

Natural Language Processing (NLP) has seen rapid progress in recent years as computation at scale has become more available and datasets have become larger. At the same time, recent work has shown large language models to be effective few-shot learners, with high accuracy on many NLP datasets without additional finetuning. As a result, state-of-the-art NLP…
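To give a rough sense of what "a trillion parameters" means, a common back-of-the-envelope estimate for a GPT-style transformer's parameter count is 12·l·h² (l layers, hidden size h), ignoring embeddings and biases. A minimal sketch, using an illustrative 128-layer, hidden-size-25,600 configuration (these numbers are assumptions for the example, not quoted from this post):

```python
# Back-of-the-envelope transformer parameter count: ~12 * layers * hidden^2.
# Each layer contributes ~4*h^2 attention weights and ~8*h^2 MLP weights
# (with the standard 4x MLP expansion); embeddings and biases are ignored.
def approx_params(num_layers: int, hidden_size: int) -> int:
    return 12 * num_layers * hidden_size ** 2

# Illustrative ~1T-parameter configuration (assumed for this sketch).
print(approx_params(128, 25_600))  # roughly 1.0e12 parameters
```

At that scale the weights alone are on the order of terabytes in FP16, which is why a single GPU cannot hold the model and multi-dimensional parallelism becomes necessary.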

Happy to answer questions on the post or the work more broadly! More details are in our arXiv paper: [2104.04473] Efficient Large-Scale Language Model Training on GPU Clusters.

Our work is open-sourced on GitHub at NVIDIA/Megatron-LM (ongoing research training transformer language models at scale, including BERT and GPT-2), and we would love for people to try it out!