GTC 2020 S21496
Presenters: Mostofa Patwary, NVIDIA; Raul Puri, NVIDIA
We’ll present an efficient model-parallel approach that requires only a few targeted modifications to existing PyTorch transformer implementations. Training ever-larger neural language models has recently been the best way to advance the state of the art in NLP applications. However, for models beyond a billion parameters, a single GPU doesn’t have enough memory to fit the model along with the additional parameters needed for training, so model parallelism is required to split the parameters across multiple GPUs. We’ll showcase our approach by training an 8.3 billion parameter transformer language model with 8-way model parallelism and 64-way data parallelism on 512 GPUs, making it the largest transformer-based language model ever trained. This model establishes new state-of-the-art results in downstream tasks.
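To make the GPU arithmetic concrete, here is a minimal sketch of how 512 ranks could be arranged into 8-way model-parallel groups and 64-way data-parallel groups. The rank layout (consecutive ranks share a model-parallel group) and the helper name `build_groups` are assumptions for illustration, not the presenters' actual code, and the per-GPU parameter count assumes an idealized even split.

```python
# Sizes taken from the abstract: 8-way model parallelism x 64-way data parallelism.
MODEL_PARALLEL_SIZE = 8
DATA_PARALLEL_SIZE = 64
WORLD_SIZE = MODEL_PARALLEL_SIZE * DATA_PARALLEL_SIZE  # 8 * 64 = 512 GPUs


def build_groups(world_size, model_parallel_size):
    """Hypothetical helper: partition ranks into model- and data-parallel groups.

    Assumed layout: consecutive ranks form one model-parallel group (they hold
    different shards of the same model replica); ranks with the same position
    inside their model-parallel group form one data-parallel group (they hold
    the same shard and see different data).
    """
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be divisible by model_parallel_size")
    data_parallel_size = world_size // model_parallel_size

    model_groups = [
        list(range(i * model_parallel_size, (i + 1) * model_parallel_size))
        for i in range(data_parallel_size)
    ]
    data_groups = [
        list(range(j, world_size, model_parallel_size))
        for j in range(model_parallel_size)
    ]
    return model_groups, data_groups


model_groups, data_groups = build_groups(WORLD_SIZE, MODEL_PARALLEL_SIZE)

# Idealized estimate: 8.3 billion parameters split evenly over 8 shards.
params_per_gpu = 8.3e9 / MODEL_PARALLEL_SIZE

print(len(model_groups), "model-parallel groups of", len(model_groups[0]), "GPUs")
print(len(data_groups), "data-parallel groups of", len(data_groups[0]), "GPUs")
print(f"about {params_per_gpu / 1e9:.2f} billion parameters per GPU")
```

Under this layout, each of the 64 model replicas spans 8 GPUs, which brings the per-GPU share of the parameters down to roughly one billion, the scale the abstract describes as the single-GPU limit.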