GTC 2020: Training Biomedical & Clinical Language Models using BERT

GTC 2020 S21108
Presenters: Raghav Mani, NVIDIA; Hoo Chang Shin, NVIDIA; Anthony Costa, Icahn School of Medicine at Mount Sinai; Eric Oermann, Icahn School of Medicine at Mount Sinai
Abstract
Pre-trained language models like BERT, which are built on Transformer networks, have greatly improved performance on a wide variety of NLP tasks. However, making BERT perform as well on domain-specific text corpora, such as those in the biomedical domain, is not straightforward. The NVIDIA team will describe the general trends in the evolution of these language models and the tools they’ve created to efficiently train large domain-specific language models like BioBERT.

The Mount Sinai team will then talk about how they’re applying these tools and techniques to build clinical language models using what’s potentially the largest corpus of medical text to date, almost 8 times larger than the Wikipedia corpus. Given the difficulty of accessing massive clinical datasets, approaches to improving masked language model performance on clinical tasks have focused on transfer learning and ever-expanding parameter counts. The Mount Sinai team will discuss the implications of increasing pretraining data availability by orders of magnitude for masked language models, evaluating these pre-trained models on established clinical tasks such as named entity recognition and 30-day readmission prediction. They’ll delve into the model architecture, NLP experiments, and GPU configuration they used to facilitate this study.
