Streamlining Data Processing for Domain Adaptive Pretraining with NVIDIA NeMo Curator

Originally published at: https://developer.nvidia.com/blog/streamlining-data-processing-for-domain-adaptive-pretraining-with-nvidia-nemo-curator/

Domain-adaptive pretraining (DAPT) of large language models (LLMs) is an important step toward building domain-specific models. These models demonstrate greater capabilities on domain-specific tasks than their off-the-shelf open or commercial counterparts. Recently, NVIDIA published a paper on ChipNeMo, a family of foundation models geared toward industrial chip design applications. ChipNeMo models are…