LLM Model Pruning and Knowledge Distillation with NVIDIA NeMo Framework

Originally published at: LLM Model Pruning and Knowledge Distillation with NVIDIA NeMo Framework | NVIDIA Technical Blog

Model pruning and knowledge distillation are powerful, cost-effective strategies for obtaining smaller language models from a larger initial model.

- Pruning: drop entire layers (depth-pruning) or drop neurons, attention heads, and embedding channels (width-pruning).
- Knowledge distillation: transfer knowledge from a large teacher model to a smaller student model, with the goal of creating a more efficient,…
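To make the distillation idea concrete, here is a minimal NumPy sketch of the classic logit-matching objective: the student is trained to match the teacher's temperature-softened output distribution via a KL-divergence loss. This is an illustrative example only, not the NeMo Framework API; the function names and the choice of temperature are assumptions for the sketch.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions.

    A higher temperature exposes more of the teacher's 'dark knowledge'
    (relative probabilities of incorrect classes). The T^2 factor keeps
    gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return float(kl.mean() * temperature ** 2)

# Toy example: one token position, vocabulary of 3
teacher = np.array([[4.0, 1.0, 0.5]])
student = np.array([[2.5, 1.5, 1.0]])

loss = distillation_loss(teacher, student)
print(f"distillation loss: {loss:.4f}")  # positive while distributions differ
```

In practice the student minimizes this term (often combined with a standard cross-entropy loss on ground-truth labels) during distillation training; when the student's distribution matches the teacher's, the loss goes to zero.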