Originally published at: LLM Model Pruning and Knowledge Distillation with NVIDIA NeMo Framework | NVIDIA Technical Blog
Model pruning and knowledge distillation are powerful, cost-effective strategies for obtaining smaller language models from an initial larger sibling.

- Pruning: Either drop entire layers (depth-pruning) or drop neurons, attention heads, and embedding channels (width-pruning); a toy sketch follows below.
- Knowledge distillation: Transfer knowledge from a large teacher model to a smaller student model, with the goal of creating a smaller, more efficient model; a loss-function sketch follows below.
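To make depth-pruning concrete: at its simplest, it removes whole transformer blocks from the stack so they no longer participate in the forward pass. The following is a minimal PyTorch sketch, not NeMo's pruning API; `depth_prune` and the toy block stack are illustrative assumptions.

```python
import torch.nn as nn

def depth_prune(layers: nn.ModuleList, keep: list[int]) -> nn.ModuleList:
    # Hypothetical helper: keep only the transformer blocks at the
    # requested indices; all other blocks are dropped from the stack.
    return nn.ModuleList(layers[i] for i in keep)

# Example: prune a toy 8-block stack down to 4 blocks by keeping
# every other block. (Real depth-pruning typically drops contiguous
# or low-importance blocks chosen by an importance metric.)
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=64, nhead=4) for _ in range(8)
)
pruned = depth_prune(blocks, keep=list(range(0, 8, 2)))
assert len(pruned) == 4
```

After pruning, the smaller model is normally fine-tuned (often via distillation, as below) to recover the accuracy lost by removing blocks.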
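Knowledge distillation is commonly implemented as a blend of the usual hard-label cross-entropy and a KL-divergence term between temperature-softened teacher and student logits. The sketch below assumes that standard formulation in plain PyTorch; it is illustrative, not NeMo's distillation API, and `distillation_loss` and its parameters are hypothetical names.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # student_logits, teacher_logits: (batch, num_classes) raw logits;
    # labels: (batch,) ground-truth class indices.

    # Hard-label term: ordinary cross-entropy against ground truth.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft-label term: KL divergence between temperature-softened
    # teacher and student distributions; the T^2 factor keeps the
    # gradient scale comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # alpha balances imitating the teacher against fitting hard labels.
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In practice the teacher is frozen (eval mode, under `torch.no_grad()`), so only the student receives gradient updates.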