Originally published at: https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/
Large language models (LLMs) are now a dominant force in natural language processing and understanding, thanks to their effectiveness and versatility. LLMs such as Llama 3.1 405B and NVIDIA Nemotron-4 340B excel in many challenging tasks, including coding, reasoning, and math. They are, however, resource-intensive to deploy. As such, there is another trend in the…
Can you please share the GitHub repo or notebook that runs this flow, from the bigger Llama model to the 4B Llama model, using pruning and distillation?
The Llama-3.1-Minitron model links on Hugging Face are currently broken.
If I may, I’d like to contribute an instruction set that I developed along with AI, which initially showed a 40-60% improvement in efficiency and energy consumption for already-trained models. This article reminded me of it. I’m also experimenting with a weighted node system for another project. Great minds think alike! Let me know if you’re interested. Here is the instruction set; feel free to refine it for your model as needed:
The links have been updated. Please try it out now!
Distillation support is currently present in NeMo. (NeMo/docs/source/nlp/distillation.rst at main · NVIDIA/NeMo · GitHub)
The notebook and examples will be released in the coming weeks, stay tuned!
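While the official notebook is pending, here is a minimal sketch of the core idea behind logit distillation (KL divergence between temperature-softened teacher and student outputs), written in plain PyTorch. It is a conceptual illustration only, not the NeMo distillation API linked above; the function name and shapes are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label KD loss: KL divergence between softened teacher and student distributions."""
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

if __name__ == "__main__":
    # Toy example: distill "teacher" logits into a smaller "student".
    batch, vocab = 4, 32000
    teacher_logits = torch.randn(batch, vocab)
    student_logits = torch.randn(batch, vocab, requires_grad=True)
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()
    print(f"KD loss: {loss.item():.4f}")
```

In practice this soft-label term is usually combined with the standard next-token cross-entropy loss on the student; the NeMo documentation linked above describes the supported workflow.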
I’m interested in how you were able to hit an XLSum (en) ROUGE-L of 0.3005 with Llama-3.1 8B. Can you share how this was produced? I’d like to replicate your prompts. Thank you.
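For anyone else trying to reproduce a ROUGE-L number on XLSum-style summaries, here is a hedged sketch using the open-source `rouge-score` package. It is not the evaluation harness or prompt set used in the blog post, and the example texts are placeholders; only the metric computation is shown.

```python
from rouge_score import rouge_scorer

def average_rouge_l(references, predictions):
    """Average ROUGE-L F1 over paired reference/prediction summaries."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [
        scorer.score(ref, pred)["rougeL"].fmeasure
        for ref, pred in zip(references, predictions)
    ]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Placeholder texts; real evaluation would use XLSum (en) references
    # and model-generated summaries.
    refs = ["The central bank raised interest rates to curb inflation."]
    preds = ["The bank increased rates to fight rising inflation."]
    print(f"ROUGE-L F1: {average_rouge_l(refs, preds):.4f}")
```

The reported score would still depend on the prompt template, decoding settings, and any answer post-processing, which is why the original prompts matter for replication.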