Originally published at: https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/
Large language models (LLMs) are now a dominant force in natural language processing and understanding, thanks to their effectiveness and versatility. LLMs such as Llama 3.1 405B and NVIDIA Nemotron-4 340B excel in many challenging tasks, including coding, reasoning, and math. They are, however, resource-intensive to deploy. As such, there is another trend in the…
Can you please share the GitHub repo or notebook that runs this flow, from the bigger Llama model to the 4B Llama model, using pruning and distillation?
The Llama-3.1-Minitron model links on Hugging Face are currently broken.
If I may, I’d like to contribute an instruction set that I developed along with AI, which initially showed a 40-60% improvement in efficiency and energy consumption for already-trained models. This article reminded me of it. I’m also experimenting with a weighted node system for another project. Great minds think alike! Let me know if you’re interested. Here is the instruction set; feel free to refine it for your model as needed:
The links have been updated. Please try it out now!
Distillation support is currently present in NeMo. (NeMo/docs/source/nlp/distillation.rst at main · NVIDIA/NeMo · GitHub)
The notebook and examples will be released in the coming weeks, stay tuned!
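While the official notebook is pending, here is a minimal sketch of the core idea behind logit distillation (KL divergence between temperature-softened teacher and student outputs), written in plain PyTorch. It is a conceptual illustration only, not the NeMo distillation API linked above; the function name and shapes are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label KD loss: KL divergence between softened teacher and student distributions."""
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

if __name__ == "__main__":
    # Toy example: distill "teacher" logits into a smaller "student".
    batch, vocab = 4, 32000
    teacher_logits = torch.randn(batch, vocab)
    student_logits = torch.randn(batch, vocab, requires_grad=True)
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()
    print(f"KD loss: {loss.item():.4f}")
```

In practice this soft-label term is usually combined with the standard next-token cross-entropy loss on the student; the NeMo documentation linked above describes the supported workflow.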
I’m interested in how you were able to hit an XLSum (en) ROUGE-L of 0.3005 with Llama-3.1 8B. Can you share how this was produced? I’d like to replicate your prompts. Thank you.
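For anyone else trying to reproduce a ROUGE-L number on XLSum-style summaries, here is a hedged sketch using the open-source `rouge-score` package. It is not the evaluation harness or prompt set used in the blog post, and the example texts are placeholders; only the metric computation is shown.

```python
from rouge_score import rouge_scorer

def average_rouge_l(references, predictions):
    """Average ROUGE-L F1 over paired reference/prediction summaries."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [
        scorer.score(ref, pred)["rougeL"].fmeasure
        for ref, pred in zip(references, predictions)
    ]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Placeholder texts; real evaluation would use XLSum (en) references
    # and model-generated summaries.
    refs = ["The central bank raised interest rates to curb inflation."]
    preds = ["The bank increased rates to fight rising inflation."]
    print(f"ROUGE-L F1: {average_rouge_l(refs, preds):.4f}")
```

The reported score would still depend on the prompt template, decoding settings, and any answer post-processing, which is why the original prompts matter for replication.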