Originally published at: https://developer.nvidia.com/blog/training-localized-multilingual-llms-with-nvidia-nemo-part-1/
In today’s globalized world, the ability of AI systems to understand and communicate in diverse languages is increasingly crucial. Large language models (LLMs) have revolutionized the field of natural language processing, enabling AI to generate human-like text, answer questions, and perform various language tasks. However, most mainstream LLMs are trained on data corpora that primarily…
In this post, we covered training a localized LLM with additional language support on top of a foundation LLM. We mainly described how to train a customized tokenizer (Part 1) and how to integrate that tokenizer for continual pretraining in NeMo (Part 2).
This pipeline uses a Thai dataset as an example. Please try it out and share your feedback on constructing localized multilingual LLMs with NeMo.
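For reference, here is a minimal sketch of the tokenizer-training step using the sentencepiece library. The file paths, vocabulary size, and other parameters are illustrative assumptions, not the exact settings from the post:

```python
# Minimal sketch: train a SentencePiece tokenizer on a Thai text corpus.
# Paths and hyperparameters below are assumptions for illustration,
# not the exact values used in the blog post.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="thai_corpus.txt",          # one sentence per line (hypothetical path)
    model_prefix="thai_tokenizer",    # writes thai_tokenizer.model / .vocab
    vocab_size=8000,                  # example size; tune for your corpus
    character_coverage=0.9995,        # high coverage for Thai script
    model_type="bpe",
)

# Quick check: load the trained model and tokenize a Thai sentence.
sp = spm.SentencePieceProcessor(model_file="thai_tokenizer.model")
print(sp.encode("สวัสดีครับ", out_type=str))
```

The resulting vocabulary can then be combined with the foundation model's tokenizer and used for continual pretraining in NeMo, as described in the post.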
Can this process be used for any other LLM models, or are there any limitations?
You can repeat this process for any of the LLMs supported by NeMo: Llama, Mistral, Gemma, and many more. See the full list here: Large Language Models — NVIDIA NeMo Framework User Guide latest documentation