Hi NVIDIA team,
I am trying to follow the Riva / NeMo tutorial for n-gram LM training and fine-tuning:
“How To Train, Evaluate, and Fine-Tune an n-gram Language Model”
(official Riva tutorial)
My use case is with the model nvidia/parakeet-ctc-0.6b-Vietnamese.
From the tutorial, the LM adaptation workflow appears to require:
- generating an intermediate ARPA file, and
- using ngram_merge.py to interpolate between a base LM and a domain LM.
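To check my understanding of what the interpolation step computes, here is a toy sketch (pure Python, not NeMo code; the probability tables and alpha value are made up) of linear interpolation between a base LM and a domain LM:

```python
def interpolate(p_base, p_domain, alpha=0.5):
    """Linearly interpolate two probability tables:
    p(w) = alpha * p_base(w) + (1 - alpha) * p_domain(w).
    Toy unigram version of what an ARPA-level merge would do per n-gram."""
    vocab = set(p_base) | set(p_domain)
    return {
        w: alpha * p_base.get(w, 0.0) + (1 - alpha) * p_domain.get(w, 0.0)
        for w in vocab
    }

# Hypothetical distributions for illustration only.
base = {"xin": 0.4, "chao": 0.6}
domain = {"xin": 0.1, "loi": 0.9}
merged = interpolate(base, domain, alpha=0.7)
# merged["xin"] = 0.7 * 0.4 + 0.3 * 0.1 = 0.31
```

As I understand it, this is exactly the step that needs the base LM in ARPA form, which is why the missing .arpa file blocks me.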
However, in the Hugging Face repo for nvidia/parakeet-ctc-0.6b-Vietnamese, I can only find:
- a KenLM .bin file
- a lexicon file
I do not see the original .arpa LM.
So I would like to ask:
- Is the original ARPA LM for this model available anywhere?
- If not, what is the recommended way to adapt the provided LM for a specific domain?
- Should we train a new n-gram LM from text and use that directly for decoding?
- Or is there any supported way to recover / export ARPA from the provided KenLM .bin?
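Regarding the third question, for my own understanding I sketched the .arpa text format I would expect to produce. This is only a toy unsmoothed unigram writer (a real domain LM would of course be trained with KenLM's lmplz at a higher order, with smoothing); the corpus below is invented:

```python
import math
from collections import Counter

def build_unigram_arpa(sentences):
    """Emit a minimal unigram ARPA string from a list of sentences.
    Uses raw maximum-likelihood log10 probabilities with no smoothing,
    purely to illustrate the file layout."""
    counts = Counter(w for sent in sentences for w in sent.split())
    total = sum(counts.values())
    lines = ["\\data\\", f"ngram 1={len(counts)}", "", "\\1-grams:"]
    for w, c in sorted(counts.items()):
        lines.append(f"{math.log10(c / total):.6f}\t{w}")
    lines += ["", "\\end\\"]
    return "\n".join(lines)

# Hypothetical two-sentence "domain corpus" for illustration.
arpa_text = build_unigram_arpa(["xin chao", "xin loi"])
```

If training from scratch like this is indeed the recommended route, I would still like to know how (or whether) to combine the result with the released .bin LM.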
My understanding is that without the original ARPA, the official NeMo interpolation workflow cannot be applied directly to the released LM artifact.
Could you please advise on the recommended workflow for domain adaptation in this case?
Thanks a lot.