I was experimenting with the Riva STT “word boosting” capability by following this doc, which says “Out-of-vocabulary (OOV) word boosting is supported.”
But when I try to boost an OOV word, I get the following errors:
E1201 10:11:54.271695 77 streaming_asr_ensemble.cc:1286] Caught exception: Unable to encode the word <MY_OOV_boosted_WORD>
E1201 10:11:54.271879 1856266 grpc_riva_asr.cc:1465] Received error from Triton: Unable to encode the word <MY_OOV_boosted_WORD>
Then I found this doc, which says “Flashlight decoder used in Riva is a lexicon-based decoder and only emits words that are present in the decoder vocabulary file.”
When I add <MY_OOV_boosted_WORD> to the lexicon file, the encoding error above goes away.
Could you please tell me how I can boost OOV words in Riva? If you need any other information, please let me know.
I have the same problem. For example, with the Riva Parakeet-CTC-XXL-1.1B English ASR model, word boosting works correctly. With Parakeet-TDT_CTC-110M, however, word boosting fails with the error "Unable to encode the word <MY_OOV_boosted_WORD>". The difference between these models is the tokenizer: Parakeet-CTC-XXL-1.1B uses SentencePiece Unigram, while Parakeet-TDT_CTC-110M uses BPE. If the words to be boosted are added to the lexicon file, boosting works for the 110M model as well. For out-of-vocabulary (OOV) words, however, Riva encodes the words on the fly, and this feature only works with SentencePiece Unigram tokenizers. Is there a fix, or are there instructions, for boosting OOV words on models with a BPE tokenizer?
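For anyone trying the lexicon workaround mentioned above, here is a minimal sketch of generating the extra lexicon lines. It assumes the lexicon is a flat text file with one entry per line: the word, a tab, then the word's tokens separated by spaces. The names lexicon_lines and toy_tokenize are hypothetical, and toy_tokenize is only a stand-in that splits a word into characters; in practice you would pass in the tokenizer that ships with your model so the pieces match its vocabulary.

```python
from typing import Callable, Iterable, List

WORD_START = "\u2581"  # the "▁" marker SentencePiece-style tokenizers put at the start of a word


def toy_tokenize(word: str) -> List[str]:
    """Stand-in tokenizer (an assumption for illustration): one piece per character."""
    return [WORD_START + word[0]] + list(word[1:])


def lexicon_lines(words: Iterable[str], tokenize: Callable[[str], List[str]]) -> List[str]:
    """Build 'word<TAB>piece piece ...' lines for the words to be boosted."""
    lines = []
    for word in words:
        word = word.strip()
        if not word:
            continue  # skip blank entries
        lines.append(word + "\t" + " ".join(tokenize(word)))
    return lines


if __name__ == "__main__":
    for line in lexicon_lines(["riva", "parakeet"], toy_tokenize):
        print(line)
```

The printed lines could then be appended to the lexicon file before rebuilding the pipeline. Whether a character-level split is acceptable depends on the model's tokenizer, so treat the toy tokenizer purely as a placeholder.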
Out-of-vocabulary means words that are not present in the vocab.txt used during riva-build.
Boosting will not work for characters outside the model’s tokenizer.
Characters like ‘<’ and ‘>’ are not part of the tokenizer, hence the error.
If you need to boost characters that are not part of the model’s tokenizer, please wait for our next Parakeet CTC NIM release.
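To check ahead of time whether a word contains characters the tokenizer cannot represent, something like the following rough sketch may help. It assumes you have the tokenizer's vocabulary as a list of token strings (for example, read from the tokenizer's vocabulary file); uncovered_chars is a hypothetical helper name, not part of Riva. It only checks character coverage and does not guarantee that the word can be encoded.

```python
from typing import Iterable, List


def uncovered_chars(word: str, tokenizer_vocab: Iterable[str]) -> List[str]:
    """Return the characters of `word` that appear in no tokenizer token, sorted."""
    covered = set()
    for token in tokenizer_vocab:
        covered.update(token)  # collect every character used by any token
    return sorted(set(word) - covered)


if __name__ == "__main__":
    # Toy vocabulary, for illustration only.
    vocab = ["\u2581n", "a", "m", "e"]
    print(uncovered_chars("<name>", vocab))  # characters such as '<' and '>' are reported
```

If the returned list is non-empty, the word contains characters outside the tokenizer and, per the reply above, boosting it is expected to fail. Note that a tokenizer with a lowercase-only vocabulary would also report uppercase letters here.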