Please find below the guidelines for updating the lexicon file mapping. I’ll ask engineering to add a section to the documentation.
With the Citrinet acoustic model, the riva-build command from the docs is:
riva-build speech_recognition -f citrinet.rmir citrinet.riva --name=citrinet --decoder_type=flashlight --chunk_size=0.16 --padding_size=1.92 --ms_per_timestep=80 --flashlight_decoder.asr_model_delay=-1 --featurizer.use_utterance_norm_params=False --featurizer.precalc_norm_time_steps=0 --featurizer.precalc_norm_params=False --decoding_language_model_binary=lm.binary --decoding_vocab=decoding_vocab.txt
The first thing the customer should verify is that the word is in the decoder lexicon file passed with the --decoding_vocab parameter (decoding_vocab.txt in the command above). For example, if the customer wants the word “manu” to be predicted, they should add a line containing the word “manu” to decoding_vocab.txt. If adding the word to the decoder lexicon doesn’t resolve the issue, the customer should also try the following.
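As a quick way to perform the vocabulary check above before moving on, here is a minimal shell sketch. It assumes decoding_vocab.txt lists one word per line and ends with a newline; the function name ensure_word_in_vocab is only illustrative and is not part of Riva.

```shell
# Append a word to the decoder vocabulary file unless it is already listed.
# Assumption: the file holds one word per line and ends with a newline.
ensure_word_in_vocab() {
    word="$1"
    vocab_file="$2"
    # -x matches the whole line, -F treats the word as a literal string.
    if grep -qxF -- "$word" "$vocab_file"; then
        echo "'$word' is already in $vocab_file"
    else
        printf '%s\n' "$word" >> "$vocab_file"
        echo "added '$word' to $vocab_file"
    fi
}
```

For the example in this post, that would be called as: ensure_word_in_vocab manu decoding_vocab.txt. Running it twice leaves a single “manu” line in the file.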
First, generate the model repository by running riva-deploy:
riva-deploy -f citrinet.rmir /data/models/
This will generate the lexicon file used by the decoder at: /data/models/citrinet-ctc-decoder-cpu-streaming/1/lexicon.txt
The customer should make a copy of that file:
cp /data/models/citrinet-ctc-decoder-cpu-streaming/1/lexicon.txt decoding_lexicon.txt
and modify it to add the sentencepiece encoding for the word. For example, one could add:
manu ▁ma n u
manu ▁man n n ew
manu ▁man n ew
to the file decoding_lexicon.txt. With those changes, if the acoustic model predicts “▁ma n u”, “▁man n n ew”, or “▁man n ew”, the decoder should output “manu”. The customer should make sure that the new lines follow the indentation/spacing pattern of the rest of the file and that the tokens used are part of the tokenizer model. Once this is done, the customer can regenerate the model repository with the new decoding lexicon by passing --decoding_lexicon=decoding_lexicon.txt to riva-build instead of --decoding_vocab=decoding_vocab.txt, and then running riva-deploy again:
riva-build speech_recognition -f citrinet.rmir citrinet.riva --name=citrinet --decoder_type=flashlight --chunk_size=0.16 --padding_size=1.92 --ms_per_timestep=80 --flashlight_decoder.asr_model_delay=-1 --featurizer.use_utterance_norm_params=False --featurizer.precalc_norm_time_steps=0 --featurizer.precalc_norm_params=False --decoding_language_model_binary=lm.binary --decoding_lexicon=decoding_lexicon.txt
riva-deploy -f citrinet.rmir /data/models/
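Before rebuilding, it can help to sanity-check the new lexicon lines. The sketch below is a rough heuristic rather than an official Riva tool: it assumes each lexicon line is a word followed by whitespace-separated tokens, and it flags any token in the new entries that never appears in the lexicon generated by riva-deploy. A flagged token is not necessarily invalid, since the generated lexicon may not use every piece of the tokenizer model, but it is worth checking against the tokenizer. The function name check_lexicon_tokens and the file new_entries.txt are hypothetical.

```shell
# Report tokens in candidate lexicon lines that never appear in the generated lexicon.
# Usage: check_lexicon_tokens <generated lexicon> <file with the new lines>
# Assumption: each line is "<word> <token> <token> ...", separated by tabs or spaces.
check_lexicon_tokens() {
    awk '
        # First file: remember every token (fields 2..NF) used by existing entries.
        NR == FNR { for (i = 2; i <= NF; i++) known[$i] = 1; next }
        # Second file: flag tokens that were not seen in the first file.
        {
            for (i = 2; i <= NF; i++)
                if (!($i in known)) {
                    printf "unknown token \"%s\" in entry for \"%s\"\n", $i, $1
                    bad = 1
                }
        }
        # Exit status is 1 if any unknown token was found, 0 otherwise.
        END { exit bad + 0 }
    ' "$1" "$2"
}
```

With the three “manu” lines saved in a hypothetical new_entries.txt, the check would be run as: check_lexicon_tokens /data/models/citrinet-ctc-decoder-cpu-streaming/1/lexicon.txt new_entries.txt. If it prints nothing, the lines can be appended to decoding_lexicon.txt, keeping the same separator as the rest of the file.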