Update lexicon file - guide - for Citrinet

Please find guidelines to update lexicon file mapping below – I’ll request engineering to add a section to the documentation.

With the Citrinet acoustic model, the riva-build command from the docs is:

riva-build speech_recognition -f citrinet.rmir citrinet.riva --name=citrinet --decoder_type=flashlight --chunk_size=0.16 --padding_size=1.92 --ms_per_timestep=80 --flashlight_decoder.asr_model_delay=-1 --featurizer.use_utterance_norm_params=False --featurizer.precalc_norm_time_steps=0 --featurizer.precalc_norm_params=False --decoding_language_model_binary=lm.binary --decoding_vocab=decoding_vocab.txt

The first thing the customer should verify is that the word is in the decoder lexicon file passed with the --decoding_vocab parameter (decoding_vocab.txt in the command above). For example, if the customer wants the word “manu” to be predicted, they should add a line containing the word “manu” to decoding_vocab.txt. If adding the word to the decoder lexicon does not resolve the issue, the customer should also try the following.
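The vocabulary check described above can be scripted. The following is a minimal sketch, assuming decoding_vocab.txt holds exactly one word per line; the function name add_word_to_vocab is illustrative and not part of Riva.

```shell
# Append a word to a one-word-per-line vocabulary file unless it is already there.
# Assumption: the file passed to --decoding_vocab lists exactly one word per line.
add_word_to_vocab() {
    word="$1"
    vocab_file="$2"
    # -x matches whole lines only, -F treats the word as a literal string.
    if grep -qxF -- "$word" "$vocab_file" 2>/dev/null; then
        echo "'$word' is already in $vocab_file"
        return 0
    fi
    # If the file does not end with a newline, add one so the new word
    # is not glued onto the last existing entry.
    if [ -n "$(tail -c1 "$vocab_file" 2>/dev/null)" ]; then
        echo >> "$vocab_file"
    fi
    printf '%s\n' "$word" >> "$vocab_file"
    echo "added '$word' to $vocab_file"
}
```

For the example in this guide, the call would be: add_word_to_vocab manu decoding_vocab.txt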

First, generate the model repository by running riva-deploy:

riva-deploy -f citrinet.rmir /data/models/

This will generate the lexicon file used by the decoder at: /data/models/citrinet-ctc-decoder-cpu-streaming/1/lexicon.txt

The customer should make a copy of that file:

cp /data/models/citrinet-ctc-decoder-cpu-streaming/1/lexicon.txt decoding_lexicon.txt

and modify it to add the sentencepiece encoding for the word. For example, one could add:

manu ▁ma n u

manu ▁man n n ew

manu ▁man n ew

to the file decoding_lexicon.txt. With those changes, if the acoustic model predicts “▁ma n u”, “▁man n n ew”, or “▁man n ew”, the decoder should predict manu. The customer should make sure that the new lines follow the indentation/spacing pattern of the rest of the file and that the tokens used are part of the tokenizer model. Once this is done, the customer can regenerate the model repository with the new decoding lexicon by passing --decoding_lexicon=decoding_lexicon.txt to riva-build instead of --decoding_vocab=decoding_vocab.txt.

riva-build speech_recognition -f citrinet.rmir citrinet.riva --name=citrinet --decoder_type=flashlight --chunk_size=0.16 --padding_size=1.92 --ms_per_timestep=80 --flashlight_decoder.asr_model_delay=-1 --featurizer.use_utterance_norm_params=False --featurizer.precalc_norm_time_steps=0 --featurizer.precalc_norm_params=False --decoding_language_model_binary=lm.binary --decoding_lexicon=decoding_lexicon.txt

riva-deploy -f citrinet.rmir /data/models/
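Before rebuilding, it may be worth checking the requirement that every token in the new lexicon entries is part of the tokenizer model. The sketch below assumes you have a plain-text list of valid tokens with the token in the first whitespace-separated column of each line (for example, a sentencepiece .vocab export); the function name check_lexicon_tokens and the file name tokenizer.vocab are hypothetical.

```shell
# Report lexicon tokens that do not appear in a list of valid tokenizer tokens.
# Assumptions: the token list has one token in the first column of each line
# and is not empty; lexicon lines look like "word token token ...".
check_lexicon_tokens() {
    token_list="$1"
    lexicon="$2"
    awk '
        # First file: remember every valid token.
        NR == FNR { valid[$1] = 1; next }
        # Second file: field 1 is the word, the remaining fields are its tokens.
        {
            for (i = 2; i <= NF; i++) {
                if (!($i in valid)) {
                    printf "line %d: word \"%s\" uses unknown token \"%s\"\n", FNR, $1, $i
                    bad = 1
                }
            }
        }
        # Exit status is 1 if any unknown token was found, 0 otherwise.
        END { exit bad + 0 }
    ' "$token_list" "$lexicon"
}
```

A call such as check_lexicon_tokens tokenizer.vocab decoding_lexicon.txt prints one line per offending token and returns a non-zero status, so it can gate the riva-build step in a script.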

Hi,

We are currently checking on this; please allow us some time.
We will get back to you.

Thanks

Are there any updates here? We are having some trouble using the lexicon.

Hello @eswissa ,

Nice post! It is really helpful.

I am currently trying to build a language model on top of Citrinet. I need some medical terms to be predicted, so I created a vocab file with those words. However, when running Riva ASR, the transcription results were not as expected.

Then I came across this guideline and followed the instructions to create a lexicon file, hoping it would help improve the LM.
For example, one word whose transcription still needs improvement is the gene BRAF. It was being predicted and transcribed as b raf, so as an experiment I created the following entry in the lexicon file:

braf ▁b raf

After that, I rebuilt the language model with the riva-build command and then used Riva Quick Start to initialize and deploy the Riva server. When I execute the bash script riva_start.sh, the following error appears:

E:ctc-decoder-cpu.cc:269: Cannot initialize decoders. Error msg: Unknown entry in dictionary: 'raf'
E1125 09:55:24.926748 73 ctc-decoder-cpu.cc:270] Invalid parameters in model configuration

I guess the token raf needs to be defined somewhere in the model. What is the correct procedure to overcome this error?

Attached is the riva-speech server log file from loading the model.
riva-speech-start-logs-2411.txt (8.0 KB)

Thanks in advance,