Update lexicon file - guide - for Citrinet

Please find guidelines for updating the lexicon file mapping below; I'll request that engineering add a section on this to the documentation.

With the Citrinet acoustic model, the riva-build command from the docs is:

riva-build speech_recognition -f citrinet.rmir citrinet.riva --name=citrinet --decoder_type=flashlight --chunk_size=0.16 --padding_size=1.92 --ms_per_timestep=80 --flashlight_decoder.asr_model_delay=-1 --featurizer.use_utterance_norm_params=False --featurizer.precalc_norm_time_steps=0 --featurizer.precalc_norm_params=False --decoding_language_model_binary=lm.binary --decoding_vocab=decoding_vocab.txt

The first thing the customer should verify is that the word is in the vocabulary file passed via the --decoding_vocab parameter (decoding_vocab.txt in the command above). For example, if the customer wants the word “manu” to be predicted, they should add a line with the word “manu” to decoding_vocab.txt.
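One minimal way to do that check and append the word only if it is missing (a sketch, assuming the decoding_vocab.txt from the command above):

grep -qx "manu" decoding_vocab.txt || echo "manu" >> decoding_vocab.txt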

If adding the word to the decoding vocabulary doesn't resolve the issue, the customer should also try the following. First, generate the model repository by running riva-deploy:

riva-deploy -f citrinet.rmir /data/models/

This will generate the lexicon file used by the decoder at: /data/models/citrinet-ctc-decoder-cpu-streaming/1/lexicon.txt
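Each line of this file maps a word to a sequence of sentencepiece pieces. To see the exact format new entries need to follow, it can help to take a quick look first, for example:

head -n 5 /data/models/citrinet-ctc-decoder-cpu-streaming/1/lexicon.txt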

The customer should make a copy of that file:

cp /data/models/citrinet-ctc-decoder-cpu-streaming/1/lexicon.txt decoding_lexicon.txt

and modify it to add the sentencepiece encoding for the word. For example, one could add:

manu ▁ma n u

manu ▁man n n ew

manu ▁man n ew

to the file decoding_lexicon.txt. With those changes, if the acoustic model predicts “▁ma n u”, “▁man n n ew”, or “▁man n ew”, the decoder should output “manu”. The customer should make sure that the new lines follow the same indentation/space pattern as the rest of the file and that the tokens used are part of the tokenizer model (a quick way to check this is sketched after the commands below). Once this is done, the customer can regenerate the model repository using the new decoding lexicon by passing --decoding_lexicon=decoding_lexicon.txt to riva-build instead of --decoding_vocab=decoding_vocab.txt.

riva-build speech_recognition -f citrinet.rmir citrinet.riva --name=citrinet --decoder_type=flashlight --chunk_size=0.16 --padding_size=1.92 --ms_per_timestep=80 --flashlight_decoder.asr_model_delay=-1 --featurizer.use_utterance_norm_params=False --featurizer.precalc_norm_time_steps=0 --featurizer.precalc_norm_params=False --decoding_language_model_binary=lm.binary --decoding_lexicon=decoding_lexicon.txt

riva-deploy -f citrinet.rmir /data/models/
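Regarding the token check mentioned above: one way to confirm that every piece used in the new lexicon entries exists in the tokenizer model is to export the tokenizer vocabulary with the sentencepiece tools and grep against it. This is only a sketch; it assumes the tokenizer model from the deployed pipeline (it sits next to lexicon.txt and is named <hash>_tokenizer.model) and uses the pieces from the example entries above:

spm_export_vocab --model=/data/models/citrinet-ctc-decoder-cpu-streaming/1/<hash>_tokenizer.model --output=tokenizer_vocab.txt
# tokenizer_vocab.txt has one "<piece><TAB><score>" per line; flag any piece that is missing
for piece in "▁ma" "▁man" "n" "u" "ew"; do
  cut -f1 tokenizer_vocab.txt | grep -qxF "$piece" || echo "not in tokenizer model: $piece"
done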


Hi,

We are currently checking on this; please allow us some time.
We will get back to you.

Thanks

Are there any updates here? We are having some trouble using the lexicon.

Hello @eswissa,

Nice post! It is really helpful.

I am currently trying to build a language model on top of Citrinet. I need some medical terms to be predicted, so I created a vocab file with those words. However, when running Riva ASR, the transcription results weren't as expected.

Then I came across this guideline and followed the instructions to create a lexicon file, hoping it would help improve the LM.
As an example, one of the words whose transcription still has room for improvement is the gene BRAF. It was being predicted and transcribed as “b raf”. Hence I ran an experiment and created the following entry in the lexicon file:

braf ▁b raf

After that, I rebuilt the language model with the riva-build command and then used Riva Quick Start to init and deploy the Riva server. When executing the bash script riva_start.sh, the following error appears:

E:ctc-decoder-cpu.cc:269: Cannot initialize decoders. Error msg: Unknown entry in dictionary: 'raf'
E1125 09:55:24.926748 73 ctc-decoder-cpu.cc:270] Invalid parameters in model configuration

I guess the token raf should be placed somewhere in the model. What is the correct procedure to overcome this error?

Enclosed is the riva-speech server log file from loading the model.
riva-speech-start-logs-2411.txt (8.0 KB)

Thanks in advance,


Is there an index of these info posts?
Just FYI, I only happened to come across this thread while going through the forum, and I definitely would not have been able to find it otherwise. IMHO, the discussion and activity on GitHub seem much more organic and easier to participate in (like on NeMo's GitHub); hopefully something similar can be done to organize this info?

I would also really appreciate it if all such usage-guidance posts by employees could be listed somewhere.

Hi @arturo.rodrigues, I reached out to the team to see if we can help unblock you on this. Please stay tuned.

Hi @fernando.godino, can you elaborate on the issues you're experiencing? Any specific errors/logs or commands to reproduce the error?

Hi @arturo.rodrigues,

You have to use tokens the model has been trained on. To do this, you'll need the tokenizer model and the sentencepiece Python package (pip install sentencepiece). You can get the tokenizer model for the deployed pipeline from the CTC decoder directory of your model in the model repository. It will be named <hash>_tokenizer.model.
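For example, with the model repository layout from the guide earlier in this thread, something like this should locate it (the directory name is the one used above and may differ for your deployment):

$ ls /data/models/citrinet-ctc-decoder-cpu-streaming/1/*_tokenizer.model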

You can then generate new lexicon entries with a command like this:

$ PRONUNCIATION="b raf"; TOKEN="BRAF"; echo "$PRONUNCIATION" | \
    spm_encode --model=[hash]_tokenizer.model --output_format=nbest_piece --nbest_size=4 | \
    sed "s/^/$TOKEN\t/"

For my tokenizer model, the output is as follows. Simply append the output of the command to your lexicon file and restart the server:

BRAF    ▁b ▁ra f
BRAF    ▁b ▁ ra f
BRAF    ▁ b ▁ra f
BRAF    ▁b ▁r a f

Another example:

$ PRONUNCIATION="brah kuh two"; TOKEN="BRCA2"; echo "$PRONUNCIATION" | \
    spm_encode --model=[hash]_tokenizer.model --output_format=nbest_piece --nbest_size=4 | \
    sed "s/^/$TOKEN\t/ >> lexicon.txt"
$ tail -n 4 lexicon.txt
BRCA2   ▁bra h ▁k u h ▁two
BRCA2   ▁b ra h ▁k u h ▁two
BRCA2   ▁bra h ▁ k u h ▁two
BRCA2   ▁b r a h ▁k u h ▁two
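After appending the new entries, restart the server so the decoder reloads lexicon.txt. If you deployed with the Quick Start scripts, something like the following should work (assuming the standard riva_stop.sh/riva_start.sh scripts; adjust to your setup):

$ bash riva_stop.sh && bash riva_start.sh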

Hope this helps.
