I am trying to deploy my ASR system in Riva and am facing an issue deploying a custom KenLM language model. Steps followed:
NeMo model training:
Model: citrinet_512
Tokenizer: subword (BPE, SentencePiece)
Language model: KenLM 3-gram
Language: Filipino
Issue: NeMo provides a script (/NeMo/scripts/asr_language_modeling/ngram_lm/train_kenlm.py) to train a KenLM language model, taking a manifest as input. This script appears to convert the training manifest into a character-level format using an offset and the model's tokenizer: each subword token id is mapped to a single Unicode character, and the language model is then trained on those characters rather than on words.
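For reference, my understanding of the encoding that train_kenlm.py applies can be sketched as follows (the offset value of 100 is an assumption based on the default token offset in NeMo's n-gram LM scripts; the function names here are mine, not NeMo's):

```python
# Sketch of the character-level encoding used when training the KenLM model.
# TOKEN_OFFSET = 100 is an assumption; it shifts token ids into a printable
# Unicode range so each subword id becomes one character.
TOKEN_OFFSET = 100

def encode_token_ids(token_ids):
    """Map each subword token id to a single Unicode character."""
    return "".join(chr(token_id + TOKEN_OFFSET) for token_id in token_ids)

def decode_chars(chars):
    """Inverse mapping: characters back to subword token ids."""
    return [ord(ch) - TOKEN_OFFSET for ch in chars]

# Round-trip example: ids survive encode/decode unchanged.
ids = [17, 256, 3]
assert decode_chars(encode_token_ids(ids)) == ids
```

So the n-gram LM never sees real words, only these offset characters, which is why the decoder also needs a vocab in the same encoding.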
Issue in Riva: I passed the NeMo model, the language model binary, and the decoding vocab to the riva-build command, but I am getting empty transcripts during inference.
How is Riva going to know that the language model was trained at the character level, and how will Riva decode with it?
Decoding vocab analysis: Since the language model is trained at the character level, I tokenized my corpus with the same subword tokenizer used in the NeMo model, then converted each word into its character-level form: every token id is shifted by an offset of 100 and mapped to a character via chr(token_id + offset).
```python
unique_vocabs = set()
offset = 100  # must match the offset used by train_kenlm.py

with open(corpus) as crp:
    for line in crp:
        for word in line.split(" "):
            # Encode the word's subword token ids as single characters.
            curr_vocab = ""
            for token_id in tokenizer.text_to_ids(word):
                curr_vocab += chr(token_id + offset)
            unique_vocabs.add(curr_vocab)

with open(decoder_vocab_path, "w") as vcb:
    for entry in unique_vocabs:
        vcb.write(entry)
        vcb.write("\n")
```
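As a sanity check on the vocab written above, each line should decode back to the original subword token ids when the same offset is applied in reverse (the helper below is a hypothetical check I would run, assuming the offset of 100 matches the one used during LM training):

```python
# Hypothetical round-trip check: decode an encoded vocab entry back to
# subword token ids to verify the mapping is invertible.
offset = 100  # must match the offset used when writing the vocab

def vocab_line_to_ids(line):
    """Decode one vocab-file line (encoded characters) back to token ids."""
    return [ord(ch) - offset for ch in line.rstrip("\n")]

# Example: a word encoded from ids [12, 34] round-trips correctly.
encoded = "".join(chr(i + offset) for i in [12, 34])
assert vocab_line_to_ids(encoded + "\n") == [12, 34]
```

If the decoded ids do not map back to valid tokens in the model's tokenizer, the vocab and the LM are using different encodings, which could explain empty transcripts.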
I passed decoder_vocab_path to the riva-build command. Sample vocab →
vocab.txt (111.6 KB)
Is this the right way to build a decoding vocab for a character-level language model so it can be deployed in Riva?
Any help would be appreciated.