Bug report: Riva ASR Citrinet with WPE tokenizer

Hi,
I have created a custom Citrinet model with NeMo using a WPE tokenizer on an EN dataset. The model transcribes WAV files correctly when used in the “NeMo world” in a Jupyter notebook.

After that I converted it to Riva format using the nemo2riva tool.

When I tried to build it using riva-build, I got this error:

2021-10-02 08:07:26,365 [INFO] Packing binaries for nn
2021-10-02 08:07:27,086 [INFO] Trying to extract from model test.riva
2021-10-02 08:07:27,716 [INFO] Packing binaries for lm_decoder
2021-10-02 08:07:27,716 [INFO] Trying to copy model binary from /tmp/tmpxnxemwyf/vocab.txt into rmir at /data/rmir/test.rmir.
2021-10-02 08:07:28,435 [INFO] Trying to extract from model test.riva
2021-10-02 08:07:29,075 [WARNING] Could not extract tokenizer.model to /tmp/tmpu99di62d, continuing
2021-10-02 08:07:29,085 [WARNING] Continuing with next model after failed extract
2021-10-02 08:07:29,086 [ERROR] Could not find required binary for lm_decoder at location tokenizer.model
NoneType: None
2021-10-02 08:07:29,086 [INFO] Packing binaries for rescorer
2021-10-02 08:07:29,087 [INFO] Trying to copy model binary from /tmp/tmpxnxemwyf/vocab.txt into rmir at /data/rmir/test.rmir.
2021-10-02 08:07:29,088 [INFO] Packing binaries for vad
2021-10-02 08:07:29,088 [INFO] Trying to copy model binary from /tmp/tmpxnxemwyf/vocab.txt into rmir at /data/rmir/test.rmir.
Exception ignored in: <function RMIR.__del__ at 0x7f82fb6e75e0>
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/servicemaker/rmir/rmir.py", line 49, in __del__
self._eff_ctx.__exit__(None, None, None)
File "/opt/conda/lib/python3.8/contextlib.py", line 120, in __exit__
next(self.gen)
File "", line 327, in create
File "", line 263, in save_to
File "", line 34, in crc32_file
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpw517zv1h/lm_decoder-tokenizer.model'

After some digging I found out that for a Citrinet model using the SPE tokenizer, the file lm_decoder-tokenizer.model is packed in nemo. For my WPE model only vocab.txt is found, as the log above shows.
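For anyone who wants to check this on their own model before converting: a .nemo checkpoint is a tar archive, so the tokenizer files it carries can be listed directly. This is only a minimal sketch. The function name and the suffix list are my own assumptions, based on the file names that appear in the log above (tokenizer.model and vocab.txt).

```python
import tarfile

# File-name endings that identify tokenizer artifacts (assumed from the log above).
TOKENIZER_SUFFIXES = ("tokenizer.model", "vocab.txt")


def list_tokenizer_artifacts(archive_path):
    """Return the sorted names of archive members that look like tokenizer files."""
    # "r:*" lets tarfile auto-detect the compression (plain tar or gzip).
    with tarfile.open(archive_path, "r:*") as archive:
        return sorted(
            name for name in archive.getnames()
            if name.endswith(TOKENIZER_SUFFIXES)
        )
```

Calling it on the checkpoint (for example list_tokenizer_artifacts("test.nemo"), where the file name is hypothetical) should show a tokenizer.model entry for an SPE model and only a vocab.txt entry for a WPE model.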

Then I figured out that riva-build is a Python package. After an hour of crude debug prints added in vim, I found the problem.

In the file /opt/conda/lib/python3.8/site-packages/servicemaker/pipelines/asr.py, on line 351, I had to change if self.tokenizer_model != "": to if self.tokenizer_model == "spe":
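To illustrate the idea behind that change (this is not the servicemaker code; the function and its argument are hypothetical): the SentencePiece model file should only be required when the tokenizer really is SPE, so that a WPE model with just a vocab.txt does not trigger the missing-file error.

```python
def required_tokenizer_binaries(tokenizer_type):
    """Return the tokenizer files the lm_decoder step should require.

    Hypothetical helper: "spe" is the only value treated as SentencePiece,
    mirroring the comparison used in the quick hack above.
    """
    if tokenizer_type == "spe":
        # SentencePiece models ship a binary tokenizer.model file.
        return ["tokenizer.model"]
    # Any other tokenizer (for example WPE) has no tokenizer.model to pack.
    return []
```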

After this “quick hack” it magically works: build and deploy complete without errors, Riva starts successfully, and the custom model can be called.

Is there an official place to report bugs for Riva?

Thanks for great work!
Tomas

Hi,

Could you please share the repro steps and the model with us, so that we can try it on our end?

Thank you.

FYI, there’s also another bug, which prevents deploying any model that does not use a SentencePiece tokenizer of the unigram type. In /opt/conda/lib/python3.8/site-packages/servicemaker/triton/asr.py, line 457 calls return sp.NBestEncodeAsPieces(l, nbest_size), which returns an empty list if the tokenizer is not of unigram type. See Google Colab and Google Colab.
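One possible workaround, as a rough sketch only: fall back to the single best segmentation when the n-best call comes back empty. Here sp is assumed to be a sentencepiece.SentencePieceProcessor (or anything with the same two methods); the wrapper name and the fallback behaviour are my own suggestion, not what Riva does.

```python
def nbest_pieces_with_fallback(sp, text, nbest_size):
    """Return n-best segmentations, or the single best one if n-best is empty.

    `sp` is assumed to expose NBestEncodeAsPieces and EncodeAsPieces,
    as sentencepiece.SentencePieceProcessor does.
    """
    nbest = sp.NBestEncodeAsPieces(text, nbest_size)
    if nbest:
        return nbest
    # Empty n-best result (reported above for non-unigram models):
    # wrap the 1-best segmentation so the return shape stays a list of lists.
    return [sp.EncodeAsPieces(text)]
```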

Thanks for this report. We currently only support SentencePiece encoding for subword ASR models, but have filed a bug for us to add support for other tokenizations officially.