Bugreport: Riva ASR Citrinet with WPE tokenizer

tomas.lysek · October 2, 2021, 8:20am

Hi,
I have created a custom Citrinet model with NeMo using the WPE tokenizer on an English dataset. The model transcribes WAV files correctly when used within NeMo in a Jupyter notebook.

After that I converted it to Riva using the nemo2riva tool.

When I tried to build it using riva-build, it failed with this error:

2021-10-02 08:07:26,365 [INFO] Packing binaries for nn
2021-10-02 08:07:27,086 [INFO] Trying to extract from model test.riva
2021-10-02 08:07:27,716 [INFO] Packing binaries for lm_decoder
2021-10-02 08:07:27,716 [INFO] Trying to copy model binary from /tmp/tmpxnxemwyf/vocab.txt into rmir at /data/rmir/test.rmir.
2021-10-02 08:07:28,435 [INFO] Trying to extract from model test.riva
2021-10-02 08:07:29,075 [WARNING] Could not extract tokenizer.model to /tmp/tmpu99di62d, continuing
2021-10-02 08:07:29,085 [WARNING] Continuing with next model after failed extract
2021-10-02 08:07:29,086 [ERROR] Could not find required binary for lm_decoder at location tokenizer.model
NoneType: None
2021-10-02 08:07:29,086 [INFO] Packing binaries for rescorer
2021-10-02 08:07:29,087 [INFO] Trying to copy model binary from /tmp/tmpxnxemwyf/vocab.txt into rmir at /data/rmir/test.rmir.
2021-10-02 08:07:29,088 [INFO] Packing binaries for vad
2021-10-02 08:07:29,088 [INFO] Trying to copy model binary from /tmp/tmpxnxemwyf/vocab.txt into rmir at /data/rmir/test.rmir.
Exception ignored in: <function RMIR.__del__ at 0x7f82fb6e75e0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/servicemaker/rmir/rmir.py", line 49, in __del__
    self._eff_ctx.__exit__(None, None, None)
  File "/opt/conda/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "", line 327, in create
  File "", line 263, in save_to
  File "", line 34, in crc32_file
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpw517zv1h/lm_decoder-tokenizer.model'

After some digging I found out that a Citrinet model using the SPE tokenizer has the file lm_decoder-tokenizer.model packed in the NeMo model, while my WPE model does not.
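A quick way to see which tokenizer artifacts a checkpoint carries is to list the archive contents. The sketch below assumes the .nemo file is an ordinary tar archive (optionally gzip-compressed) and that tokenizer files can be recognized by the suffixes tokenizer.model and vocab.txt, as seen in the log above. The function name and the suffix list are illustrative and not part of NeMo or Riva.

```python
import tarfile

# Suffixes assumed to identify tokenizer artifacts. This is a guess based on
# the file names in the build log, not an official list.
TOKENIZER_SUFFIXES = ("tokenizer.model", "vocab.txt")


def list_tokenizer_files(archive_path):
    """Return member names in a tar archive that look like tokenizer files."""
    # "r:*" lets tarfile detect the compression (plain tar, gzip, ...) itself.
    with tarfile.open(archive_path, "r:*") as archive:
        return [
            name
            for name in archive.getnames()
            if name.endswith(TOKENIZER_SUFFIXES)
        ]
```

Calling list_tokenizer_files on an SPE checkpoint and on a WPE checkpoint should make the difference visible: under the assumptions above, only the SPE one would list a tokenizer.model entry.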

Then I figured out that riva-build is a Python package, and after an hour of adding debug prints and warnings in vim I found the problem.

In the file /opt/conda/lib/python3.8/site-packages/servicemaker/pipelines/asr.py, on line 351, I had to change if self.tokenizer_model != "": to if self.tokenizer_model == "spe":

After this quick hack it works: build and deploy complete without errors, Riva starts successfully, and the custom model can be called.
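The idea behind the hack can be sketched as follows. This is not the servicemaker code: required_decoder_files is a hypothetical helper, and the tokenizer names "spe" and "wpe" and the file names are taken from this post and the build log. It only illustrates that the SentencePiece binary should be required for SPE models and not for others.

```python
def required_decoder_files(tokenizer_type):
    """Return the files a decoder step should expect for a tokenizer type.

    Hypothetical helper for illustration only; it does not mirror the
    actual Riva implementation.
    """
    files = ["vocab.txt"]
    # tokenizer.model is the SentencePiece binary, so it is only required
    # when the model was trained with the SPE tokenizer.
    if tokenizer_type == "spe":
        files.append("tokenizer.model")
    return files
```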

Is there an official place to report bugs in Riva?

Thanks for great work!
Tomas

spolisetty · October 4, 2021, 8:56am

Hi,

Could you please share the steps to reproduce the issue and the model, so that we can try it on our end?

Thank you.

ilb · December 19, 2021, 6:38pm

FYI, there is also another bug, which prevents deploying any model that does not use a SentencePiece tokenizer of the unigram type. In /opt/conda/lib/python3.8/site-packages/servicemaker/triton/asr.py, line 457 calls return sp.NBestEncodeAsPieces(l, nbest_size), which returns an empty list if the tokenizer is not of unigram type. See Google Colab and Google Colab.
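One possible local workaround, sketched under assumptions, is to fall back to the single best segmentation when the n-best call comes back empty. NBestEncodeAsPieces and EncodeAsPieces are methods of SentencePieceProcessor; the wrapper function and the FakeBpeTokenizer stand-in are hypothetical and only exist so the example runs without a trained model. This is not how Riva handles it.

```python
class FakeBpeTokenizer:
    """Stand-in that mimics the reported behavior of a non-unigram model."""

    def NBestEncodeAsPieces(self, text, nbest_size):
        # Reported behavior: no n-best candidates for non-unigram models.
        return []

    def EncodeAsPieces(self, text):
        # Toy segmentation by whitespace, for demonstration only.
        return text.split()


def nbest_pieces_with_fallback(tokenizer, text, nbest_size):
    """Return n-best segmentations, or the 1-best one if none are produced."""
    candidates = tokenizer.NBestEncodeAsPieces(text, nbest_size)
    if candidates:
        return candidates
    # Wrap the single segmentation in a list so callers still receive a
    # list of segmentations, the same shape as the n-best result.
    return [tokenizer.EncodeAsPieces(text)]
```

With a real SentencePieceProcessor loaded from a unigram model the first branch would be taken, so the fallback only changes behavior for the failing case described above.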

rleary · December 22, 2021, 3:27pm

Thanks for this report. We currently only support SentencePiece encoding for subword ASR models, but have filed a bug for us to add support for other tokenizations officially.
