[BUG] Riva 1.8.0 punctuation pipeline

ilb · January 22, 2022, 11:54pm

Has the known issue from Riva 1.7.0 beta:

The punctuation pipeline does not support unicode character input. This will be fixed in the next release.

been resolved?

I am working with Riva 1.8.0 beta and NeMo:1.6.0rc0 on a non-English ASR+punctuation pipeline. I am able to convert and deploy both the ASR and the punctuation model. However, when I run examples/transcribe_file.py, the Riva/Triton log shows:

E0122 23:35:30.347671   326 grpc_riva_asr.cc:231] ITN not supported for language: __
W0122 23:35:30.347699   326 grpc_riva_asr.cc:241] Punctuation not supported for __ language

regardless of whether I built the punctuation.rmir with --language_code __ or not.

If I instead register the ASR and punctuation models as en-US, portions of the resulting transcription are dropped whenever punctuation is enabled. Note that outside of Riva the punctuation model works as it should. The only peculiarity is that the language uses some non-ASCII (UTF-8 encoded) characters.
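To check whether the dropped portions line up with the non-ASCII characters, the two transcripts (punctuation off and punctuation on) can be compared offline. The following is only a minimal sketch using the Python standard library. It does not call Riva, the helper names (`strip_punct`, `dropped_words`, `has_non_ascii`) are hypothetical, and the sample strings are made-up placeholders rather than real pipeline output.

```python
import difflib
import unicodedata


def strip_punct(text):
    # Remove every character whose Unicode category starts with "P"
    # (punctuation) and lowercase the rest, so only word content is compared.
    kept = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    return kept.lower().split()


def dropped_words(without_punct, with_punct):
    # Words present in the unpunctuated transcript that are missing
    # (deleted or altered) in the punctuated transcript.
    a = strip_punct(without_punct)
    b = strip_punct(with_punct)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            missing.extend(a[i1:i2])
    return missing


def has_non_ascii(word):
    # True if the word contains at least one character outside ASCII.
    return any(ord(ch) > 127 for ch in word)


if __name__ == "__main__":
    # Placeholder transcripts, not real Riva output.
    plain = "dobrý den jak se máš"
    punctuated = "Dobrý den, jak se?"
    missing = dropped_words(plain, punctuated)
    print("dropped:", missing)
    print("dropped with non-ASCII:", [w for w in missing if has_non_ascii(w)])
```

If most of the dropped words contain non-ASCII characters, that would be consistent with the Unicode limitation quoted above, although it would not prove it.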

ilb · January 23, 2022, 3:25pm

In the official Riva 1.8.0 documentation, under the section “Inverse Text Normalization”, one can read: “Currently, the grammars are limited to English. In a future release, additional information on training, tuning, and loading custom grammars will be available.” Is there a roadmap for when this additional information will be available? We would love to develop ITN for our language of choice, but we’d need more information on when and how this will be supported in Riva. Even a heads-up that it is, or will be, based on NeMo/text_processing would be a good first step.