[BUG] Riva deploy fails for models with a non-unigram (BPE) tokenizer

Please provide the following information when requesting support.

Hardware - GPU (A100/A30/T4/V100)
Hardware - CPU
Operating System Ubuntu 18.04
Riva Version 1.8.0b0, NeMo 1.6.0r0
TLT Version (if relevant)

I call nemo2riva to build a model with a non-unigram BPE tokenizer, followed by riva2rmir. Both succeed, but deploy_all_models then fails with the following error:

2022-01-13 21:39:18,634 [ERROR] Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/servicemaker/cli/deploy.py", line 100, in deploy_from_rmir
    generator.serialize_to_disk(
  File "/opt/conda/lib/python3.8/site-packages/servicemaker/triton/triton.py", line 397, in serialize_to_disk
    module.serialize_to_disk(repo_dir, rmir, config_only, verbose, overwrite)
  File "/opt/conda/lib/python3.8/site-packages/servicemaker/triton/triton.py", line 281, in serialize_to_disk
    self.update_binary(version_dir, rmir, verbose)
  File "/opt/conda/lib/python3.8/site-packages/servicemaker/triton/asr.py", line 505, in update_binary
    RivaSpeechCTCFlashlightDecoder.vocab_to_lexicon(tokenizer, vocab_file, self.config.lexicon_file)
  File "/opt/conda/lib/python3.8/site-packages/servicemaker/triton/asr.py", line 568, in vocab_to_lexicon
    enc_lines = list(map(encode_line, lines))
  File "/opt/conda/lib/python3.8/site-packages/servicemaker/triton/asr.py", line 558, in encode_line
    encoded_line = encode(line)[0]
IndexError: list index out of range

The culprit is line 531 in /opt/conda/lib/python3.8/site-packages/servicemaker/triton/asr.py, which calls return sp.NBestEncodeAsPieces(l, nbest_size). This call returns an empty list for a non-unigram tokenizer (see Google Colab). Since nbest_size is set to 1 (line 524), consider changing the line to return [sp.encode_as_pieces(l)] so that non-unigram tokenizers are supported as well.
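To illustrate why this fails, here is a minimal sentencepiece sketch; the tokenizer.model path is a placeholder for the BPE model packaged with the NeMo checkpoint, and the printed pieces are only illustrative:

import sentencepiece as spm

# Placeholder path: any SentencePiece model trained with model_type=bpe,
# e.g. the tokenizer.model exported alongside the NeMo checkpoint.
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

nbest_size = 1  # asr.py pins this to 1 around line 524
line = "hello world"

# Single-best segmentation works for unigram and BPE models alike.
print(sp.encode_as_pieces(line))                 # e.g. ['▁hello', '▁world']

# N-best segmentation is only meaningful for unigram models; with a BPE
# model it comes back empty, so the [0] in encode_line raises
# "IndexError: list index out of range".
print(sp.NBestEncodeAsPieces(line, nbest_size))  # [] for a BPE model

# Suggested replacement for line 531: wrap the single-best result in a
# list, which keeps the shape encode_line expects for unigram models and
# also covers BPE.
print([sp.encode_as_pieces(line)])               # [['▁hello', '▁world']]

Since nbest_size is pinned to 1 anyway, the wrapped single-best call preserves the list-of-lists shape that encode_line indexes into while removing the unigram-only dependency.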


Hi @ilb, thanks for reporting this issue. I’ll reach out to the team about this. Please stay tuned.
