Tokenizer Issues in Riva with the TTS Es Multispeaker FastPitch HiFiGAN Model

Hello,

I’m integrating a Latin American Spanish text-to-speech (TTS) model into NVIDIA Riva, using the tts_es_multispeaker_fastpitchhifigan model from NVIDIA NGC. I converted the model to .riva files using nemo2riva.

I also downloaded the tokenizer and verbalizer from Inverse Normalization ES-US, then used the Docker servicemaker image to build and deploy the model with the following commands (note language_code es-US):

docker run --init -it --rm --gpus '"device=0"' -v $(pwd):/data -v riva-model-repo:/data-volumen -e "MODEL_DEPLOY_KEY=tlt_encode" --name riva-service-maker nvcr.io/nvidia/riva/riva-speech:2.13.0-servicemaker
riva-build speech_synthesis  tts_es_hifigan_ft_fastpitch_multispeaker.rmir:tlt_encode tts_es_fastpitch_multispeaker.riva:tlt_encode  tts_es_hifigan_ft_fastpitch_multispeaker.riva:tlt_encode  --voice_name Latin-American-Spanish --wfst_tokenizer_model=tokenize_and_classify.far --wfst_verbalizer_model=verbalize.far --sample_rate 44100 --language_code es-US --num_speakers=174 --phone_set=ipa   --subvoices 0:0,1:1,2:2,3:3,4:4,5:5,6:6,7:7,8:8,9:9,10:10,11:11,12:12,13:13,14:14,15:15,16:16,17:17,18:18,19:19,20:20,21:21,22:22,23:23,24:24,25:25,26:26,27:27,28:28,29:29,30:30,31:31,32:32,33:33,34:34,35:35,36:36,37:37,38:38,39:39,40:40,41:41,42:42,43:43,44:44,45:45,46:46,47:47,48:48,49:49,50:50,51:51,52:52,53:53,54:54,55:55,56:56,57:57,58:58,59:59,60:60,61:61,62:62,63:63,64:64,65:65,66:66,67:67,68:68,69:69,70:70,71:71,72:72,73:73,74:74,75:75,76:76,77:77,78:78,79:79,80:80,81:81,82:82,83:83,84:84,85:85,86:86,87:87,88:88,89:89,90:90,91:91,92:92,93:93,94:94,95:95,96:96,97:97,98:98,99:99,100:100,101:101,102:102,103:103,104:104,105:105,106:106,107:107,108:108,109:109,110:110,111:111,112:112,113:113,114:114,115:115,116:116,117:117,118:118,119:119,120:120,121:121,122:122,123:123,124:124,125:125,126:126,127:127,128:128,129:129,130:130,131:131,132:132,133:133,134:134,135:135,136:136,137:137,138:138,139:139,140:140,141:141,142:142,143:143,144:144,145:145,146:146,147:147,148:148,149:149,150:150,151:151,152:152,153:153,154:154,155:155,156:156,157:157,158:158,159:159,160:160,161:161,162:162,163:163,164:164,165:165,166:166,167:167,168:168,169:169,170:170,171:171,172:172,173:173
riva-deploy -f tts_es_hifigan_ft_fastpitch_multispeaker.rmir:tlt_encode /data/models
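
The --subvoices value in the riva-build command above maps each name to the same speaker ID, from 0:0 through 173:173. As a minimal sketch, assuming that identity mapping is what is wanted, the list can be generated rather than typed by hand. The variable names num_speakers and subvoices are placeholders of my own, not Riva options.

```shell
# Hypothetical helper variables (not part of Riva): build "0:0,1:1,...,173:173".
num_speakers=174

# seq emits 0..173; awk joins the "id:id" pairs with commas and no trailing comma.
subvoices=$(seq 0 $((num_speakers - 1)) | awk '{ printf "%s%d:%d", (NR > 1 ? "," : ""), $1, $1 }')

echo "$subvoices"
```

The result can then be passed as --subvoices "$subvoices" in place of the literal list, which keeps the command readable and keeps the list in step with --num_speakers.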

After deploying the model, I’ve encountered a significant issue: the tokenizer fails on text that requires normalization, such as dates, numbers, and words that are enclosed in quotes or written in camel case. Instead of processing and pronouncing these elements correctly, it skips them. This is in stark contrast to the default models in riva_quickstart_v2.13.0, where the es-ES model handles the same inputs without any noticeable issues.

My questions are as follows:

  1. What could be causing this issue with the tokenizer?
  2. Am I making any mistakes in the configuration or implementation process?
  3. Does the tts_es_multispeaker_fastpitchhifigan model not support expressions like numbers or dates?
  4. Where can I find the tokenizer and verbalizer used by the Spain Spanish (es-ES) model, which seems to work correctly?

Any guidance or suggestions to resolve this issue would be greatly appreciated.

Best regards.