TTS on Jarvis generates long strange sounds after ending the sentence

Hi, I trained Tacotron2 on Thai using NeMo and deployed it to Jarvis. The output from NeMo is fine, but Jarvis generates long strange sounds after the end of the sentence.


Input sentence : “ทำ ไร กัน อยู่ กิน ข้าว กิน ปลา รึ ยัง”
Pronunciation in English: “tam rai gun yu kin khaew kin pla rue yang”
This should end in two seconds.

I used Jarvis version 1.1 beta.

Another question: does Jarvis support other TTS models, such as FastSpeech or FastPitch? The documentation only shows an example for Tacotron2.

Hi,
Could you please share the NeMo model, script, and log files so we can help better?


jarvis-service-maker logs:
jarvis_service_maker.txt (5.4 KB)

jarvis-server logs:
jarvis_server.log (82.5 KB)

nemo model

test script
jarvis_tts_TEST.ipynb (154.9 KB)

This is a known bug caused by Tacotron2 not having an explicit duration model. The model has to “decide” to stop generating, and sometimes that never happens, so it keeps producing those strange sounds after it has finished the input text.
Since we cannot predict how long the sentence will be, this can happen, especially with models that were not trained long enough or were trained on small datasets.
Explicit duration model support will be added in a future release.
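
To make the stop decision above concrete, here is a minimal toy sketch in plain Python. It is not the NeMo or Jarvis implementation: the function name `frames_generated`, the 0.5 threshold, the 1000-step cap, and the probability values are all assumptions chosen for illustration. It only shows why a decoder that relies on a predicted stop probability keeps running to its step limit when that probability never crosses the threshold.

```python
def frames_generated(stop_probs, threshold=0.5, max_steps=1000):
    """Count the frames a stop-gated decoder would emit (toy model).

    stop_probs: hypothetical per-step stop probabilities from the model.
    Past the end of the list, the last value is assumed to persist.
    """
    for step in range(max_steps):
        prob = stop_probs[step] if step < len(stop_probs) else stop_probs[-1]
        if prob > threshold:
            # The model "decided" to stop at this step.
            return step + 1
    # The stop probability never crossed the threshold: run to the cap.
    return max_steps


# Illustrative values only: the stop probability jumps at the end of the text.
well_trained = [0.01] * 170 + [0.9]
# Illustrative values only: the stop probability rises but stays below 0.5.
undertrained = [0.01] * 170 + [0.4]

print(frames_generated(well_trained))   # 171
print(frames_generated(undertrained))   # 1000
```

In the second case every frame after step 171 corresponds to no input text, which is the kind of trailing noise described above. A model with an explicit duration predictor knows the output length up front and does not depend on this decision.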