Generate Natural Sounding Speech from Text in Real-Time

Originally published at:

This post, intended for developers with a professional-level understanding of deep learning, will help you produce a production-ready AI text-to-speech model. Converting text into high-quality, natural-sounding speech in real time has been a challenging conversational AI task for decades. State-of-the-art speech synthesis models are based on parametric neural networks. Text-to-speech (TTS) synthesis is typically…

You state "Our current model synthesizes samples at 125 * 22,050 = 2,756,250, which is 125 times faster than “real-time” at 22,050 samples" — why is the RTF then not 125 instead of 1-4?

Guys, I hope you could correct the use of the term RTF (please do not mix it up with xRTF, which is 1/RTF). We do not like RTF > 1 systems, since that means they cannot run in real time. http://dictionary.sensagent...

There are two factors that influence the latency results reported here: 1) we are measuring end-to-end text-to-speech inference, i.e., the total of Tacotron2 and WaveGlow latency is reported, whereas in the quoted sentence the 125 refers to WaveGlow latency only; 2) in this article we were using the slower version of WaveGlow with 512 residual channels, while the quoted figure is for the 256-channel version.
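To make the terminology in this thread concrete, here is a minimal sketch of the RTF arithmetic being debated. The function name and the one-second wall-clock figure are illustrative assumptions, not measurements from the article; only the 22,050 Hz sample rate and the 125x figure come from the quoted sentence.

```python
SAMPLE_RATE = 22_050  # output sample rate from the article (Hz)

def rtf(wall_clock_seconds: float, samples_generated: int,
        sample_rate: int = SAMPLE_RATE) -> float:
    """Real-time factor: processing time divided by audio duration.

    RTF < 1 means the system runs faster than real time;
    xRTF = 1 / RTF expresses the same thing as a speed-up factor.
    """
    audio_seconds = samples_generated / sample_rate
    return wall_clock_seconds / audio_seconds

# A vocoder that emits 125 * 22,050 = 2,756,250 samples per second of compute
# (the WaveGlow-only figure quoted above, assuming 1 s of wall-clock time):
vocoder_rtf = rtf(1.0, 125 * SAMPLE_RATE)   # 1/125 = 0.008
vocoder_xrtf = 1.0 / vocoder_rtf            # 125x faster than real time
print(vocoder_rtf, vocoder_xrtf)
```

This also shows why the two numbers do not contradict each other: the 125x speed-up is the xRTF of the vocoder alone, while an end-to-end RTF additionally includes the Tacotron2 stage.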

This is a really good article, thank you.