Low-startup-time neural text-to-speech options

I’m trying to build a homebrew voice assistant and I’m having a lot of trouble with the “voice” part.
I managed to get a real-time voice cloning GitHub project running, which was cool, but it takes nearly 40 seconds to come online. For my project to be useful, I figure I have about 10 seconds from the point the board hears the wake word. As soon as a separate script detects the wake word, it can trigger the TTS model to start loading into memory, so that by the time I finish speaking my request the model is ready to synthesize the response.
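
For anyone following along, the preload idea above could be sketched roughly like this. It is only a sketch under assumptions: `load_tts_model` is a hypothetical placeholder for whatever actually loads the model (here it just sleeps), and `PreloadedTTS` and `on_wake_word` are made-up names, not part of any real library.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def load_tts_model():
    """Hypothetical stand-in for the real, slow model load."""
    time.sleep(0.2)  # pretend this is the multi-second load
    # Return a fake "model": a callable that maps text to audio.
    return lambda text: f"<audio for {text!r}>"


class PreloadedTTS:
    """Starts loading the model in the background when the wake word fires."""

    def __init__(self, loader):
        self._loader = loader
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._future = None
        self._lock = threading.Lock()

    def on_wake_word(self):
        # Kick off the load once; repeated wake words reuse the same load.
        with self._lock:
            if self._future is None:
                self._future = self._executor.submit(self._loader)

    def speak(self, text, timeout=10.0):
        # Make sure a load is in flight, then wait for it (up to the budget).
        self.on_wake_word()
        model = self._future.result(timeout=timeout)
        return model(text)
```

The wake-word script would call `on_wake_word()` immediately, and the request handler would call `speak(...)` once the request has been transcribed, so the load overlaps with the time spent talking.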

I thought the FastSpeech model in the deep learning examples git repo might be what I need, but I can’t figure out how to use it. I literally just need text > Python script > .wav file (or output straight to the speaker). All the documentation I can find covers only how to train it. It seems to be universally assumed that once you have trained the model you know what to do with it, which is frustrating.
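
To make the shape of that pipeline concrete, here is a minimal sketch of the text > script > .wav flow using only the Python standard library. The `synthesize` function is a hypothetical placeholder (it just produces a tone whose length depends on the text) and marks where a real model call would go; the 22050 Hz sample rate is also an assumption.

```python
import math
import struct
import wave

SAMPLE_RATE = 22050  # assumed; use whatever rate the real model outputs


def synthesize(text):
    """Hypothetical placeholder: returns float samples in [-1, 1].

    A real TTS model call would replace this body.
    """
    n = int(0.05 * SAMPLE_RATE) * max(len(text), 1)
    return [0.3 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(n)]


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples as 16-bit mono PCM."""
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)


def text_to_wav(text, path):
    write_wav(path, synthesize(text))
```

Calling `text_to_wav("hello", "out.wav")` would then produce a playable file; only `synthesize` needs to change once a real model is wired in.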

So I guess my question boils down to this:
Can a TensorFlow, PyTorch, or TensorRT model be used like this? That is, can it either be small enough to keep constantly loaded in RAM or be cold-started in around 10 seconds?

I know that old-fashioned text-to-speech methods may be better suited, but I bought this board for machine learning, so I’m just about determined to pull my hair out before I go crawling back to hidden Markov models.
Sorry for the slight rant. Any help is very much appreciated!


You can deploy a TensorFlow, PyTorch, or TensorRT model on the NX.
However, the performance will depend on the model’s complexity.

We do have an example that runs FastSpeech inference with TensorRT.
You can give it a try and see whether the performance is acceptable.
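
When trying it, it may help to time the one-off load and the per-request synthesis separately and compare their sum to the 10-second budget. The snippet below is a rough sketch of that measurement; `load_model` is a hypothetical placeholder that just sleeps, and `fits_budget` is a made-up helper name rather than anything from the example itself.

```python
import time


def time_call(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def load_model():
    """Hypothetical placeholder for loading the real model or engine."""
    time.sleep(0.1)
    return lambda text: text.upper()  # fake "synthesis"


def fits_budget(loader, text, budget_s=10.0):
    """Measure cold-start load time and one synthesis call against a budget."""
    model, load_s = time_call(loader)
    _, synth_s = time_call(model, text)
    return {
        "load_s": load_s,
        "synth_s": synth_s,
        "within_budget": load_s + synth_s <= budget_s,
    }
```

If the load time dominates, keeping the model resident in memory is likely the better option; if synthesis dominates, a smaller model may be needed.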