I would love to see a detailed playbook on how to train or fine-tune a TTS model with a custom voice (say, my own) on a DGX Spark. That would be a super-fun demo! Thanks!
I am not sure you need “fine-tuning” for that. Have you seen F5-TTS? You give it a 10-second audio sample and it “learns” the voice; then you give it a script and it produces an audio file of that voice reading it. It is an older project, but it impressed me when I found it, and it seems to still be maintained (last commit 2 weeks ago).
I experimented with it to play pranks on some of my friends, using their voice samples as the source voice or sending them fake personalized messages from celebrities, and was pretty impressed! You also don’t need crazy hardware to run it. IIRC, it used <10 GB VRAM while generating.
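For anyone who wants to try the workflow described above, here is a minimal sketch of what a single run could look like. It assumes F5-TTS is installed and exposes the `f5-tts_infer-cli` command with the `--model`, `--ref_audio`, `--ref_text`, and `--gen_text` flags shown in the project README; the model name `F5TTS_v1_Base`, the file name `my_voice_10s.wav`, and the helper `build_f5_tts_command` are illustrative assumptions, so check `f5-tts_infer-cli --help` on your install before relying on any of them.

```python
import shlex
import shutil
import subprocess


def build_f5_tts_command(ref_audio, ref_text, gen_text, model="F5TTS_v1_Base"):
    """Build the argv list for one voice-cloning run.

    Hypothetical helper: flag names follow the F5-TTS README and the
    default model name is an assumption that may differ per version.
    """
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,  # short sample of the target voice
        "--ref_text", ref_text,    # transcript of that sample
        "--gen_text", gen_text,    # the script to be spoken
    ]


if __name__ == "__main__":
    cmd = build_f5_tts_command(
        ref_audio="my_voice_10s.wav",  # placeholder path
        ref_text="This is a short recording of my own voice.",
        gen_text="Hello from the DGX Spark!",
    )
    # Print the command so it can be inspected or pasted into a shell.
    print(shlex.join(cmd))
    # Only run it if the CLI is actually on PATH.
    if shutil.which(cmd[0]):
        subprocess.run(cmd, check=True)
```

Keeping the command construction separate from the execution makes it easy to inspect what would be run before any model is downloaded.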
Thanks for the suggestion!
The point here is to build from scratch, not to run someone else’s canned solution.
Hey ChuckForbin,
Did you manage to deploy F5-TTS on the Spark?