Riva TTS in A2F headless mode

Hello. I was wondering whether it is possible, or at least planned, to call Riva TTS in headless mode.
I successfully set up a Riva TTS server and I can generate audio from text inside A2F using the extension, but:

  1. The extension settings are reset every time I go back into the scene (only the extension is automatically reloaded). Is there any way to avoid losing the settings every time?

  2. There seems to be no REST service to simply pass a string to Riva TTS inside A2F. This would be very nice to have, as it would activate the synthesis/streaming pipeline down to a LiveLink application (I am using Unreal).

Thank you!

Hi @antori

There are no plans to update the REST API, but you could add your own extension, using the existing Audio2Face REST API routes in omni.audio2face.core.scripts.routes and omni.audio2face.exporter.scripts.routes as a reference.
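To illustrate the general shape of such a "pass a string to TTS" endpoint, here is a minimal sketch using only the Python standard library. It is not the Audio2Face REST API and does not use the Omniverse services framework: the `/tts` route, the `make_handler` and `post_text` helpers, and the `on_text` callback are all hypothetical names. In a real extension, `on_text` would be the place to trigger the Riva TTS request, and the route would more likely be registered the way the routes modules mentioned above do it.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(on_text):
    """Build a handler class that forwards posted text to `on_text` (hypothetical callback)."""

    class TTSHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/tts":  # hypothetical route name
                self.send_error(404, "unknown route")
                return
            length = int(self.headers.get("Content-Length", 0))
            try:
                payload = json.loads(self.rfile.read(length) or b"{}")
                text = payload["text"]
            except (ValueError, KeyError, TypeError):
                self.send_error(400, "expected a JSON body with a 'text' field")
                return
            if not isinstance(text, str) or not text.strip():
                self.send_error(400, "'text' must be a non-empty string")
                return
            on_text(text)  # a real extension would start synthesis here
            body = json.dumps({"status": "OK"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the console quiet

    return TTSHandler


def post_text(host, port, text):
    """Client side: POST a string to /tts and return (status code, decoded JSON reply)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("POST", "/tts", body=json.dumps({"text": text}),
                     headers={"Content-Type": "application/json"})
        response = conn.getresponse()
        return response.status, json.loads(response.read())
    finally:
        conn.close()
```

A caller would start the server with `HTTPServer(("127.0.0.1", port), make_handler(callback))` on a background thread and then send text with `post_text`. Whether the callback may touch the A2F scene directly from that thread is something to verify against the Kit documentation.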

You could also update the extension so that it keeps its settings, or even have it load your custom settings whenever it is loaded.
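One simple way to keep settings across reloads is to write them to a small JSON file and read them back when the extension starts. The sketch below shows only that idea in plain Python. The setting keys, the default values, and the `load_settings` / `save_settings` helpers are hypothetical and not part of the Riva TTS extension. Inside Kit you might prefer its own settings mechanism instead.

```python
import json
from pathlib import Path

# Hypothetical setting names and placeholder defaults, for illustration only.
DEFAULTS = {"server_url": "localhost:50051", "voice": "", "language": "en-US"}


def load_settings(path, defaults=DEFAULTS):
    """Return the defaults overlaid with whatever was stored in `path`, if readable."""
    settings = dict(defaults)
    try:
        stored = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return settings  # missing or corrupt file: fall back to defaults
    if isinstance(stored, dict):
        # Ignore unknown keys so stale files cannot inject unexpected settings.
        settings.update({k: v for k, v in stored.items() if k in defaults})
    return settings


def save_settings(path, settings):
    """Write the settings dictionary to `path` as JSON, creating folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2), encoding="utf-8")
```

The extension would call `load_settings` in its startup code and `save_settings` whenever the user changes a field, so a reload picks up the last values instead of the defaults.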

Hi, @Ehsan.HM.
Thank you for your feedback! I will look into these.

Hi Antori,

Could you share your observations on the quality of the speech synthesis, for example in comparison to ElevenLabs? Can SSML be used? I am planning to implement offline speech synthesis based on Riva as my next step. Have you also tested STT? How good is the speech recognition? I would be very grateful for any information on these topics :)

Best regards
Chris

Hi, @chris508.
I am used to Microsoft Azure TTS, which I find quite nice among the ones I have tried. The standard NVIDIA Riva voice for English is OK (the prosody is not always great, though), but for my research I am interested in building systems that rely on third-party online services as little as possible, so quality is secondary for now (I can always train my own set of voices with NVIDIA).

I did not test NVIDIA ASR because I am using Azure for that, since it can perform ASR and NLU as an integrated step that responds quickly. I will try NVIDIA Riva for that after finishing with TTS, which is my priority right now, together with the Audio2Face and Audio2Gesture features. In any case, I would not expect NVIDIA ASR to be bad, since ASR has become relatively easy to do.

NVIDIA Riva does support SSML. Documentation for that is here:

https://docs.nvidia.com/deeplearning/riva/archives/2-2-0/user-guide/docs/tutorials/tts-python-advanced-customizationwithssml.html
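As a rough idea of what SSML input looks like, here is a small sketch that wraps text in `<speak>` and an optional `<prosody>` element. It only builds the string and does not call Riva. The `build_ssml` helper is a hypothetical name, and which attribute values are accepted depends on the Riva version, so please check the page above.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, rate=None, pitch=None):
    """Wrap `text` in <speak>, adding <prosody> only when rate or pitch is given."""
    attrs = ""
    if rate is not None:
        attrs += f" rate={quoteattr(str(rate))}"
    if pitch is not None:
        attrs += f" pitch={quoteattr(str(pitch))}"
    body = escape(text)  # escape &, < and > so the markup stays well-formed
    if attrs:
        body = f"<prosody{attrs}>{body}</prosody>"
    return f"<speak>{body}</speak>"


print(build_ssml("Hello & welcome", rate="80%"))
# prints: <speak><prosody rate="80%">Hello &amp; welcome</prosody></speak>
```

The resulting string would then be passed as the text of the synthesis request in place of plain text.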

Best regards,
Antonio