TTS R&D team looking for mentors

Our primary goal is to develop a multispeaker TTS system. Our current solution uses a transformer feature extractor with GST and is trained with the VQ-VAE objective. The vector quantization process incorporates improvements suggested in recent publications.
For the vocoder, we simply train Parallel WaveGAN (PWG) on data for both languages. So far we have been unable to achieve satisfactory results with this approach; we would be pleased if you could help us.
We also spend a lot of time and resources on the stress model. It would be great if you could help us find a way to extract stress marks directly from audio.
Best regards, AMAI.

Hi @m1132 , could you clarify what you’re doing with multi-lang TTS?
If I understand correctly, you’re trying to train a multispeaker text-to-speech system for two languages. You’re using a two-stage pipeline: one model (a transformer feature extractor with GST, trained with the VQ-VAE objective) generates a mel-spectrogram, which is then passed to a second model, a vocoder (Parallel WaveGAN), that consumes the mel-spectrogram and generates the final audio.
The vocoder is trained on spectrograms from both languages (assuming English and Russian), but the generated audio isn’t satisfactory.
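To make sure we mean the same thing by "two-stage": here is a minimal shape-level sketch of that pipeline. The models are toy stand-ins (not your actual architecture), and the dimensions (80 mel bins, hop length 256, ~10 frames per symbol) are common conventions, not values from your setup.

```python
import numpy as np

N_MELS = 80       # mel bins (assumption, a common choice)
HOP_LENGTH = 256  # audio samples per spectrogram frame (assumption)

def acoustic_model(phoneme_ids: list) -> np.ndarray:
    """Stage 1: text symbols -> mel-spectrogram (toy: 10 frames per symbol)."""
    n_frames = 10 * len(phoneme_ids)
    return np.random.randn(n_frames, N_MELS)

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Stage 2: mel-spectrogram -> waveform (one hop of audio per frame)."""
    n_frames = mel.shape[0]
    return np.random.randn(n_frames * HOP_LENGTH)

mel = acoustic_model([12, 7, 31, 5])  # 4 symbols -> 40 frames, shape (40, 80)
audio = vocoder(mel)                  # 40 * 256 = 10240 samples
print(mel.shape, audio.shape)
```

The point of the sketch is just the interface between the stages: the vocoder only ever sees mel frames, so multilingual behaviour enters stage 2 purely through the spectrogram distribution.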

Regarding the stress model: you’re trying to train a model to embed the prosody (voice changes due to intonation, emotion, rhythm, etc.) of the training audio and use it to condition the generated spectrogram, but you’re having trouble extracting the stresses from the training audio files.

Is that a correct summary?

Hello. So far we have only tried to train the vocoder on multilingual data. Since MelGAN training is adversarial, it is unstable, and it seems the discriminator outperformed the generator.
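One common remedy when the discriminator overpowers the generator in GAN vocoders is to train the generator alone first (reconstruction/STFT loss only) and switch on the adversarial term after a warm-up. The sketch below shows only that scheduling idea; the step count and loss weight are illustrative, not tuned values, and this is not a claim about your exact training recipe.

```python
# Warm-up schedule for the generator loss (all numbers are assumptions):
DISC_START_STEP = 100_000  # generator-only warm-up length
LAMBDA_ADV = 4.0           # weight of the adversarial term once enabled

def generator_loss(recon_loss: float, adv_loss: float, step: int) -> float:
    """Reconstruction loss always; adversarial term only after warm-up."""
    if step < DISC_START_STEP:
        return recon_loss
    return recon_loss + LAMBDA_ADV * adv_loss

print(generator_loss(1.0, 0.5, step=50_000))   # during warm-up: 1.0
print(generator_loss(1.0, 0.5, step=150_000))  # after warm-up: 3.0
```

Delaying the discriminator gives the generator a head start, so by the time the adversarial game begins its outputs are already close enough that the discriminator cannot trivially win.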
Currently, all our TTS models are single-speaker (we will try two speakers, Russian only, in the near future), because the model would otherwise tie the speaker’s acoustic features to the text symbols.
Usually, people use phonemization, but we don’t have a good enough phonemizer at hand.

This is a one-stage pipeline: GST is an acoustic encoder, and its outputs are added to the text-encoder features and fed to the decoder, as in the GST paper.
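The conditioning step you describe can be sketched in a few lines: the GST module yields one style vector per utterance, which is broadcast-added to every text-encoder timestep before the decoder sees it. Dimensions here are illustrative, not your model’s.

```python
import numpy as np

T_TEXT, D_MODEL = 50, 256  # sequence length and model width (assumptions)

text_features = np.random.randn(T_TEXT, D_MODEL)  # text-encoder output
style_embedding = np.random.randn(D_MODEL)        # GST output, one per utterance

# Broadcast-add the utterance-level style vector over the time axis,
# then this sum is what the decoder attends over.
decoder_input = text_features + style_embedding
print(decoder_input.shape)  # (50, 256)
```

Because the style vector is time-invariant, it can only carry utterance-level prosody; that is consistent with your point below that word-level stress needs a different mechanism.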

Stress is a very big deal in Russian.
Any vowel might be stressed, there is usually only one stress per word, and placement depends on linguistic context.
For now, we have an NLP model to place stress at inference time and a manually annotated training dataset.
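To illustrate why a lexicon alone is not enough and a context-aware NLP model is needed: Russian homographs like «замок» take different stress depending on meaning. Stress is marked here with "+" before the stressed vowel, one common convention in Russian TTS corpora; the toy lexicon is of course hypothetical.

```python
# Toy stress lexicon: a word may have several valid stress variants,
# and only context can pick the right one.
LEXICON = {
    "замок": ["з+амок",   # 'castle' -- stress on the first vowel
              "зам+ок"],  # 'lock'   -- stress on the second vowel
    "дом": ["д+ом"],      # unambiguous: a single entry
}

def stress_candidates(word: str) -> list:
    """Return all stress variants; more than one means context is needed."""
    return LEXICON.get(word, [word])  # unknown words are left unmarked

print(stress_candidates("дом"))    # one variant, no disambiguation needed
print(stress_candidates("замок"))  # two variants -> needs sentence context
```

This is exactly the gap an audio-based extractor could fill: the spoken realisation already disambiguates the stress, so recovering it from the waveform would label the training data without manual annotation.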
Intuitively, this information should be present in the audio data.
So yes, it is prosody control, but not in the way it would work in English.
My first attempt to build a model for stress extraction was unsuccessful.