TTS Input text too long

nharo · June 6, 2023, 10:01pm

Hardware - GPU (T4)
Hardware - CPU
Operating System Ubuntu 20.04
Riva Version 2.10

I deployed the Spanish TTS models from NeMo to Riva through a Docker container. When I synthesize audio from a very long text, I get the following error in the Docker logs:

E0606 21:56:38.335672 90 tts-preprocessor.cc:96] preprocessor had an errorSSML Input does not currently support split on sentence. Input text too long.'<speak version="1.0"><prosody rate="90%">Le estoy llamando de empresa, por encargo de empresa, por su tarjeta, para su seguridad, esta conversación podría ser grabada. paulina, soy su ejecutiva virtual, le informo que usted, se encuentra en una campaña, de descuento del 90 por ciento, hasta el 31 de mayo de 2023, su deuda al día de hoy es </prosody><prosody rate="low">893914 pesos,</prosody><prosody rate="90%"> y podrá pagar el total solo por </prosody><prosody rate="low">89391.</prosody><prosody rate="90%"> ¿Podría regularizar este pago, a mas tardar hoy?</prosody></speak>
E0606 21:56:38.335867 90 backend_triton_api.cc:111] Model 'tts_preprocessor-tts_spanish', instance: 'tts_preprocessor-tts_spanish_0': failed executing 1 request(s) on device 0
E0606 21:56:38.336117 1468431 libriva_tts.cc:367] error: Error occurred on Triton server during inference: in ensemble 'fastpitch_hifigan_ensemble-tts_spanish', FAILURE

I’m using this notebook code:

import numpy as np
import IPython.display as ipd
import riva.client

auth = riva.client.Auth(uri='localhost:50051')
riva_tts = riva.client.SpeechSynthesisService(auth)

sample_rate_hz = 44100
req = {
    "language_code": "en-US",
    "encoding": riva.client.AudioEncoding.LINEAR_PCM,  # Currently only LINEAR_PCM is supported
    "sample_rate_hz": sample_rate_hz,  # Generate 44.1 kHz audio
    "voice_name": "tts_spanish",  # The name of the voice to generate
}

req["text"] = """'<speak version="1.0"><prosody rate="90%">Le estoy llamando de empresa, por encargo de empresa, por su tarjeta, para su seguridad, esta conversación podría ser grabada. paulina, soy su ejecutiva virtual, le informo que usted, se encuentra en una campaña, de descuento del 90 por ciento, hasta el 31 de mayo de 2023, su deuda al día de hoy es </prosody><prosody rate="low">893914 pesos,</prosody><prosody rate="90%"> y podrá pagar el total solo por </prosody><prosody rate="low">89391.</prosody><prosody rate="90%"> ¿Podría regularizar este pago, a mas tardar hoy?</prosody></speak>""" 

resp = riva_tts.synthesize(**req)
audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.Audio(audio_samples, rate=sample_rate_hz)

I have also used online synthesis with this code:

import numpy as np
import IPython.display as ipd
import riva.client


auth = riva.client.Auth(uri='localhost:50051')
riva_tts = riva.client.SpeechSynthesisService(auth)

sample_rate_hz = 44100
req = {
    "language_code": "en-US",
    "encoding": riva.client.AudioEncoding.LINEAR_PCM,  # Currently only LINEAR_PCM is supported
    "sample_rate_hz": sample_rate_hz,  # Generate 44.1 kHz audio
    "voice_name": "tts_spanish",  # The name of the voice to generate
}


req["text"] = """<speak version="1.0"><prosody rate="90%">se encuentra en una campaña, de descuento del 90 por ciento, hasta el 31 de mayo de 2023, su deuda al día de hoy es </prosody><prosody rate="low">893914 pesos,</prosody><prosody rate="90%"> y podrá pagar el total solo por </prosody><prosody rate="low">89391.</prosody><prosody rate="90%"> ¿Podría regularizar este pago, a mas tardar hoy?</prosody></speak>"""
resp = riva_tts.synthesize_online(**req)
empty = np.array([])
for i, rep in enumerate(resp):
    audio_samples = np.frombuffer(rep.audio, dtype=np.int16) / (2**15)  # scale int16 PCM to [-1.0, 1.0)
    print("Chunk: ",i)
    ipd.display(ipd.Audio(audio_samples, rate=44100))
    empty = np.concatenate((empty, audio_samples))

print("Final synthesis:")
ipd.display(ipd.Audio(empty, rate=44100))

With online synthesis I got this error:

_MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "Error: Triton model failed during inference. Error message: Streaming timed out"
	debug_error_string = "UNKNOWN:Error received from peer ipv4:localhost:50051 {created_time:"2023-06-06T17:19:36.758526327-05:00", grpc_status:2, grpc_message:"Error: Triton model failed during inference. Error message: Streaming timed out"}"

How can I continue using SSML tags and make TTS capable of supporting long texts?
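
The preprocessor error above says that SSML input is not split on sentences on the server side, so one possible client-side workaround is to cut the SSML into several shorter `<speak>` documents and synthesize them one by one. The sketch below is untested against a Riva server and is not confirmed anywhere in this thread. `split_ssml` and `max_chars` are hypothetical names, the 200-character budget is an arbitrary placeholder rather than a documented Riva limit, and the code assumes a flat `<speak>` root whose children carry only text, as in the requests above.

```python
import re
import xml.etree.ElementTree as ET
from typing import List


def split_ssml(ssml: str, max_chars: int = 200) -> List[str]:
    """Split a flat <speak><prosody>text</prosody>...</speak> document into
    several smaller <speak> documents (hypothetical helper, not a Riva API).

    max_chars is an assumed per-request text budget, not a documented limit.
    """
    root = ET.fromstring(ssml)

    # Collect (tag, attributes, sentence) fragments, one per sentence.
    pieces = []
    for child in root:
        text = (child.text or "").strip()
        # Break after sentence-ending punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if sentence:
                pieces.append((child.tag, dict(child.attrib), sentence))

    # Greedily pack consecutive fragments until the budget would be exceeded.
    chunks, current, length = [], [], 0
    for piece in pieces:
        if current and length + len(piece[2]) > max_chars:
            chunks.append(current)
            current, length = [], 0
        current.append(piece)
        length += len(piece[2])
    if current:
        chunks.append(current)

    # Rebuild one well-formed <speak> document per chunk, keeping attributes.
    documents = []
    for chunk in chunks:
        speak = ET.Element(root.tag, dict(root.attrib))
        for tag, attrib, sentence in chunk:
            element = ET.SubElement(speak, tag, attrib)
            element.text = sentence + " "
        documents.append(ET.tostring(speak, encoding="unicode"))
    return documents
```

Each returned document could then be assigned to `req["text"]` and passed to `riva_tts.synthesize(**req)` as in the notebook code above, with the resulting int16 buffers joined using `np.concatenate`. A single sentence longer than `max_chars` is still emitted whole, so it would need further manual splitting.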

rvinobha · June 9, 2023, 7:36am

Hi @nharo,

Thanks for your interest in Riva.

I will try to reproduce the issue internally at NVIDIA.

Could you please share the NGC link of the NeMo TTS model you used?

Thanks

nharo · June 9, 2023, 3:01pm

Sure, this is the link to the model: TTS Es Multispeaker FastPitch HiFiGAN | NVIDIA NGC