Jetson-Voice tts from Dusty_NV input sentence length

Nick_H · September 9, 2021, 12:17pm

Hi, I am playing around with the TTS in the dusty-nv GitHub repo, and I was hoping there was a way to enter a sentence at a time. It seems I hit a limit at around 8 words of text: past that point, the file playback begins to sound very choppy. It seems to me that it would be optimal to have one sentence at a time fed to the WAV output. I cannot find where the 8-word limit is set. Is there a way to change this?

I need this to get through material, as I cannot sit and read due to physical limitations. What I have so far works, but the choppy breaks are annoying.

example audio output.wav

dusty_nv · September 9, 2021, 4:26pm

Hi @Nick_H, I believe the max length is defined here:

https://github.com/dusty-nv/jetson-voice/blob/1ec13b77f493d399f31c31f0af0650bcdbca8bc0/jetson_voice/models/tts/tts_engine.py#L43

Nick_H · September 9, 2021, 4:31pm

Funny, does that count spaces?

I will have a look at it. Thanks, Dusty!

dusty_nv · September 9, 2021, 4:36pm

So those units are in mel features, not in tokens/characters, because pre-processing happens before the limit applies. But yes, I believe spaces count, because the spaces are used to let the TTS know when to end words.
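Since the limit is in mel features rather than words, there is no exact word count to aim for. One possible workaround, sketched here, is to split long input at sentence and clause boundaries and synthesize each piece separately. The helper name `split_text` and the `max_words=20` default are assumptions for illustration and are not part of jetson-voice; the right chunk size would have to be found by experiment.

```python
import re

def split_text(text, max_words=20):
    """Split text into chunks of at most max_words words.

    Hypothetical helper, not part of jetson-voice. It prefers sentence
    boundaries, then clause boundaries, then a hard word-count cut.
    """
    chunks = []
    # Break after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r'(?<=[.!?])\s+', text.strip()):
        words = sentence.split()
        if not words:
            continue
        if len(words) <= max_words:
            chunks.append(sentence)
            continue
        # Sentence is too long: accumulate clauses until the budget is hit.
        current = []
        for clause in re.split(r'(?<=[,;:])\s+', sentence):
            clause_words = clause.split()
            if current and len(current) + len(clause_words) > max_words:
                chunks.append(' '.join(current))
                current = []
            current.extend(clause_words)
            # A single clause may still exceed the budget; cut it by word count.
            while len(current) > max_words:
                chunks.append(' '.join(current[:max_words]))
                current = current[max_words:]
        if current:
            chunks.append(' '.join(current))
    return chunks
```

Each chunk could then be passed to the TTS call one at a time and the resulting audio written out in order. A word budget is only a rough proxy for the number of mel features produced, so a chunk that still fails would need a smaller `max_words`.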

Nick_H · September 9, 2021, 8:05pm

Wrote audio to data/audio/tts_test/0.wav
And yet, hardly anything of what they said is true.

Run 0 -- Time to first audio: 0.369s. Generated 3.49s of audio. RTFx=9.46.
Run 1 -- Time to first audio: 0.137s. Generated 3.49s of audio. RTFx=25.43.
Run 2 -- Time to first audio: 0.138s. Generated 3.49s of audio. RTFx=25.24.
Run 3 -- Time to first audio: 0.155s. Generated 3.49s of audio. RTFx=22.60.
Run 4 -- Time to first audio: 0.138s. Generated 3.49s of audio. RTFx=25.37.
Run 5 -- Time to first audio: 0.133s. Generated 3.49s of audio. RTFx=26.30.

Wrote audio to data/audio/tts_test/1.wav
Of the many lies they told, one in particular surprised me, namely that you should be careful not to be deceived by an accomplished speaker like me.

[TensorRT] ERROR: 1: [deconv.cu::deconv_half8_explicit_gemm::234] Error Code 1: Cuda Runtime (invalid configuration argument)
Run 0 -- Time to first audio: 0.501s. Generated 8.87s of audio. RTFx=17.69.
[TensorRT] ERROR: 1: [deconv.cu::deconv_half8_explicit_gemm::234] Error Code 1: Cuda Runtime (invalid configuration argument)
Run 1 -- Time to first audio: 0.233s. Generated 8.87s of audio. RTFx=38.07.
[TensorRT] ERROR: 1: [deconv.cu::deconv_half8_explicit_gemm::234] Error Code 1: Cuda Runtime (invalid configuration argument)
Run 2 -- Time to first audio: 0.237s. Generated 8.87s of audio. RTFx=37.41.
[TensorRT] ERROR: 1: [deconv.cu::deconv_half8_explicit_gemm::234] Error Code 1: Cuda Runtime (invalid configuration argument)
Run 3 -- Time to first audio: 0.239s. Generated 8.87s of audio. RTFx=37.17.
[TensorRT] ERROR: 1: [deconv.cu::deconv_half8_explicit_gemm::234] Error Code 1: Cuda Runtime (invalid configuration argument)
Run 4 -- Time to first audio: 0.244s. Generated 8.87s of audio. RTFx=36.42.
[TensorRT] ERROR: 1: [deconv.cu::deconv_half8_explicit_gemm::234] Error Code 1: Cuda Runtime (invalid configuration argument)
Run 5 -- Time to first audio: 0.247s. Generated 8.87s of audio. RTFx=35.84.

Wrote audio to data/audio/tts_test/2.wav
That they were not ashamed to be immediately proved wrong by the facts, when I show myself not to be an accomplished speaker at all, that I thought was most shameless on their part—unless indeed they call an accomplished speaker the man who speaks the truth.

[TensorRT] ERROR: 3: [executionContext.cpp::setBindingDimensions::969] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::969, condition: profileMaxDims.d[i] >= dimensions.d[i]. Supplied binding dimension [1,80,1151] for bindings[0] exceed min ~ max range at index 2, maximum dimension in profile is 1024, minimum dimension in profile is 1, but supplied dimension is 1151.
)
Traceback (most recent call last):
  File "examples/tts.py", line 50, in <module>
    audio = tts(i)
  File "/jetson-voice/jetson_voice/models/tts/tts_engine.py", line 81, in __call__
    audio = self.vocoder.execute(mels)
  File "/jetson-voice/jetson_voice/backends/tensorrt/trt_model.py", line 114, in execute
    setup_binding(self.bindings[idx], input)
  File "/jetson-voice/jetson_voice/backends/tensorrt/trt_model.py", line 109, in setup_binding
    binding.set_shape(input.shape)
  File "/jetson-voice/jetson_voice/backends/tensorrt/trt_binding.py", line 80, in set_shape
    raise ValueError(f"failed to set binding '{self.name}' with shape {shape}")
ValueError: failed to set binding 'mels' with shape (1, 80, 1151)
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2966, GPU 7057 (MiB)

dusty_nv · September 15, 2021, 5:44pm

Did you try increasing the limits in the code I linked to above, and also deleting the *.engine file under jetson-voice/data/networks/tts/fastpitch-hifigan?
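For the second step, a small sketch of clearing the cached engine files from Python, presumably so that the engine is rebuilt with the new limits the next time the model loads. The function name `remove_cached_engines` and its `dry_run` flag are hypothetical; only the directory and the `*.engine` pattern come from the post above.

```python
from pathlib import Path

def remove_cached_engines(model_dir, dry_run=True):
    """List (and optionally delete) *.engine files directly under model_dir.

    Hypothetical helper, not part of jetson-voice. With dry_run=True nothing
    is deleted; the matching paths are only returned for inspection.
    """
    engines = sorted(Path(model_dir).glob('*.engine'))
    for path in engines:
        if not dry_run:
            path.unlink()
    return engines

if __name__ == '__main__':
    # Directory as given in the post above; adjust to the local checkout.
    found = remove_cached_engines('jetson-voice/data/networks/tts/fastpitch-hifigan')
    for path in found:
        print(f'would delete {path}')
```

Running it once with the default `dry_run=True` shows what would be removed; passing `dry_run=False` deletes the files.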
