Hello, I trained FastPitch from scratch on an Italian dataset (z-uo/male-LJSpeech-italian) using the attached config, then trained HiFi-GAN as the vocoder.
Problem: Generated audio quality is good for the first ~10 seconds, then noticeably degrades. For example, a 13-second generation has clean audio for the first 10 seconds, but the remaining 3 seconds sound distorted/garbled.
Test: I attached a .zip containing one good audio sample and one sample whose last 5 seconds are degraded.
I ran the training with:
python3 fastpitch.py \
  --config-name=fastpitch_align_v1.05.yaml \
  model.train_ds.dataloader_params.batch_size=32 \
  model.validation_ds.dataloader_params.batch_size=32 \
  train_dataset=./italian_tts_data/manifest_train.json \
  validation_datasets=./italian_tts_data/manifest_val.json \
  sup_data_path=./italian_tts_data/sup_data \
  exp_manager.exp_dir=./results
Has anyone experienced the same issue? Any insights on what might be causing this?
Thanks a lot
config_fastpitch.txt (5.9 KB)
test-audio.zip (1.1 MB)
Update — additional tests on RTX 4060 and RTX 4090
I ran two additional FastPitch-from-scratch trainings on the same Italian dataset, using identical training commands and almost identical configs.
Hardware / environment differences

All other configuration parameters (including pitch_mean, pitch_std, fmin, fmax) and the training command were the same.
Observed audio differences
- The 4060ti (NeMo 2.2.0) model produces more stable audio: degradation still happens after ~20 seconds (on a 28 s audio), but it is noticeably milder.
- The 4090 (NeMo 2.2.1) model produces much worse long-form degradation: the first ~15 seconds sound correct, but the last part of the generation becomes heavily distorted/garbled, and the audio is noticeably slower than the 4060ti output.
- For the same text input, the 4060ti model generates 28 s of audio, while the 4090 model produces 34 s.
- Audio from both models was synthesized using the same HiFi-GAN (tts_en_lj_hifigan_ft_mixerttsx).
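To put a number on the speed difference, here is a small sketch that compares the two reported durations for the same input text (28 s and 34 s). The function name `relative_lengthening` is only illustrative, and the calculation assumes both outputs cover the full text.

```python
def relative_lengthening(reference_s: float, other_s: float) -> float:
    """Return how much longer `other_s` is than `reference_s`, as a fraction."""
    if reference_s <= 0:
        raise ValueError("reference duration must be positive")
    return other_s / reference_s - 1.0


# Durations reported above for the same input text.
duration_4060ti_s = 28.0
duration_4090_s = 34.0

extra = relative_lengthening(duration_4060ti_s, duration_4090_s)
print(f"4090 output is {extra:.1%} longer than the 4060ti output")
# prints: 4090 output is 21.4% longer than the 4060ti output
```

So the 4090 model stretches the same text by roughly a fifth, which points at the predicted durations (not only the vocoder stage) differing between the two trainings.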
In addition:
we also trained FastPitch on the English LJSpeech dataset (the same dataset used for the English pretrained model), and that model shows a similar, small quality drop at the end of long-form generations, which suggests the issue might not be dataset-specific.
I’ve attached two new audio samples comparing the 4060ti and 4090 outputs on the same input text.
tests-4060ti-4090.zip (2.1 MB)
If anyone has seen similar behavior or has suggestions on what to check (dataset pitch stats, config simplification, NeMo version pinning, long-form TTS stability issues, etc.), any insight would be super helpful.
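As a starting point for the dataset check, here is a minimal sketch that summarizes clip durations in a training manifest. It assumes the manifest is in the usual NeMo JSON-lines layout with a `duration` field in seconds. The helper names and the 10-second threshold are illustrative only. The idea that degradation might relate to the clip lengths seen in training is just a guess to test, not a confirmed cause.

```python
import json
import statistics
from pathlib import Path


def load_durations(manifest_path):
    """Read the `duration` field (seconds) from each JSON line of a manifest."""
    durations = []
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            entry = json.loads(line)
            durations.append(float(entry["duration"]))
    return durations


def summarize(durations, threshold_s=10.0):
    """Return basic statistics, including the share of clips above threshold_s."""
    if not durations:
        raise ValueError("no durations to summarize")
    ordered = sorted(durations)
    n = len(ordered)
    return {
        "count": n,
        "mean_s": statistics.fmean(ordered),
        "p95_s": ordered[min(n - 1, int(0.95 * n))],  # simple nearest-rank estimate
        "max_s": ordered[-1],
        "share_over_threshold": sum(d > threshold_s for d in ordered) / n,
    }


if __name__ == "__main__":
    # Path taken from the training command above; adjust as needed.
    path = Path("./italian_tts_data/manifest_train.json")
    if path.exists():
        print(summarize(load_durations(path)))
```

If almost no training clips exceed ~10 seconds, it would be worth comparing that against the point where the degradation starts in each model.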