FastPitch trained from scratch: audio quality degrades after ~10 seconds

Hello, I trained FastPitch from scratch on an Italian dataset (z-uo/male-LJSpeech-italian) using the attached config, then trained HiFi-GAN as the vocoder.

Problem: Generated audio quality is good for the first ~10 seconds, then noticeably degrades. For example, a 13-second generation has clean audio for the first 10 seconds, but the remaining 3 seconds sound distorted/garbled.

Test: I attached a .zip containing one good audio sample and one sample whose last 5 seconds are degraded.

I ran the training with:

python3 fastpitch.py \
    --config-name=fastpitch_align_v1.05.yaml \
    model.train_ds.dataloader_params.batch_size=32 \
    model.validation_ds.dataloader_params.batch_size=32 \
    train_dataset=./italian_tts_data/manifest_train.json \
    validation_datasets=./italian_tts_data/manifest_val.json \
    sup_data_path=./italian_tts_data/sup_data \
    exp_manager.exp_dir=./results

Has anyone experienced the same issue? Any insights on what might be causing this?

Thanks a lot

config_fastpitch.txt (5.9 KB)

test-audio.zip (1.1 MB)

Update — additional tests on RTX 4060 and RTX 4090

I ran two additional FastPitch-from-scratch trainings on the same Italian dataset, using identical training commands and almost identical configs.

Hardware / environment differences

[Image: hardware / environment comparison of the two training runs]

All other configuration parameters (including pitch_mean, pitch_std, fmin, fmax) and the training command were the same.

Observed audio differences

  • The 4060ti (NeMo 2.2.0) model produces more stable audio:
    degradation still appears after ~20 seconds (on a 28 s audio), but it is noticeably milder.

  • The 4090 (NeMo 2.2.1) model produces much worse long-form degradation:
    the first ~15 seconds sound correct, but the last part of the generation becomes heavily distorted/garbled, and the speech is noticeably slower than the 4060 output.

  • For the same input text, the 4060 model generates 28 s of audio, while the 4090 model generates 34 s.

  • Audio for both models was synthesized with the same HiFi-GAN vocoder (tts_en_lj_hifigan_ft_mixerttsx).
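
To put a number on the speed difference, here is a minimal sketch that measures WAV durations and the relative slowdown. The function names are hypothetical, and it assumes the samples are PCM WAV files, since the standard-library wave module does not read float-encoded WAVs. Because both outputs went through the same vocoder, a longer output would be consistent with the 4090 model producing a longer spectrogram for the same text, though that is only an assumption to verify.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def relative_slowdown(reference_seconds, other_seconds):
    """Fractional increase in duration of `other` relative to `reference`."""
    return other_seconds / reference_seconds - 1.0


# Durations reported above for the same input text: 28 s (4060) vs 34 s (4090).
print(f"{relative_slowdown(28.0, 34.0):.1%} longer")  # prints "21.4% longer"
```

On the reported numbers, the 4090 output is about 21% longer than the 4060 output for identical text.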

In addition:
we also trained FastPitch on the English LJSpeech dataset (the same EN dataset used for the EN pretrained model), and that model shows a similar, though small, quality drop at the end of long-form generations, which suggests this might not be dataset-specific.

I’ve attached two new audio samples comparing the 4060ti and 4090 outputs on the same input text.

tests-4060ti-4090.zip (2.1 MB)

If anyone has seen similar behavior or has suggestions on what to check (dataset pitch stats, config simplification, NeMo version pinning, long-form TTS stability issues, etc.), any insight would be super helpful.
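
One dataset check that could be run alongside the pitch stats is how long the training utterances actually are, since the degradation starts at a roughly fixed time offset. The following is a minimal sketch, assuming the manifest is in the usual NeMo JSON-lines layout with a `duration` field in seconds. The function names are hypothetical, and the 10-second threshold is just the point where degradation was first heard.

```python
import json
import statistics


def load_durations(manifest_path):
    """Read per-utterance durations (seconds) from a JSON-lines manifest."""
    durations = []
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            line = line.strip()
            if line:  # skip blank lines
                durations.append(float(json.loads(line)["duration"]))
    return durations


def summarize_durations(durations, threshold_seconds=10.0):
    """Summarize utterance lengths and the share above a threshold."""
    longer = sum(1 for d in durations if d > threshold_seconds)
    return {
        "count": len(durations),
        "max_seconds": max(durations),
        "median_seconds": statistics.median(durations),
        "fraction_longer_than_threshold": longer / len(durations),
    }


# Example usage with the training manifest from the command above:
# stats = summarize_durations(
#     load_durations("./italian_tts_data/manifest_train.json"))
# print(stats)
```

If almost no training utterances exceed the point where the audio starts to degrade, the model may simply never have seen sequences that long during training. This is a hypothesis to test, not a confirmed cause.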