Host System: Windows 10 version 21H2
Nvidia Driver on Host system: 522.25 Studio Version
Video card: GeForce RTX 4090
CUDA Toolkit in WSL2: cuda-repo-wsl-ubuntu-11-8-local_11.8.0-1_amd64.deb
PyTorch versions tested: latest stable (1.12.1) for CUDA 11.6, nightly for CUDA 11.7
Python version: 3.8.10
WSL2 Guest: Ubuntu 20.04 LTS
WSL2 Guest Kernel Version: 5.10.102.1-microsoft-standard-WSL2
Affected CUDA component: cuFFT
I’m executing the VITS model training code from GitHub - coqui-ai/TTS (🐸💬, a deep learning toolkit for Text-to-Speech, battle-tested in research and production) without any code edits. Dev branch, commit dae79b0acd3cd316016078c40a1cc553ffb9405e
This worked flawlessly until I swapped my video card from a GeForce RTX 3090 to a 4090 yesterday.
Now I am running into a bug(?) in cuFFT:
/home/localuser/coquiTTS/.VENV/lib/python3.8/site-packages/torch/functional.py:632: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:801.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
! Run is removed from DATASETS/CodexNarrator/output/CodexNarrator_vits-October-14-2022_10+50PM-dae79b0a
Traceback (most recent call last):
File "/home/localuser/coquiTTS/.VENV/lib/python3.8/site-packages/trainer/trainer.py", line 1533, in fit
self._fit()
File "/home/localuser/coquiTTS/.VENV/lib/python3.8/site-packages/trainer/trainer.py", line 1517, in _fit
self.train_epoch()
File "/home/localuser/coquiTTS/.VENV/lib/python3.8/site-packages/trainer/trainer.py", line 1282, in train_epoch
_, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
File "/home/localuser/coquiTTS/.VENV/lib/python3.8/site-packages/trainer/trainer.py", line 1099, in train_step
batch = self.format_batch(batch)
File "/home/localuser/coquiTTS/.VENV/lib/python3.8/site-packages/trainer/trainer.py", line 910, in format_batch
batch = self.model.format_batch_on_device(batch)
File "/home/localuser/coquiTTS/TTS/TTS/tts/models/vits.py", line 1505, in format_batch_on_device
batch["spec"] = wav_to_spec(wav, ac.fft_size, ac.hop_length, ac.win_length, center=False)
File "/home/localuser/coquiTTS/TTS/TTS/tts/models/vits.py", line 123, in wav_to_spec
spec = torch.stft(
File "/home/localuser/coquiTTS/.VENV/lib/python3.8/site-packages/torch/functional.py", line 632, in stft
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR
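For reference, this is roughly what I believe a minimal reproduction looks like once the Coqui code is stripped away. The concrete parameter values below are assumptions on my part (typical VITS audio settings), not the exact values from my config, which come from ac.fft_size, ac.hop_length and ac.win_length:

import torch

# Assumed values; the real ones come from the audio config of the recipe
n_fft, hop_length, win_length = 1024, 256, 1024

wav = torch.randn(1, 32768, device="cuda")               # dummy mono waveform batch
window = torch.hann_window(win_length, device="cuda")

# This is the call that fails with CUFFT_INTERNAL_ERROR on the 4090
spec = torch.stft(
    wav,
    n_fft,
    hop_length=hop_length,
    win_length=win_length,
    window=window,
    center=False,
    return_complex=False,   # matches the deprecation warning above
)
print(spec.shape)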
cuFFT throws this runtime error no matter what I try. I have tried disabling mixed precision training, but that had no impact.
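(For completeness, this is how I disabled it, assuming the mixed_precision flag in the VITS training config is the right switch; the rest of my settings are unchanged:)

from TTS.tts.configs.vits_config import VitsConfig

config = VitsConfig(
    # ... my usual dataset/audio settings ...
    mixed_precision=False,   # was True; made no difference to the cuFFT error
)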
Crucially, this only affects the spectrogram generation step of the training pipeline, since that is the only step where cuFFT gets involved.
Inference on a VITS model executes just fine (and I’m loving the speed bump that the 4090 brings!).
I am unsure whether this is strictly a problem between PyTorch and CUDA, with PyTorch needing an update, or whether CUDA itself is the culprit, but it is definitely related to the GPU upgrade, as the code is identical.
Is this a (known) bug and/or is there a workaround?
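In case it helps the discussion, one stopgap I could imagine (untested, and the helper name is mine, not Coqui's) would be to route just the STFT through the CPU and keep everything else on the GPU:

import torch

def wav_to_spec_cpu_fallback(wav, n_fft, hop_length, win_length):
    # Hypothetical workaround sketch: compute the STFT on the CPU to avoid
    # cuFFT entirely, then move the magnitude spectrogram back to the GPU.
    # wav is expected to be a [B, T] (or [T]) waveform tensor on the GPU.
    window = torch.hann_window(win_length)
    spec = torch.stft(
        wav.cpu(),
        n_fft,
        hop_length=hop_length,
        win_length=win_length,
        window=window,
        center=False,
        return_complex=True,
    )
    # Coqui's wav_to_spec returns a magnitude spectrogram, so mirror that here
    return spec.abs().to(wav.device)

I would obviously prefer a proper fix, since this pushes the FFT onto the CPU for every batch and would slow training down considerably.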