Bug: Ubuntu on WSL2 - RTX4090 related cuFFT runtime error

Host System: Windows 10 version 21H2
Nvidia Driver on Host system: 522.25 Studio Version

Videocard: Geforce RTX 4090

CUDA Toolkit in WSL2: cuda-repo-wsl-ubuntu-11-8-local_11.8.0-1_amd64.deb
PyTorch versions tested: latest stable (1.12.1) for CUDA 11.6, nightly for CUDA 11.7
Python version: 3.8.10
WSL2 Guest: Ubuntu 20.04 LTS
WSL2 Guest Kernel Version:

Affected CUDA component: cuFFT

I’m executing the VITS model training code of coqui-ai/TTS (GitHub) without any code edits. Dev branch, commit dae79b0acd3cd316016078c40a1cc553ffb9405e

This worked flawlessly up until the point when I swapped my videocard from a Geforce RTX 3090 to a 4090 yesterday.

Now I am running into a bug(?) in cuFFT:

/home/localuser/coquiTTS/.VENV/lib/python3.8/site-packages/torch/functional.py:632: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:801.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
 ! Run is removed from DATASETS/CodexNarrator/output/CodexNarrator_vits-October-14-2022_10+50PM-dae79b0a
Traceback (most recent call last):
  File "/home/localuser/coquiTTS/.VENV/lib/python3.8/site-packages/trainer/trainer.py", line 1533, in fit
  File "/home/localuser/coquiTTS/.VENV/lib/python3.8/site-packages/trainer/trainer.py", line 1517, in _fit
  File "/home/localuser/coquiTTS/.VENV/lib/python3.8/site-packages/trainer/trainer.py", line 1282, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/home/localuser/coquiTTS/.VENV/lib/python3.8/site-packages/trainer/trainer.py", line 1099, in train_step
    batch = self.format_batch(batch)
  File "/home/localuser/coquiTTS/.VENV/lib/python3.8/site-packages/trainer/trainer.py", line 910, in format_batch
    batch = self.model.format_batch_on_device(batch)
  File "/home/localuser/coquiTTS/TTS/TTS/tts/models/vits.py", line 1505, in format_batch_on_device
    batch["spec"] = wav_to_spec(wav, ac.fft_size, ac.hop_length, ac.win_length, center=False)
  File "/home/localuser/coquiTTS/TTS/TTS/tts/models/vits.py", line 123, in wav_to_spec
    spec = torch.stft(
  File "/home/localuser/coquiTTS/.VENV/lib/python3.8/site-packages/torch/functional.py", line 632, in stft
    return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR

cuFFT throws this runtime error no matter what I try - I’ve tried disabling mixed precision training mode but that had no impact on it.

Crucially, this only affects the spectrogram generation step of the training module, as cuFFT is getting involved at only this step.
Inference on a VITS model executes just fine (and I’m loving the speed bump that the 4090 brings!).
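
One way to take the training code out of the picture is to call torch.stft on the GPU by itself. The following is only a minimal sketch: the function name try_cuda_stft, the random input, and the FFT sizes (1024/256/1024) are assumptions for illustration, not the values from the VITS config. It returns a status string instead of raising, so it also runs on a machine without PyTorch or without a GPU.

```python
def try_cuda_stft(n_fft=1024, hop_length=256, win_length=1024):
    """Run a single torch.stft on the GPU and report what happened.

    The sizes are placeholder values, not the ones used by the VITS config.
    """
    try:
        import torch
    except ImportError:
        return "skipped: torch not installed"
    if not torch.cuda.is_available():
        return "skipped: no CUDA device"

    # One second of random "audio" at 22.05 kHz, created directly on the GPU.
    wav = torch.randn(1, 22050, device="cuda")
    window = torch.hann_window(win_length, device="cuda")
    try:
        spec = torch.stft(
            wav,
            n_fft,
            hop_length=hop_length,
            win_length=win_length,
            window=window,
            center=False,
            return_complex=True,
        )
        torch.cuda.synchronize()  # surface asynchronous CUDA errors here
    except RuntimeError as err:
        return f"failed: {err}"
    return f"ok: {tuple(spec.shape)}"


if __name__ == "__main__":
    print(try_cuda_stft())
```

If this reports the same cuFFT error, the problem sits below the training code, in the PyTorch/cuFFT/driver stack.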

I am unsure if this is strictly a problem between Pytorch and Cuda, with Pytorch needing updating, or if Cuda itself is the culprit here - but it is definitely related to the GPU upgrade as the code is identical.

Is this a (known) bug and/or is there a workaround?

I can confirm this problem on a plain Linux using the code example supplied here:

For the sake of completeness, here is the reproducer:

#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <cufft.h>

#ifdef _CUFFT_H_
    // Map a cufftResult code to a human-readable message.
    static const char *cufftGetErrorString( cufftResult cufft_error_type ) {
        switch( cufft_error_type ) {
            case CUFFT_SUCCESS:
                return "CUFFT_SUCCESS: The CUFFT operation was performed";
            case CUFFT_INVALID_PLAN:
                return "CUFFT_INVALID_PLAN: The CUFFT plan to execute is invalid";
            case CUFFT_ALLOC_FAILED:
                return "CUFFT_ALLOC_FAILED: The allocation of data for CUFFT in memory failed";
            case CUFFT_INVALID_TYPE:
                return "CUFFT_INVALID_TYPE: The data type used by CUFFT is invalid";
            case CUFFT_INVALID_VALUE:
                return "CUFFT_INVALID_VALUE: The data value used by CUFFT is invalid";
            case CUFFT_INTERNAL_ERROR:
                return "CUFFT_INTERNAL_ERROR: An internal error occurred in CUFFT";
            case CUFFT_EXEC_FAILED:
                return "CUFFT_EXEC_FAILED: The execution of a plan by CUFFT failed";
            case CUFFT_SETUP_FAILED:
                return "CUFFT_SETUP_FAILED: The setup of CUFFT failed";
            case CUFFT_INVALID_SIZE:
                return "CUFFT_INVALID_SIZE: The size of the data to be used by CUFFT is invalid";
            case CUFFT_UNALIGNED_DATA:
                return "CUFFT_UNALIGNED_DATA: The data to be used by CUFFT is unaligned in memory";
            default:
                return "Unknown CUFFT Error";
        }
    }
#endif

#define BATCH 1

int main(int argc, char** argv) {
    unsigned long int data_block_length = 50397139;
    cufftResult cufft_result;
    cufftHandle plan;

    // Creating the plan is enough to trigger the error; no transform is executed.
    cufft_result = cufftPlan1d(&plan, data_block_length, CUFFT_Z2Z, BATCH );

    if( cufft_result != CUFFT_SUCCESS ) {
        printf( "CUFFT Error (%s)\n", cufftGetErrorString( cufft_result ) );
    } else {
        cufftDestroy( plan );  // release the plan if it was created
    }
    return 0;
}

It compiles with g++ thefile.cpp -lcufft, and the result is:

CUFFT Error (CUFFT_INTERNAL_ERROR: An internal error occurred in CUFFT)

It seems like @Robert_Crovella might have an idea already, because I just saw that he replied to that very Stack Exchange question. It would be great to get his thoughts on this. :)

Edit: (I’m “out of replies” - apparently that’s a thing now… :D)
@Robert_Crovella Thanks a lot for your feedback! I was on CUDA 11.7 and yes, with 11.8 it does indeed work on Linux.
Therefore I tested Windows 10. A version compiled with CUDA 9.2 worked without problems (I could not yet get my hands on a version compiled with 11.7, but I had one with 11.4 and that also worked).
So I tried CUDA 11.4 on Linux and - lo and behold - it works as well. Seems to really be a bug specific to the 11.7 toolkit (at least on Linux).

The problem on the SO question was basically an out-of-memory error. For the original posting in this thread, I doubt it is an out-of-memory issue. If it were me, and I had a 4090 to test on, I definitely would not use anything other than CUDA 11.8 or newer.

What I can say now is that the same code from my original post executes on Windows 10 using CUDA 11.8 on my 4090 without errors.
So it seems to be a bug just for the WSL/Linux version of CUDA 11.8, in conjunction with the RTX 40 series. That’s all I’ve got in terms of additional insights so far. 🙂
I’m sure one of your engineers can figure it out eventually.

If the pytorch is compiled to use CUDA 11.6 or CUDA 11.7, I doubt it is using CUDA 11.8. That typically doesn’t work. The pythonic pytorch installs that I am familiar with on linux bring their own CUDA libraries for this reason. I can’t tell how it was installed here.

Those CUDA 11.6/11.7 CUFFT libraries may not work correctly with the 4090. That was the reason for my comment. NVIDIA recommends CUDA 11.8 as the minimum for use with RTX 40 series GPUs, and it’s often the case that it takes a while for DL framework “providers” to catch up with these needs and provide a new version that is linked against CUDA 11.8 (in this case) and ships CUDA 11.8 libraries. Or you can build your own pytorch.
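
To see which CUDA version a given pytorch install was actually built against, something like the following could be used. It is a hedged sketch: version_tuple and describe_torch_cuda are hypothetical helper names, and the (11, 8) threshold simply encodes the CUDA 11.8 recommendation mentioned above. It relies on torch.version.cuda, torch.cuda.get_device_capability and torch.cuda.get_arch_list, and skips cleanly if torch is not installed.

```python
def version_tuple(text):
    """Turn a version string such as '11.7' into (11, 7)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def describe_torch_cuda(min_cuda=(11, 8)):
    """Report the CUDA version PyTorch was built with (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        return "skipped: torch not installed"

    built = torch.version.cuda  # None for builds without CUDA support
    if built is None:
        return "skipped: this PyTorch build has no CUDA support"

    verdict = "ok" if version_tuple(built) >= min_cuda else "too old"
    lines = [f"{verdict}: PyTorch {torch.__version__} built against CUDA {built}"]

    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        lines.append(f"device 0 compute capability: sm_{major}{minor}")
        lines.append(f"architectures compiled into this build: {torch.cuda.get_arch_list()}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe_torch_cuda())
```

Note that this reports the CUDA version bundled with pytorch, which is independent of whatever toolkit version is installed system-wide in WSL.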

Using a pytorch set up in an NGC container may be another option. The WSL documentation explains how to launch NGC containers.

To test the theory of a basic CUDA 11.8 CUFFT bug in WSL, I would run a test like what was already suggested - a pure CUFFT code linked against CUDA 11.8.
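
A rough way to run such a check without going through pytorch at all is to load the cuFFT shared library directly and create a small plan. The sketch below uses ctypes for that. probe_cufft is a hypothetical name, the plan size of 1024 is an arbitrary small value chosen to stay far away from any out-of-memory condition, and lib_path should point at the CUDA 11.8 libcufft if several versions are installed (otherwise whichever library the system loader finds is used). The version it prints is the cuFFT library's own version number, not the toolkit version.

```python
import ctypes
import ctypes.util

CUFFT_SUCCESS = 0
CUFFT_Z2Z = 0x69  # double-precision complex-to-complex, as defined in cufft.h


def probe_cufft(nx=1024, batch=1, lib_path=None):
    """Create and destroy a 1-D Z2Z plan; return a short status string."""
    path = lib_path or ctypes.util.find_library("cufft")
    if path is None:
        return "skipped: libcufft not found"
    try:
        lib = ctypes.CDLL(path)
    except OSError as err:
        return f"skipped: {err}"

    # cufftGetVersion(int *version) reports the cuFFT library version.
    version = ctypes.c_int(0)
    lib.cufftGetVersion(ctypes.byref(version))

    # cufftPlan1d(cufftHandle *plan, int nx, cufftType type, int batch);
    # cufftHandle is a plain int in the C API.
    plan = ctypes.c_int(0)
    status = lib.cufftPlan1d(ctypes.byref(plan), nx, CUFFT_Z2Z, batch)
    if status != CUFFT_SUCCESS:
        return f"failed: cufftResult={status} (cuFFT version {version.value})"

    lib.cufftDestroy(plan)
    return f"ok: plan created (cuFFT version {version.value})"


if __name__ == "__main__":
    print(probe_cufft())
```

A result of "failed: cufftResult=5" would correspond to CUFFT_INTERNAL_ERROR, the same code reported in the traceback above.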

Aaah. A lot of great ideas, thank you! I’ll look into it!