CuFFT not working on L4 card but working on T4?

I have an application running in a Docker container that (among other things) uses NVIDIA NeMo’s Clustering Diarizer. This has worked totally fine on a GCP-hosted T4 instance. I was curious to try out the newer L4 GPUs, so I proceeded with my usual installation. On the L4 instance I’m getting this:

 File "nemo/collections/asr/models/clustering_diarizer.py", line 447, in diarize
    self._extract_embeddings(self.subsegments_manifest_path, scale_idx, len(scales))
  File "nemo/collections/asr/models/clustering_diarizer.py", line 359, in _extract_embeddings
    _, embs = self._speaker_model.forward(input_signal=audio_signal, input_signal_length=audio_signal_len)
  File "nemo/core/classes/common.py", line 1087, in __call__
    outputs = wrapped(*args, **kwargs)
  File "nemo/collections/asr/models/label_models.py", line 327, in forward
    processed_signal, processed_signal_len = self.preprocessor(
  File "torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "nemo/core/classes/common.py", line 1087, in __call__
    outputs = wrapped(*args, **kwargs)
  File "torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "nemo/collections/asr/modules/audio_preprocessing.py", line 91, in forward
    processed_signal, processed_length = self.get_features(input_signal, length)
  File "nemo/collections/asr/modules/audio_preprocessing.py", line 292, in get_features
    return self.featurizer(input_signal, length)
  File "torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "nemo/collections/asr/parts/preprocessing/features.py", line 420, in forward
    x = self.stft(x)
  File "nemo/collections/asr/parts/preprocessing/features.py", line 310, in <lambda>
    self.stft = lambda x: torch.stft(
  File "torch/functional.py", line 632, in stft
    return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR

Whilst on my T4 instance everything works perfectly fine.

Info about the two VMs:

T4 Instance

  • VERSION="20.04.5 LTS (Focal Fossa)"
  • Kernel: 5.15.0-1037-gcp
  • nvcc --version = Cuda compilation tools, release 10.1, V10.1.243
  • nvidia-smi → NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0

L4 Instance

  • VERSION="20.04.6 LTS (Focal Fossa)"
  • Kernel: 5.15.0-1037-gcp
  • nvcc --version = Cuda compilation tools, release 10.1, V10.1.243
  • nvidia-smi → NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2

They seemingly have the same CUDA toolkit version (nvcc reports 10.1 on both), so I’m clueless as to why I’m getting this error. Any help would be highly appreciated.

It could be because your version of cuFFT (if it came with the CUDA Toolkit) is too old.

The L4 is an Ada Lovelace card with compute capability 8.9, which CUDA 10.1 does not support. CC 8.9 support was not added until CUDA 11.8.
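The incompatibility can be sketched as a simple lookup. This is an illustrative snippet, not an NVIDIA API: the function name `toolkit_supports` and the table are made up for this example, though the 7.5 (Turing/T4) and 8.9 (Ada Lovelace/L4) entries reflect the support boundaries described above.

```python
# Minimum CUDA toolkit version that can target a given compute capability.
# Illustrative table; only the 7.5 and 8.9 rows matter for this thread.
MIN_CUDA_FOR_CC = {
    (7, 0): (9, 0),    # Volta
    (7, 5): (10, 0),   # Turing, e.g. T4
    (8, 0): (11, 0),   # Ampere (A100)
    (8, 9): (11, 8),   # Ada Lovelace, e.g. L4
    (9, 0): (11, 8),   # Hopper
}

def toolkit_supports(cc, toolkit):
    """Return True if the given CUDA toolkit version supports compute capability cc."""
    required = MIN_CUDA_FOR_CC.get(cc)
    if required is None:
        return False  # unknown compute capability: assume unsupported
    return toolkit >= required

# T4 (CC 7.5) with CUDA 10.1: supported, diarization runs fine.
print(toolkit_supports((7, 5), (10, 1)))   # True
# L4 (CC 8.9) with CUDA 10.1: unsupported, cuFFT fails at runtime.
print(toolkit_supports((8, 9), (10, 1)))   # False
# L4 with CUDA 11.8 or newer: supported.
print(toolkit_supports((8, 9), (11, 8)))   # True
```

So the fix is to rebuild the container against a toolkit (and a PyTorch build) that is at least CUDA 11.8, rather than changing anything on the T4 side.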