CUDA 11.1 NVRTC Runtime Error, GA102 Architecture

CUDA: v11.1.0, cuDNN: v8.0.4.30
OS: Windows 10 version 2004 (build 19041.572)
GPU: GeForce RTX 3090
Driver version: 456.71
PyTorch: 1.8.0 nightly (same error with the just-released PyTorch 1.7)

I’m running into an NVRTC compiler error when trying to run WaveGlow from NVIDIA’s own GitHub repository: https://github.com/NVIDIA/waveglow
The GA102 architecture should be fully supported in CUDA 11.1, correct? At least the release notes say so.
Yet the compiler does not seem to accept the architecture in this case.
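
One thing that might be worth checking (a minimal sketch, not a confirmed diagnosis): a PyTorch binary records the CUDA version it was built against, and that can differ from the system-wide toolkit listed above. The snippet below prints that version next to the GPU's compute capability and the `compute_XY` style name an `-arch` value could be derived from. The helper names `arch_flag` and `describe_environment` are made up for illustration, and the exact string PyTorch passes to NVRTC is an assumption; `torch.version.cuda` and `torch.cuda.get_device_capability` are existing PyTorch calls.

```python
def arch_flag(major, minor):
    """Format a compute capability as a 'compute_XY' architecture name.

    Assumption: this mirrors the general form of an NVRTC -arch value;
    the exact string PyTorch builds internally is not confirmed here.
    """
    return f"compute_{major}{minor}"


def describe_environment():
    """Collect version facts relevant to an NVRTC architecture error.

    Values stay None when PyTorch is not installed or no GPU is visible.
    """
    info = {
        "torch": None,
        "built_with_cuda": None,
        "capability": None,
        "arch_flag": None,
    }
    try:
        import torch
    except ImportError:
        return info

    info["torch"] = torch.__version__
    # CUDA version the PyTorch binary was compiled against (None on CPU builds).
    info["built_with_cuda"] = torch.version.cuda
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        info["capability"] = (major, minor)
        info["arch_flag"] = arch_flag(major, minor)
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

An RTX 3090 has compute capability 8.6. If `built_with_cuda` turns out to be older than the installed 11.1 toolkit, that mismatch would be worth including in a bug report.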

Other projects, such as Tacotron2 and the Hugging Face Transformers library, work flawlessly for both training and inference workloads.

WaveGlow, on the other hand, fails to run, with NVRTC throwing an error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-8b32f5b07435> in <module>
      4 for k in waveglow.convinv:
      5     k.float()
----> 6 denoiser = Denoiser(waveglow)

K:\tts\AMP_TACOTRON2\tacotron2\waveglow\denoiser.py in __init__(self, waveglow, filter_length, n_overlap, win_length, mode)
     28 
     29         with torch.no_grad():
---> 30             bias_audio = waveglow.infer(mel_input, sigma=0.0).float()
     31             bias_spec, _ = self.stft.transform(bias_audio)
     32 

K:\tts\AMP_TACOTRON2\tacotron2\waveglow\glow.py in infer(self, spect, sigma)
    274             audio_1 = audio[:,n_half:,:]
    275 
--> 276             output = self.WN[k]((audio_0, spect))
    277 
    278             s = output[:, n_half:, :]

k:\tts\amp_tacotron2\venv\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
    742             result = self._slow_forward(*input, **kwargs)
    743         else:
--> 744             result = self.forward(*input, **kwargs)
    745         for hook in itertools.chain(
    746                 _global_forward_hooks.values(),

K:\tts\AMP_TACOTRON2\tacotron2\waveglow\glow.py in forward(self, forward_input)
    164                 self.in_layers[i](audio),
    165                 spect[:,spect_offset:spect_offset+2*self.n_channels,:],
--> 166                 n_channels_tensor)
    167 
    168             res_skip_acts = self.res_skip_layers[i](acts)

RuntimeError: nvrtc: error: invalid value for --gpu-architecture (-arch)

nvrtc compilation failed: 

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)


template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}


#define __HALF_TO_US(var) *(reinterpret_cast<unsigned short *>(&(var)))
#define __HALF_TO_CUS(var) *(reinterpret_cast<const unsigned short *>(&(var)))
#if defined(__cplusplus)
  struct __align__(2) __half {
    __host__ __device__ __half() { }

  protected:
    unsigned short __x;
  };

  /* All intrinsic functions are only available to nvcc compilers */
  #if defined(__CUDACC__)
    /* Definitions of intrinsics */
    __device__ __half __float2half(const float f) {
      __half val;
      asm("{  cvt.rn.f16.f32 %0, %1;}\n" : "=h"(__HALF_TO_US(val)) : "f"(f));
      return val;
    }

    __device__ float __half2float(const __half h) {
      float val;
      asm("{  cvt.f32.f16 %0, %1;}\n" : "=f"(val) : "h"(__HALF_TO_CUS(h)));
      return val;
    }

  #endif /* defined(__CUDACC__) */
#endif /* defined(__cplusplus) */
#undef __HALF_TO_US
#undef __HALF_TO_CUS

typedef __half half;

extern "C" __global__
void func_1(half* t0, half* t1, half* aten_mul_flat) {
{
  float v = __half2float(t1[(512 * blockIdx.x + threadIdx.x) % 2816 + 2816 * (((512 * blockIdx.x + threadIdx.x) / 2816) % 256)]);
  float v_1 = __half2float(t0[(512 * blockIdx.x + threadIdx.x) % 2816 + 2816 * (((512 * blockIdx.x + threadIdx.x) / 2816) % 256)]);
  aten_mul_flat[512 * blockIdx.x + threadIdx.x] = __float2half((tanhf(v)) * (1.f / (1.f + (expf(0.f - v_1)))));
}
}
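
For reference, here is my reading of what the generated kernel above computes, written as a plain-Python sketch. It is only an illustration: the function names are hypothetical, the half-precision conversions are left out, and the default sizes (512, 2816, 256) are simply copied from the dump. It suggests the kernel is an ordinary elementwise tanh-times-sigmoid gate, so the source itself does not look like the cause of the `-arch` error.

```python
import math


def kernel_index(block_idx, thread_idx, inner=2816, outer=256, block_dim=512):
    """Reproduce the input index expression used for t0 and t1 in func_1."""
    flat = block_dim * block_idx + thread_idx
    return flat % inner + inner * ((flat // inner) % outer)


def gated_activation(v, v_1):
    """Per-element result of func_1: tanh(v) * sigmoid(v_1).

    The kernel converts half -> float before and float -> half after;
    this sketch stays in Python floats and skips those conversions.
    """
    return math.tanh(v) * (1.0 / (1.0 + math.exp(0.0 - v_1)))


if __name__ == "__main__":
    # For any flat index below inner * outer, the index maps to itself.
    print(kernel_index(block_idx=5, thread_idx=256))
    print(gated_activation(1.0, 0.0))
```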

This is not listed as a known issue anywhere, as far as I can see, so I wanted to ask whether it is a known problem that will be addressed in an upcoming version, or whether I’m overlooking something that would fix it.

Reproduction is trivial: clone the code from GitHub and run it, and you’ll hit the same compiler error every time.

Update: I tried the just-released CUDA 11.1.1 and unfortunately hit the same problem.

I have no experience with the hardware and software mentioned, but based on your description it sounds like you would want to file a bug report with NVIDIA.

Yeah, you’re right. Maybe there aren’t enough people with RTX 30-series cards yet, given the supply situation… I’ll file a report for it.