CUDA 11.1 NVRTC Runtime Error, GA102 Architecture

CUDA: v11.1.0, cuDNN: v8.0.4.30
OS: Windows 10 version 2004 (build 19041.572)
GPU: GeForce RTX 3090
Driver version: 456.71
Using PyTorch 1.8.0 nightly; same error with the just-released PyTorch 1.7

I’m running into an NVRTC compiler error when trying to run WaveGlow from NVIDIA’s own GitHub: GitHub - NVIDIA/waveglow: A Flow-based Generative Network for Speech Synthesis
The GA102 architecture should be fully supported in CUDA 11.1, correct? At least the release notes say so.
Yet it seems the compiler in question cannot handle the architecture in this case.

Other projects, such as Tacotron2, Huggingface transformers library, etc. work flawlessly, both for training and inference workloads.

WaveGlow, on the other hand, fails to run, with NVRTC throwing an error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-8b32f5b07435> in <module>
      4 for k in waveglow.convinv:
      5     k.float()
----> 6 denoiser = Denoiser(waveglow)

K:\tts\AMP_TACOTRON2\tacotron2\waveglow\denoiser.py in __init__(self, waveglow, filter_length, n_overlap, win_length, mode)
     28 
     29         with torch.no_grad():
---> 30             bias_audio = waveglow.infer(mel_input, sigma=0.0).float()
     31             bias_spec, _ = self.stft.transform(bias_audio)
     32 

K:\tts\AMP_TACOTRON2\tacotron2\waveglow\glow.py in infer(self, spect, sigma)
    274             audio_1 = audio[:,n_half:,:]
    275 
--> 276             output = self.WN[k]((audio_0, spect))
    277 
    278             s = output[:, n_half:, :]

k:\tts\amp_tacotron2\venv\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
    742             result = self._slow_forward(*input, **kwargs)
    743         else:
--> 744             result = self.forward(*input, **kwargs)
    745         for hook in itertools.chain(
    746                 _global_forward_hooks.values(),

K:\tts\AMP_TACOTRON2\tacotron2\waveglow\glow.py in forward(self, forward_input)
    164                 self.in_layers[i](audio),
    165                 spect[:,spect_offset:spect_offset+2*self.n_channels,:],
--> 166                 n_channels_tensor)
    167 
    168             res_skip_acts = self.res_skip_layers[i](acts)

RuntimeError: nvrtc: error: invalid value for --gpu-architecture (-arch)

nvrtc compilation failed: 

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)


template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}


#define __HALF_TO_US(var) *(reinterpret_cast<unsigned short *>(&(var)))
#define __HALF_TO_CUS(var) *(reinterpret_cast<const unsigned short *>(&(var)))
#if defined(__cplusplus)
  struct __align__(2) __half {
    __host__ __device__ __half() { }

  protected:
    unsigned short __x;
  };

  /* All intrinsic functions are only available to nvcc compilers */
  #if defined(__CUDACC__)
    /* Definitions of intrinsics */
    __device__ __half __float2half(const float f) {
      __half val;
      asm("{  cvt.rn.f16.f32 %0, %1;}\n" : "=h"(__HALF_TO_US(val)) : "f"(f));
      return val;
    }

    __device__ float __half2float(const __half h) {
      float val;
      asm("{  cvt.f32.f16 %0, %1;}\n" : "=f"(val) : "h"(__HALF_TO_CUS(h)));
      return val;
    }

  #endif /* defined(__CUDACC__) */
#endif /* defined(__cplusplus) */
#undef __HALF_TO_US
#undef __HALF_TO_CUS

typedef __half half;

extern "C" __global__
void func_1(half* t0, half* t1, half* aten_mul_flat) {
{
  float v = __half2float(t1[(512 * blockIdx.x + threadIdx.x) % 2816 + 2816 * (((512 * blockIdx.x + threadIdx.x) / 2816) % 256)]);
  float v_1 = __half2float(t0[(512 * blockIdx.x + threadIdx.x) % 2816 + 2816 * (((512 * blockIdx.x + threadIdx.x) / 2816) % 256)]);
  aten_mul_flat[512 * blockIdx.x + threadIdx.x] = __float2half((tanhf(v)) * (1.f / (1.f + (expf(0.f - v_1)))));
}
}

This is not mentioned as a known issue anywhere (as far as I can see), so I wanted to ask whether it is a known problem that will be addressed in an upcoming version, or whether there is something I’m overlooking that could fix it.

Reproduction is trivial: just clone the code from GitHub and run it; you’ll hit the same compiler error every time.
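For what it’s worth, the failing kernel above (func_1) is just a fused tanh-times-sigmoid over half tensors, so a standalone snippet along these lines should exercise the same fuser/NVRTC path on an affected build. This is my own sketch, not code from the WaveGlow repo:

import torch

# Hypothetical minimal repro, modeled on the generated func_1 kernel above:
# a scripted tanh * sigmoid on half CUDA tensors should get fused and
# compiled through NVRTC.
@torch.jit.script
def gated_activation(a, b):
    return torch.tanh(a) * torch.sigmoid(b)

a = torch.randn(256, 2816, device="cuda", dtype=torch.half)
b = torch.randn(256, 2816, device="cuda", dtype=torch.half)

# the fuser usually only kicks in after a couple of warm-up invocations
for _ in range(3):
    out = gated_activation(a, b)
print(out.shape)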

Update: Tried it with the just-released CUDA 11.1.1; same problem, unfortunately.

I have no experience with the hardware and software mentioned, but based on your description it sounds like you would want to file a bug report with NVIDIA.

Yeah you’re right, maybe there aren’t enough people with RTX30 cards yet, given the supply situation… I’ll file a report for it.

You are not the only one with this problem.
With my RTX 3090 I hit the same error when training EfficientDet.

@jerkadar did you file a bug report? Where can I follow it? Did you find a way to solve it?

Device: RTX 3090
CUDA: 11.1.1
CuDNN: 8.0.5.39
Driver: 455.45.01
OS: Linux 5.9.10-arch1-1 x86_64
Pytorch 1.7.0+cu110

EfficientDet training fails with:

~/.local/share/virtualenvs/icevision-UV98Eklo/lib/python3.8/site-packages/effdet/bench.py in forward(self, x, target)
    126                 class_out, box_out, num_levels=self.num_levels, num_classes=self.num_classes)
--> 127             output['detections'] = _batch_detection(
    128                 x.shape[0], class_out_pp, box_out_pp, self.anchors.boxes, indices, classes,

RuntimeError: nvrtc: error: invalid value for --gpu-architecture (-arch)

nvrtc compilation failed: 

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)


template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}

@desjoerdhaan
Yes, I did file a bug report and NVIDIA got back to me. After some back and forth it became clear that, yes, NVRTC can’t compile for sm_86 (compute capability 8.6) at the moment, but the bug itself is mainly on the PyTorch side: the PyTorch devs could not build binaries for the new RTX GPUs because of a bug in the CUDA Toolkit. A fix for that is expected in PyTorch 1.7.1 (or so they hope), and in the meantime they have added a fix to the 1.8 nightlies. You should install those builds if you can.
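For anyone who wants to check whether their install is affected before upgrading, here is a quick sketch (torch.cuda.get_arch_list() only exists on newer builds, hence the guard):

import torch

# Print what the installed wheel was built with versus what the GPU reports.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))  # (8, 6) on a 3090

# Newer builds also expose the SM list the binary was compiled for;
# an affected wheel will not include sm_86 here.
if hasattr(torch.cuda, "get_arch_list"):
    print("built for:", torch.cuda.get_arch_list())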

For the PyTorch side of things, check out this GitHub issue for more insight: Pytorch 1.7.0 with cuda 11.1.1 and cudnn 8.0.5 · Issue #47669 · pytorch/pytorch · GitHub, and the fixes they implemented: [Release/1.7.1] Add max supported SM for nvrtc-11.0 by malfet · Pull Request #48309 · pytorch/pytorch · GitHub, Add max supported SM for nvrtc-11.0 by malfet · Pull Request #48151 · pytorch/pytorch · GitHub

I have been able to fully utilize pytorch again, using the 1.8 nightly builds.

@jerkadar Thanks a lot, that is very helpful!

@desjoerdhaan Did using the nightly builds work for you? I switched to nightly but am still getting the exact same error that you mention above.

In case anyone else is looking: my workaround was to create a custom DetBenchTrain class that calls a copy of _batch_detection with the @torch.jit.script decorator removed. This at least allowed EfficientDet to run on the RTX 3090, albeit slightly slower.
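For anyone unfamiliar with that trick, the pattern boils down to keeping an undecorated copy of the scripted helper and calling it from a custom bench class, so nothing goes through the fuser and NVRTC is never invoked. A toy sketch of the shape (stand-in names and bodies, not the real effdet internals):

import torch

@torch.jit.script
def detect_scripted(scores, boxes):
    # stand-in for effdet.bench._batch_detection, which is decorated
    # with @torch.jit.script and may get fused/compiled via NVRTC
    return torch.sigmoid(scores) * boxes

def detect_eager(scores, boxes):
    # identical body, no decorator -> runs eagerly, never touches NVRTC
    return torch.sigmoid(scores) * boxes

class EagerBench(torch.nn.Module):
    # stand-in for a custom DetBenchTrain that routes detection through
    # the eager copy instead of the scripted one
    def forward(self, scores, boxes):
        return detect_eager(scores, boxes)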