CUDA 11.1 NVRTC Runtime Error, GA102 Architecture

CUDA: v11.1.0, cuDNN: v8.0.4.30
OS: Windows 10 version 2004 (build 19041.572)
GPU: GeForce RTX 3090
Driver version: 456.71
Using PyTorch 1.8.0 nightly; same error with the just-released PyTorch 1.7

I’m running into an NVRTC compiler error when trying to run WaveGlow from NVIDIA’s own GitHub: GitHub - NVIDIA/waveglow: A Flow-based Generative Network for Speech Synthesis
The GA102 architecture should be fully supported in CUDA 11.1, correct? At least the release notes say so.
Yet it seems the compiler in question cannot handle the architecture in this case.

Other projects, such as Tacotron2, Huggingface transformers library, etc. work flawlessly, both for training and inference workloads.

WaveGlow, on the other hand, fails to run, with NVRTC throwing an error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-8b32f5b07435> in <module>
      4 for k in waveglow.convinv:
      5     k.float()
----> 6 denoiser = Denoiser(waveglow)

K:\tts\AMP_TACOTRON2\tacotron2\waveglow\denoiser.py in __init__(self, waveglow, filter_length, n_overlap, win_length, mode)
     28 
     29         with torch.no_grad():
---> 30             bias_audio = waveglow.infer(mel_input, sigma=0.0).float()
     31             bias_spec, _ = self.stft.transform(bias_audio)
     32 

K:\tts\AMP_TACOTRON2\tacotron2\waveglow\glow.py in infer(self, spect, sigma)
    274             audio_1 = audio[:,n_half:,:]
    275 
--> 276             output = self.WN[k]((audio_0, spect))
    277 
    278             s = output[:, n_half:, :]

k:\tts\amp_tacotron2\venv\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
    742             result = self._slow_forward(*input, **kwargs)
    743         else:
--> 744             result = self.forward(*input, **kwargs)
    745         for hook in itertools.chain(
    746                 _global_forward_hooks.values(),

K:\tts\AMP_TACOTRON2\tacotron2\waveglow\glow.py in forward(self, forward_input)
    164                 self.in_layers[i](audio),
    165                 spect[:,spect_offset:spect_offset+2*self.n_channels,:],
--> 166                 n_channels_tensor)
    167 
    168             res_skip_acts = self.res_skip_layers[i](acts)

RuntimeError: nvrtc: error: invalid value for --gpu-architecture (-arch)

nvrtc compilation failed: 

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)


template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}


#define __HALF_TO_US(var) *(reinterpret_cast<unsigned short *>(&(var)))
#define __HALF_TO_CUS(var) *(reinterpret_cast<const unsigned short *>(&(var)))
#if defined(__cplusplus)
  struct __align__(2) __half {
    __host__ __device__ __half() { }

  protected:
    unsigned short __x;
  };

  /* All intrinsic functions are only available to nvcc compilers */
  #if defined(__CUDACC__)
    /* Definitions of intrinsics */
    __device__ __half __float2half(const float f) {
      __half val;
      asm("{  cvt.rn.f16.f32 %0, %1;}\n" : "=h"(__HALF_TO_US(val)) : "f"(f));
      return val;
    }

    __device__ float __half2float(const __half h) {
      float val;
      asm("{  cvt.f32.f16 %0, %1;}\n" : "=f"(val) : "h"(__HALF_TO_CUS(h)));
      return val;
    }

  #endif /* defined(__CUDACC__) */
#endif /* defined(__cplusplus) */
#undef __HALF_TO_US
#undef __HALF_TO_CUS

typedef __half half;

extern "C" __global__
void func_1(half* t0, half* t1, half* aten_mul_flat) {
{
  float v = __half2float(t1[(512 * blockIdx.x + threadIdx.x) % 2816 + 2816 * (((512 * blockIdx.x + threadIdx.x) / 2816) % 256)]);
  float v_1 = __half2float(t0[(512 * blockIdx.x + threadIdx.x) % 2816 + 2816 * (((512 * blockIdx.x + threadIdx.x) / 2816) % 256)]);
  aten_mul_flat[512 * blockIdx.x + threadIdx.x] = __float2half((tanhf(v)) * (1.f / (1.f + (expf(0.f - v_1)))));
}
}

This is not mentioned as a known issue anywhere (as far as I can see), so I wanted to ask whether it is a known problem that will be addressed in an upcoming version, or whether there is something I’m overlooking that could fix it.

Reproduction is trivial: just clone the code from GitHub and run it; you’ll hit the same compiler error every time.
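For what it’s worth, the failing kernel above (func_1) is just a fused tanh-times-sigmoid over half tensors, so a standalone snippet along these lines should exercise the same fuser/NVRTC path on an affected build. This is my own sketch, not code from the WaveGlow repo:

import torch

# Hypothetical minimal repro, modeled on the generated func_1 kernel above:
# a scripted tanh * sigmoid on half CUDA tensors should get fused and
# compiled through NVRTC.
@torch.jit.script
def gated_activation(a, b):
    return torch.tanh(a) * torch.sigmoid(b)

a = torch.randn(256, 2816, device="cuda", dtype=torch.half)
b = torch.randn(256, 2816, device="cuda", dtype=torch.half)

# the fuser usually only kicks in after a couple of warm-up invocations
for _ in range(3):
    out = gated_activation(a, b)
print(out.shape)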

Update: Tried it with the just-released CUDA 11.1.1; same problem, unfortunately.

I have no experience with the hardware and software mentioned, but based on your description it sounds like you would want to file a bug report with NVIDIA.

Yeah you’re right, maybe there aren’t enough people with RTX30 cards yet, given the supply situation… I’ll file a report for it.

You are not the only one with this problem.
With my RTX 3090 I hit the same error when training EfficientDet.

@jerkadar did you file a bug report? Where can I follow it? Did you find a way to solve it?

Device: RTX 3090
CUDA: 11.1.1
CuDNN: 8.0.5.39
Driver: 455.45.01
OS: Linux 5.9.10-arch1-1 x86_64
Pytorch 1.7.0+cu110

EfficientDet training fails with:

~/.local/share/virtualenvs/icevision-UV98Eklo/lib/python3.8/site-packages/effdet/bench.py in forward(self, x, target)
    126                 class_out, box_out, num_levels=self.num_levels, num_classes=self.num_classes)
--> 127             output['detections'] = _batch_detection(
    128                 x.shape[0], class_out_pp, box_out_pp, self.anchors.boxes, indices, classes,

RuntimeError: nvrtc: error: invalid value for --gpu-architecture (-arch)

nvrtc compilation failed: 

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)


template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}

@desjoerdhaan
Yes, I did file a bug report and NVIDIA got back to me. After some back and forth it became clear that, yes, NVRTC can’t compile for sm_86 (compute capability 8.6) at the moment, but the bug itself is mainly on the PyTorch side: the PyTorch devs could not build binaries for the new RTX GPUs because of a bug in the CUDA Toolkit. A fix for that is expected in PyTorch 1.7.1 (or so they hope), and in the meantime they have added a fix to the 1.8 nightlies. You should install those builds if you can.
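For anyone who wants to check whether their install is affected before upgrading, here is a quick sketch (torch.cuda.get_arch_list() only exists on newer builds, hence the guard):

import torch

# Print what the installed wheel was built with versus what the GPU reports.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))  # (8, 6) on a 3090

# Newer builds also expose the SM list the binary was compiled for;
# an affected wheel will not include sm_86 here.
if hasattr(torch.cuda, "get_arch_list"):
    print("built for:", torch.cuda.get_arch_list())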

For the PyTorch side of things, check out this GitHub issue for more insight: Pytorch 1.7.0 with cuda 11.1.1 and cudnn 8.0.5 · Issue #47669 · pytorch/pytorch · GitHub, and the fixes they implemented: [Release/1.7.1] Add max supported SM for nvrtc-11.0 by malfet · Pull Request #48309 · pytorch/pytorch · GitHub, Add max supported SM for nvrtc-11.0 by malfet · Pull Request #48151 · pytorch/pytorch · GitHub

I have been able to fully utilize pytorch again, using the 1.8 nightly builds.

@jerkadar Thanks a lot, that is very helpful!

@desjoerdhaan Did using the nightly builds work for you? I switched to nightly but am still getting the exact same error that you mention above.

In case anyone else is looking: my workaround was to create a custom DetBenchTrain class that calls a copy of _batch_detection with the @torch.jit.script decorator removed. This at least allowed EfficientDet to run on the RTX 3090, albeit slightly slower.
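For anyone unfamiliar with that trick, the pattern boils down to keeping an undecorated copy of the scripted helper and calling it from a custom bench class, so nothing goes through the fuser and NVRTC is never invoked. A toy sketch of the shape (stand-in names and bodies, not the real effdet internals):

import torch

@torch.jit.script
def detect_scripted(scores, boxes):
    # stand-in for effdet.bench._batch_detection, which is decorated
    # with @torch.jit.script and may get fused/compiled via NVRTC
    return torch.sigmoid(scores) * boxes

def detect_eager(scores, boxes):
    # identical body, no decorator -> runs eagerly, never touches NVRTC
    return torch.sigmoid(scores) * boxes

class EagerBench(torch.nn.Module):
    # stand-in for a custom DetBenchTrain that routes detection through
    # the eager copy instead of the scripted one
    def forward(self, scores, boxes):
        return detect_eager(scores, boxes)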