nvidia-smi can't see RTX 2080 Ti - bad hardware? should I RMA?

I bought an ‘EVGA GeForce RTX 2080 Ti XC GAMING, 11G-P4-2382-KR, 11GB GDDR6, Dual HDB Fans & RGB LED’ on New Year’s Eve.

This is for a headless mini-ITX build (Linux CUDA, for some neural network experimentation):

  • Thermaltake Core V1 [includes 200mm front fan]
  • 2x SilenX EFX-08-12 Effizio 80x25mm 12dBA 25CFM PC Computer Case Fans [rear]
  • EVGA SuperNOVA 550 G2, 80+ GOLD 550W, Fully Modular, EVGA ECO Mode, Power Supply 220-G2-0550-Y1
  • ASUS ROG MAXIMUS VIII IMPACT Motherboard
  • i7-6700k CPU [delidded, liquid metal, copper aftermarket heatsink]
  • Noctua NH-U9S, Premium CPU Cooler with NF-A9 92mm Fan (Brown)
  • Corsair Vengeance LPX 32GB (2x16GB) DDR4 DRAM 2400MHz (PC4-19200) C14 Memory Kit - Black (CMK32GX4M2A2400C14)
  • 500 GB / 476 GiB 2.5" SATA III SSD [MTFDDAK512MBF-1AN1ZABHA] HP/MICRON 512GB MLC SATA3 2.5" SFF M600 SER

and as such there isn’t really much space for a graphics card. After extensive searching, this was the only RTX 2080 Ti card I could find with an appropriate width and length. It fits very nicely and the fans pull in air straight through the case perforations (in through the left side and out the perforated top; I’ve placed the non-perforated see-through cover on the right instead of on top, as they’re interchangeable). Hence cooling shouldn’t be a problem. The power supply is perhaps a little on the light side, but (a) I’ve tried a more powerful one [Seasonic Prime 650 Titanium SSR-650TR 650W 80+ Titanium ATX12V & EPS12V Full Modular 135mm FDB], (b) problems were rampant even with the graphics card severely power limited, and (c) there’s no 3.5" HDD, no optical drive, only a single SSD and a 95W TDP non-overclocked CPU, so I find it hard to believe that the rest of the system needs more than 200W.

This box is rock solid without the graphics card. Indeed, even with an all-4-core overclock to 4.3GHz, no AVX offset, and an mprime AVX workload, the CPU temperature barely hits 69C after an hour (but I reverted everything to stock while installing/testing the graphics card).

I’ve also done limited testing in a second larger mini ITX box:

  • Phanteks Enthoo Evolv iTX Case, Window PH-ES215P_BK Black [incl. 200mm front fan]
  • Phanteks 140mm Case/Radiator Cooling Fan (PH-F140XP_BK) [rear case fan]
  • Seasonic Prime 650 Titanium SSR-650TR 650W 80+ Titanium
  • ASUS ROG Strix Z390-I Gaming
  • i7-8086k delidded, liquid metal, copper heatsink
  • Noctua NH-D15S with 2nd NF-A15 fan
  • Mushkin Redline Series - DDR4 DRAM - 32GB (2x16GB) Memory Kit DIMM - 2666MHz (PC4-21300) CL-16 - 288-pin 1.2V Desktop RAM - Non-ECC - Dual-Channel
  • Intel SSD 660p Series (512GB M.2 80mm PCIe 3.0 x 4 3D2 QLC) 2 2287" (978349)
  • Samsung SSD 840 PRO Series

I’ve tried:

  • Fedora 27
  • Fedora 28
  • Fedora 29

Drivers:

  • from the NVIDIA CUDA yum repository (410.79)
  • from the NVIDIA .run installers (410.93, 415.27, 418.30)
  • from the rpmfusion yum repository (415.27 for F29, older versions for F28/F27)
    (possibly some older ones, I’ve been at it on and off for a month now)

I’ve tried most of the combinations of F27/28/29 with the 5 nvidia drivers on the first box, and only a few combinations of F29 on the second box.

At this point I’ve spent more time troubleshooting this setup/card than the card is worth (it cost $1400 including tax).

I did initially get it to work, but it wasn’t stable (under a CUDA NN workload it would error out until reboot). Reducing power draw (via nvidia-smi) didn’t help. I varied drivers and the like but, in hindsight, with every reboot things seemed to get worse. Eventually (within 2 days of buying it) it was pretty much unusable and I exchanged it at the store for another one. This one appears to be no better.

The core problem appears to be that ‘nvidia-smi’ causes:

 11 kernel: NVRM: RmInitAdapter failed! (0x24:0x65:1090)
 14 kernel: NVRM: RmInitAdapter failed! (0x26:0x65:1127)
 24 kernel: NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
  6 kernel: NVRM: RmInitAdapter failed! (0x26:0xffff:1107)
  4 kernel: NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
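
A tally like the one above can be produced from the kernel log with standard tools. This is only a rough sketch: it assumes the messages are readable through the systemd journal, and the function name tally_rminit is made up for illustration.

```shell
# Count each distinct RmInitAdapter failure code found in log text on stdin.
# tally_rminit is a hypothetical helper name, not part of any NVIDIA tool.
tally_rminit() {
    # Keep only the message and its code, then count identical lines,
    # most frequent first.
    grep -o 'NVRM: RmInitAdapter failed! ([^)]*)' \
        | sort \
        | uniq -c \
        | sort -rn
}

# Usage on a live system (assumes a systemd journal is available):
#   journalctl -k --no-pager | tally_rminit
```

Comparing the tally across driver versions or across the two boxes would show whether the mix of error codes changes with the software or follows the card.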

lspci -nn | egrep -i 'vga|nvidia'

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
01:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Controller [10de:1ad6] (rev a1)
01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 UCSI Controller [10de:1ad7] (rev a1)
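
To check after each reboot that the card still enumerates all of its PCI functions, the lspci output can be counted by NVIDIA's vendor ID (10de, as in the listing above). A small sketch; the function name count_nvidia_functions is made up for illustration.

```shell
# Count PCI functions whose [vendor:device] ID uses NVIDIA's vendor ID 10de.
# Reads `lspci -nn` output on stdin; count_nvidia_functions is a hypothetical name.
count_nvidia_functions() {
    # grep -c prints the number of matching lines (one line per PCI function).
    grep -c '\[10de:[0-9a-f]\{4\}\]'
}

# Usage: lspci -nn | count_nvidia_functions
# For the listing above this reports 4 (VGA, audio, USB, UCSI).
```

A count that drops between boots would suggest the device is disappearing at the PCI level, not only failing inside the driver.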

Interestingly, BIOS graphics mode works and boot-time text mode works.
But as soon as ‘nvidia-smi’ is run, one of the above errors gets logged,
the screen switches from ‘normal’ text mode to a mostly normal text mode in which a small number of characters are multi-color static, and nvidia-smi can’t find the device.

nvidia-smi

No devices were found

(this didn’t use to be the case… it used to work)

So… my questions are: is this a driver problem? is this a hardware problem? did I get unlucky and hit two bad cards in a row? should I RMA? do the RTX 2080 Ti’s actually work for Linux CUDA workloads?
nvidia-bug-report.log.gz (523 KB)

Worth noting: I’ve also experimented with the kernel command line options 'pcie_aspm=off rcutree.rcu_idle_gp_delay=1'; they don’t help.
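
When testing options like these, it is worth confirming they actually reached the running kernel. A minimal sketch, assuming the command line is read from /proc/cmdline; check_kernel_args is a made-up helper name, and the grubby line in the comment is the usual way to add such options on Fedora, not something taken from this post.

```shell
# Report whether each given option appears as a whole token in a kernel
# command line. check_kernel_args is a hypothetical helper: the first
# argument is the file to read (normally /proc/cmdline), the rest are
# the options to look for. Returns 1 if any option is missing.
check_kernel_args() {
    cmdline_file=$1
    shift
    missing=0
    for opt in "$@"; do
        # One token per line, then an exact fixed-string whole-line match,
        # so "delay=1" does not match "delay=10".
        if tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$opt"; then
            echo "present: $opt"
        else
            echo "missing: $opt"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_kernel_args /proc/cmdline pcie_aspm=off rcutree.rcu_idle_gp_delay=1
# Adding the options on Fedora (assumption, needs root and a reboot):
#   sudo grubby --update-kernel=ALL --args="pcie_aspm=off rcutree.rcu_idle_gp_delay=1"
```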

Here are the relevant parts of the kernel log from Fedora 29 with the nvidia drivers from the rpmfusion yum repository:
[ 41.186105] nvidia: loading out-of-tree module taints kernel.
[ 41.186411] nvidia: module license 'NVIDIA' taints kernel.
[ 41.186706] Disabling lock debugging due to kernel taint
[ 41.189868] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 41.197039] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[ 41.197875] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 41.387443] ahci 0000:00:17.0: port does not support device sleep
[ 41.420822] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 415.27 Thu Dec 20 17:25:03 CST 2018 (using threaded interrupts)
[ 41.434383] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input15
[ 41.435479] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input16
[ 41.436567] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input17
[ 41.437651] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input18
[ 41.440446] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 511
[ 41.451365] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 415.27 Thu Dec 20 17:06:08 CST 2018
[ 41.456363] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 42.088270] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[ 42.088595] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 42.088998] [drm:nv_drm_load [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[ 42.089699] [drm:nv_drm_probe_devices [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000100] Failed to register device

Worked - degraded - dead. RMA it.