When using more than one Nvidia RTX 2070 the other GPUs fail to initialize

When attempting to use more than one Nvidia RTX 2070, every GPU after the first fails to initialize. Running dmesg | grep NVRM shows the following errors:
rm_init_adapter failed for device bearing minor number 1
rm_init_adapter failed for device bearing minor number 2

rm_init_adapter failed for device bearing minor number 9

Running Debian 9 with CUDA 10.0. I tried both driver versions 410.93 and 415.27, with the same result.
nvidia-bug-report.log.gz (845 KB)
lspci.log (3.27 KB)
nvrm.log (5.66 KB)
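
For anyone comparing runs, the failing devices can be pulled out of the dmesg output mechanically. This is only a minimal sketch: the function name is made up for illustration, and it assumes the NVRM message keeps the exact wording quoted above.

```python
import re

# Matches the NVRM message quoted above and captures the device minor number.
FAILED_INIT = re.compile(
    r"rm_init_adapter failed for device bearing minor number (\d+)"
)

def failed_minors(dmesg_text):
    """Return the sorted, de-duplicated minor numbers that failed to initialize."""
    return sorted({int(m.group(1)) for m in FAILED_INIT.finditer(dmesg_text)})

# Three of the lines reported in this post.
sample = """\
rm_init_adapter failed for device bearing minor number 1
rm_init_adapter failed for device bearing minor number 2
rm_init_adapter failed for device bearing minor number 9
"""
print(failed_minors(sample))  # [1, 2, 9]
```

The minor numbers can then be matched against the "Device Minor" field in /proc/driver/nvidia/gpus/*/information to find the corresponding bus location.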

After further analysis today, the possibility of hardware issues is ruled out:

  • When limiting the attempt to only two GPUs, the issue remains.
  • When trying each GPU one at a time, both work without any issues.

However, there is something interesting in the logs that were previously submitted: the first GPU gets its own IRQ (146), while all remaining GPUs get the same IRQ (147) and their UUIDs are not retrieved.
Also, when listing the full list of cards, it seems that IRQs are shared as well.
Could this be the cause of the issues?
Please provide instructions on how to further diagnose the problem.

Thank you.

*** /proc/driver/nvidia/./gpus/0000:01:00.0/information
*** ls: -r--r--r-- 1 root root 0 2019-01-20 12:53:29.382727370 -0500 /proc/driver/nvidia/./gpus/0000:01:00.0/information
Model: GeForce RTX 2070
IRQ: 146
GPU UUID: GPU-89d75a90-99a5-426b-45f6-c418350c1a58
Video BIOS: 90.06.0b.80.00
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:01:00.0
Device Minor: 0
Blacklisted: No


*** /proc/driver/nvidia/./gpus/0000:01:00.0/registry
*** ls: -rw-r--r-- 1 root root 0 2019-01-20 12:53:29.398725606 -0500 /proc/driver/nvidia/./gpus/0000:01:00.0/registry
Binary: ""


*** /proc/driver/nvidia/./gpus/0000:02:00.0/information
*** ls: -r--r--r-- 1 root root 0 2019-01-20 12:53:29.414723842 -0500 /proc/driver/nvidia/./gpus/0000:02:00.0/information
Model: Unknown
IRQ: 147
GPU UUID: GPU-???-???-???-???-???
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:02:00.0
Device Minor: 1


*** /proc/driver/nvidia/./gpus/0000:02:00.0/registry
*** ls: -rw-r--r-- 1 root root 0 2019-01-20 12:53:29.526711492 -0500 /proc/driver/nvidia/./gpus/0000:02:00.0/registry
Binary: ""


*** /proc/driver/nvidia/./gpus/0000:03:00.0/information
*** ls: -r--r--r-- 1 root root 0 2019-01-20 12:53:29.534710610 -0500 /proc/driver/nvidia/./gpus/0000:03:00.0/information
Model: Unknown
IRQ: 147
GPU UUID: GPU-???-???-???-???-???
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:03:00.0
Device Minor: 2
Blacklisted: No

[…]

*** /proc/asound/cards
*** ls: -r--r--r-- 1 root root 0 2019-01-20 12:53:30.698582743 -0500 /proc/asound/cards
0 [NVidia ]: HDA-Intel - HDA NVidia
HDA NVidia at 0x57080000 irq 17
1 [NVidia_6 ]: HDA-Intel - HDA NVidia
HDA NVidia at 0x55080000 irq 16
2 [NVidia_2 ]: HDA-Intel - HDA NVidia
HDA NVidia at 0x53080000 irq 19
3 [NVidia_5 ]: HDA-Intel - HDA NVidia
HDA NVidia at 0x51080000 irq 16
4 [NVidia_1 ]: HDA-Intel - HDA NVidia
HDA NVidia at 0x4f080000 irq 18
5 [NVidia_3 ]: HDA-Intel - HDA NVidia
HDA NVidia at 0x4d080000 irq 19
6 [NVidia_8 ]: HDA-Intel - HDA NVidia
HDA NVidia at 0x4b080000 irq 16
7 [NVidia_9 ]: HDA-Intel - HDA NVidia
HDA NVidia at 0x49080000 irq 17
8 [NVidia_4 ]: HDA-Intel - HDA NVidia
HDA NVidia at 0x47080000 irq 18
9 [NVidia_B ]: HDA-Intel - HDA NVidia
HDA NVidia at 0x45080000 irq 19
10 [NVidia_7 ]: HDA-Intel - HDA NVidia
HDA NVidia at 0x43080000 irq 16
11 [NVidia_A ]: HDA-Intel - HDA NVidia
HDA NVidia at 0x41080000 irq 18
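
The IRQ observation raised earlier in this post can be checked across all GPUs at once. The sketch below is a rough illustration, not a diagnostic tool: the helper names are hypothetical, and it assumes each information file keeps the "Key: value" layout shown in the dump above. It groups bus locations by IRQ and lists the GPUs whose model reads "Unknown".

```python
from collections import defaultdict

def parse_information(text):
    """Turn one 'Key: value' block into a dict, skipping the '***' banner lines."""
    info = {}
    for line in text.splitlines():
        if line.startswith("***") or ":" not in line:
            continue
        # Split on the first colon only, so values like 0000:02:00.0 stay intact.
        key, _, value = line.partition(":")
        info[key.strip()] = value.strip()
    return info

def summarize(blocks):
    """Return (IRQs used by more than one GPU, bus locations with Model: Unknown)."""
    by_irq = defaultdict(list)
    uninitialized = []
    for info in map(parse_information, blocks):
        bus = info.get("Bus Location", "?")
        by_irq[info.get("IRQ", "?")].append(bus)
        if info.get("Model") == "Unknown":
            uninitialized.append(bus)
    shared = {irq: buses for irq, buses in by_irq.items() if len(buses) > 1}
    return shared, uninitialized

# Abbreviated copies of the three blocks listed in this post.
blocks = [
    "Model: GeForce RTX 2070\nIRQ: 146\nBus Location: 0000:01:00.0",
    "Model: Unknown\nIRQ: 147\nBus Location: 0000:02:00.0",
    "Model: Unknown\nIRQ: 147\nBus Location: 0000:03:00.0",
]
shared, uninitialized = summarize(blocks)
print(shared)         # {'147': ['0000:02:00.0', '0000:03:00.0']}
print(uninitialized)  # ['0000:02:00.0', '0000:03:00.0']
```

On a live system the blocks would come from reading each /proc/driver/nvidia/gpus/*/information file. A shared IRQ in this summary does not by itself show cause and effect.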

Hi,

I connected two RTX 2070 GPUs to a Precision WorkStation T7500 and to an Alienware Area-51 R3, installed driver 415.27, and connected one monitor over the DVI/HDMI port, but was unable to reproduce the issue.

I also tested with Debian 9.1 and CentOS 7.5 in EFI mode, but could not reproduce it there either.

Can you please share an nvidia-bug-report captured while the issue is reproduced with two GPUs, along with the output of xrandr --verbose?

I suspect the shared MSI is just a side effect of the GPUs failing on driver init. The kernel tries to assign 147 to the second GPU; that fails, so it tries to assign it to the next one.
Did you try using a more recent kernel than 4.9?

Here are the results after several more attempts.
The bottom line: it is still not working, even on brand-new installs of either Debian or Ubuntu. Each was installed with the default settings (plus GNOME for Debian), then all available upgrades and dist-upgrades were applied, and then the driver was installed.

  • A brand-new clean install of Debian 9.7 (the latest release, from a few days ago) has the latest kernel, 4.9.
  • A brand-new clean install of Ubuntu 18.04 LTS has the latest kernel, 4.15.

Using the current Debian installation:

  • Tried on multiple systems.
  • Running multiple 1070 and 1070 Ti GPUs works on the very system where running two 2070s or two 2060s does not, using the same driver.
  • On only one system did it actually manage to load the driver for two 2070s; on another, identical system it did not. I did not try with three 2070s.
  • Tried the following options:
    • nomodeset
    • options nvidia_drm modeset=1
    • both together
  • Confirmed the modeset setting using
    cat /sys/module/nvidia_drm/parameters/modeset
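
The same confirmation can be scripted when comparing several machines. This is a small sketch with hypothetical function names; it assumes the parameter file reads back as Y or N, which is how the kernel prints boolean module parameters.

```python
from pathlib import Path

# The sysfs path used in the cat command above.
MODESET_PARAM = Path("/sys/module/nvidia_drm/parameters/modeset")

def modeset_enabled(raw):
    """Interpret the file content; boolean module parameters read back as Y or N."""
    return raw.strip().upper() in {"Y", "1"}

def read_modeset(path=MODESET_PARAM):
    """Return True or False, or None when the file is missing (nvidia_drm not loaded)."""
    try:
        return modeset_enabled(path.read_text())
    except FileNotFoundError:
        return None

print(read_modeset())
```

A result of None means the nvidia_drm module is not loaded at all, which is a different situation from modeset being off.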

Running xrandr --verbose does not work whenever the driver fails to initialize: it says it cannot connect to the display, which makes sense since the screen goes blank soon after boot.
Please find attached various logs from these attempts.

03.2x2070.nvidia-bug-report.log.gz (1020 KB)
dmesg.with.modesetON.log (95.9 KB)
03.3x1070.with.2x2070.nvidia-bug-report.log.gz (1.67 MB)
fresh.debian.9.7.xrandr.log (19 Bytes)
fresh.ubuntu.18.04.1.dmesg.log (64.9 KB)
02.working.nvidia-bug-report.log.gz (2.84 MB)
fresh.debian.9.7.nvidia-bug-report.log.gz (1020 KB)
fresh.ubuntu.18.04.1.nvidia-bug-report.log.gz (1000 KB)

I am looking for a system with an Asus B250 MINING EXPERT motherboard so that I can attempt to reproduce the issue.
I have tried on other systems but was unable to reproduce it.
If you have observed a similar issue on other systems as well, please let me know.

I have a possibly similar issue, with one RTX 2070 and two GTX 1060s. Linux 4.15, driver 415.13.

The 1060s work fine, the 2070 does not:

Jan 29 10:25:34 NV kernel: nvidia: loading out-of-tree module taints kernel.
Jan 29 10:25:34 NV kernel: nvidia: module license 'NVIDIA' taints kernel.
Jan 29 10:25:34 NV kernel: Disabling lock debugging due to kernel taint
Jan 29 10:25:34 NV kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Jan 29 10:25:34 NV kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 243
Jan 29 10:25:34 NV kernel: nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
Jan 29 10:25:34 NV kernel: nvidia 0000:02:00.0: enabling device (0000 -> 0003)
Jan 29 10:25:34 NV kernel: nvidia 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Jan 29 10:25:34 NV kernel: nvidia 0000:03:00.0: enabling device (0000 -> 0003)
Jan 29 10:25:34 NV kernel: nvidia 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Jan 29 10:25:34 NV kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  415.13  Wed Oct 31 19:07:36 CDT 2018 (using threaded interrupts)
Jan 29 10:25:34 NV kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  415.13  Wed Oct 31 18:49:37 CDT 2018
Jan 29 10:25:34 NV kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Jan 29 10:25:34 NV kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Jan 29 10:25:34 NV kernel: caller os_map_kernel_space.part.6+0x6d/0x80 [nvidia] mapping multiple BARs
Jan 29 10:25:35 NV kernel: NVRM: GPU at PCI:0000:01:00: GPU-674eda50-4e64-c592-fe69-17905595ad45
Jan 29 10:25:35 NV kernel: NVRM: GPU Board Serial Number: 
Jan 29 10:25:35 NV kernel: NVRM: Xid (PCI:0000:01:00): 61, 0c06(2f88) 00000000 00000000
Jan 29 10:25:43 NV kernel: nvidia-modeset: ERROR: GPU:0: Display engine push buffer channel allocation failed: 0x65 (Call timed out [NV_ERR_TIMEOUT])
Jan 29 10:25:43 NV kernel: nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine core DMA push buffer
Jan 29 10:26:15 NV kernel: [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
Jan 29 10:26:15 NV kernel: [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
Jan 29 10:26:15 NV kernel: [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver

And subsequently:

Jan 29 10:26:22 NV kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Jan 29 10:26:22 NV kernel: caller os_map_kernel_space.part.6+0x6d/0x80 [nvidia] mapping multiple BARs
Jan 29 10:26:26 NV kernel: NVRM: RmInitAdapter failed! (0x24:0x65:1062)
Jan 29 10:26:26 NV kernel: NVRM: rm_init_adapter failed for device bearing minor number 0

nvidia-bug-report.log.gz (1.45 MB)
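
One detail worth noting in the log above: the middle field of the RmInitAdapter failure (0x65) is the same value that the nvidia-modeset line reports as NV_ERR_TIMEOUT. The sketch below splits the three fields apart so the two can be compared. It rests on an assumption, not on documented behavior: that the middle field is a status code from the same numbering. The field names are made up for illustration, and only the one code that this log names itself is included.

```python
import re

RM_INIT = re.compile(
    r"RmInitAdapter failed! \((0x[0-9a-f]+):(0x[0-9a-f]+):(\d+)\)"
)

# Only the code that the nvidia-modeset line in this log spells out.
KNOWN_STATUS = {0x65: "NV_ERR_TIMEOUT"}

def decode_rm_init(line):
    """Split the three fields of an RmInitAdapter failure line, or return None."""
    m = RM_INIT.search(line)
    if m is None:
        return None
    first = int(m.group(1), 16)
    status = int(m.group(2), 16)   # assumed to be a status code (see lead-in)
    last = int(m.group(3))
    return first, status, KNOWN_STATUS.get(status, "unknown"), last

line = "Jan 29 10:26:26 NV kernel: NVRM: RmInitAdapter failed! (0x24:0x65:1062)"
print(decode_rm_init(line))  # (36, 101, 'NV_ERR_TIMEOUT', 1062)
```

If that assumption holds, both failures point at a timeout while talking to the GPU, not at two unrelated problems.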

The initialization issue was caused by the BIOS/UEFI configuration. After changing the PCIe link speed from 1x to at least 2x, the driver does manage to initialize.
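
To verify what each slot actually negotiated after such a BIOS/UEFI change, the LnkSta lines of lspci -vv can be summarized. This is a hedged sketch: the function name is hypothetical, the sample input is made up for illustration and is not taken from the attached logs, and the regular expression assumes the usual "LnkSta: Speed …GT/s, Width x…" layout.

```python
import re

DEVICE = re.compile(r"^([0-9a-f:.]+) ")  # unindented header line, e.g. "01:00.0 ..."
LNKSTA = re.compile(r"LnkSta:\s+Speed ([\d.]+)GT/s(?: \([^)]*\))?, Width x(\d+)")
GENERATION = {"2.5": "Gen1", "5": "Gen2", "8": "Gen3"}  # transfer rate in GT/s

def link_status(lspci_vv):
    """Map each device address to (PCIe generation, lane width) from LnkSta lines."""
    result = {}
    device = None
    for line in lspci_vv.splitlines():
        header = DEVICE.match(line)
        if header:
            device = header.group(1)
            continue
        status = LNKSTA.search(line)
        if status and device:
            speed, width = status.groups()
            result[device] = (GENERATION.get(speed, speed + "GT/s"), int(width))
    return result

# Made-up sample in the shape of lspci -vv output (not from this thread's logs).
sample = (
    "01:00.0 VGA compatible controller: example device\n"
    "\t\tLnkCap:\tPort #0, Speed 8GT/s, Width x16\n"
    "\t\tLnkSta:\tSpeed 2.5GT/s, Width x1\n"
)
print(link_status(sample))  # {'01:00.0': ('Gen1', 1)}
```

Reporting generation and lane width separately avoids the ambiguity between the two when describing the setting.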
Now there is another issue.

  • When using more than two or three 2070s, dmesg fills up with the following errors.
  • When using more than about four or five 2070s, the screen starts scrolling these errors very fast and the system becomes unresponsive, even to SSH connections.
  • Attempted to:
    • change a few ACPI-related settings in the BIOS
    • boot the previously mentioned freshly installed Ubuntu 18.04.1 LTS; same issue
    • boot Debian with kernel parameters:
      • pci=noaer
      • pci=nomsi
      • pcie_aspm=off
      • nomodeset
  • The errors:

    [   31.235957] dpc 0000:00:1c.5:pcie010: DPC containment event, status:0x1f00 source:0x0000
    [   31.235958] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
    [   31.235962] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
    [   31.235963] pcieport 0000:00:1c.5:   device [8086:a33d] error status/mask=00000001/00002000
    [   31.235964] pcieport 0000:00:1c.5:    [ 0] Receiver Error         (First)
    [   31.236286] dpc 0000:00:1c.5:pcie010: DPC containment event, status:0x1f00 source:0x0000
    [   31.236290] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
    [   31.236293] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
    [   31.236295] pcieport 0000:00:1c.5:   device [8086:a33d] error status/mask=00000001/00002000
    [   31.236296] pcieport 0000:00:1c.5:    [ 0] Receiver Error         (First)
    

    ubuntu.nvidia-bug-report.log.gz (2.66 MB)
    debian.nvidia-bug-report.log.gz (2.72 MB)
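
When the log scrolls this fast, it helps to count the errors per PCIe port to see which slots are affected. This is a minimal sketch with a hypothetical function name; it assumes the "PCIe Bus Error" line keeps the layout shown in the excerpt above.

```python
import re
from collections import Counter

# Captures port address, severity, and error type from the "PCIe Bus Error" line.
AER = re.compile(
    r"pcieport ([0-9a-f:.]+): PCIe Bus Error: severity=(\w+), type=([\w ]+), id="
)

def count_bus_errors(dmesg_text):
    """Count AER bus errors per (port, severity, type)."""
    return Counter(m.groups() for m in AER.finditer(dmesg_text))

# The two "PCIe Bus Error" lines from the excerpt above.
sample = """\
[   31.235962] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
[   31.236293] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
"""
for (port, severity, kind), n in count_bus_errors(sample).most_common():
    print(port, severity, kind, n)  # 0000:00:1c.5 Corrected Physical Layer 2
```

Comparing the per-port counts after moving cards between slots would show whether the errors follow a particular port or a particular card.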

Crosstalk from bad risers that are not fit for Gen2 speeds.