PCIE Bus Error with two NVIDIA cards on Linux

Hi,

We tried installing an older Gigabyte Windforce GTX 770 8GB card in a newer PC that has an ASUS PRIME X399-A mainboard and a Zotac GTX 1070 8GB, hoping to use it for GPU rendering (not for SLI). The computer runs Kubuntu 17.10 with the 384.130 driver and the 4.13.0-45 kernel.

Once the 770 is installed and Linux has booted, the whole system freezes every few seconds, each time for quite a few seconds (maybe 10 or more). It’s basically unusable.

The kernel log shows a number of errors:

Jun 17 17:31:48 amd kernel: [   37.931432] nvidia-modeset: Allocated GPU:0 (GPU-137931c6-ad7d-7ece-8aef-c9f09f211c2d) @ PCI:0000:41:00.0
Jun 17 17:31:53 amd kernel: [   38.337823] dpc 0000:00:03.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
Jun 17 17:31:53 amd kernel: [   38.337826] pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
Jun 17 17:31:53 amd kernel: [   38.356511] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
Jun 17 17:31:53 amd kernel: [   38.356513] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001100/00006000
Jun 17 17:31:53 amd kernel: [   38.356514] pcieport 0000:00:03.1:    [ 8] RELAY_NUM Rollover
Jun 17 17:31:53 amd kernel: [   38.356515] pcieport 0000:00:03.1:    [12] Replay Timer Timeout
Jun 17 17:31:53 amd kernel: [   38.436821] dpc 0000:00:03.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
Jun 17 17:31:53 amd kernel: [   38.436828] pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
Jun 17 17:31:53 amd kernel: [   38.454678] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
Jun 17 17:31:53 amd kernel: [   38.454679] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001100/00006000
Jun 17 17:31:53 amd kernel: [   38.454680] pcieport 0000:00:03.1:    [ 8] RELAY_NUM Rollover
Jun 17 17:31:53 amd kernel: [   38.454681] pcieport 0000:00:03.1:    [12] Replay Timer Timeout
Jun 17 17:31:53 amd kernel: [   38.535820] dpc 0000:00:03.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
Jun 17 17:31:53 amd kernel: [   38.535822] pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
Jun 17 17:31:53 amd kernel: [   38.554995] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
Jun 17 17:31:53 amd kernel: [   38.554996] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001000/00006000
Jun 17 17:31:53 amd kernel: [   38.554997] pcieport 0000:00:03.1:    [12] Replay Timer Timeout
Jun 17 17:31:53 amd kernel: [   38.612823] dpc 0000:00:03.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
Jun 17 17:31:53 amd kernel: [   38.612829] pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
Jun 17 17:31:53 amd kernel: [   38.629099] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
Jun 17 17:31:53 amd kernel: [   38.629100] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001100/00006000
Jun 17 17:31:53 amd kernel: [   38.629101] pcieport 0000:00:03.1:    [ 8] RELAY_NUM Rollover
Jun 17 17:31:53 amd kernel: [   38.629102] pcieport 0000:00:03.1:    [12] Replay Timer Timeout
Jun 17 17:31:53 amd kernel: [   38.678822] dpc 0000:00:03.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
Jun 17 17:31:53 amd kernel: [   38.678824] pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
Jun 17 17:31:53 amd kernel: [   39.421055] dpc 0000:00:03.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
Jun 17 17:31:53 amd kernel: [   39.853649] usb 5-2: usbfs: USBDEVFS_CONTROL failed cmd vrlservice.bin rqt 192 rq 4 len 6 ret -110
Jun 17 17:31:53 amd kernel: [   39.853660] dpc 0000:00:03.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
Jun 17 17:31:53 amd kernel: [   39.853703] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
Jun 17 17:31:53 amd kernel: [   39.853705] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001100/00006000
Jun 17 17:31:53 amd kernel: [   39.853706] pcieport 0000:00:03.1:    [ 8] RELAY_NUM Rollover
Jun 17 17:31:53 amd kernel: [   39.853707] pcieport 0000:00:03.1:    [12] Replay Timer Timeout
Jun 17 17:31:53 amd kernel: [   39.853717] pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
Jun 17 17:31:53 amd kernel: [   39.853723] pcieport 0000:00:03.1: can't find device of ID0000
Jun 17 17:31:53 amd kernel: [   39.853724] pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
Jun 17 17:31:53 amd kernel: [   39.853738] pcieport 0000:00:03.1: can't find device of ID0000
Jun 17 17:31:53 amd kernel: [   39.855971] NVRM: GPU at PCI:0000:0b:00: GPU-ee0a5b68-20ee-5367-7e79-eb82b3751291
    Jun 17 17:31:53 amd kernel: [   39.855974] NVRM: Xid (PCI:0000:0b:00): 62, 1927(16e4) 00000000 00000000
    Jun 17 17:32:17 amd kernel: [   66.318961] NVRM: RmInitAdapter failed! (0x53:0xffff:1908)
    Jun 17 17:32:17 amd kernel: [   66.319007] NVRM: rm_init_adapter failed for device bearing minor number 0
    Jun 17 17:32:30 amd kernel: [   79.524688] FAT-fs (sde1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
    Jun 17 17:32:30 amd kernel: [   79.531561] FAT-fs (sdc1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
    Jun 17 17:32:30 amd kernel: [   79.554405] FAT-fs (sdg1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
    Jun 17 17:32:30 amd kernel: [   79.596072] EXT4-fs (sda4): mounted filesystem with ordered data mode. Opts: (null)
    Jun 17 17:32:35 amd kernel: [   80.380347] dpc 0000:00:03.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
    Jun 17 17:32:35 amd kernel: [   80.380353] pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
    Jun 17 17:32:35 amd kernel: [   80.399229] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
    Jun 17 17:32:35 amd kernel: [   80.399232] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001000/00006000
    Jun 17 17:32:35 amd kernel: [   80.399234] pcieport 0000:00:03.1:    [12] Replay Timer Timeout
    Jun 17 17:32:35 amd kernel: [   80.457301] dpc 0000:00:03.1:pcie010: DPC containment event, status:0x1f00 source:0x0000
    Jun 17 17:32:35 amd kernel: [   80.457305] pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
    Jun 17 17:32:35 amd kernel: [   80.473406] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
    Jun 17 17:32:35 amd kernel: [   80.473408] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001100/00006000
    Jun 17 17:32:35 amd kernel: [   80.473410] pcieport 0000:00:03.1:    [ 8] RELAY_NUM Rollover
    Jun 17 17:32:35 amd kernel: [   80.473412] pcieport 0000:00:03.1:    [12] Replay Timer Timeout
Jun 17 17:37:23 amd kernel: [  369.338406] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
    Jun 17 17:37:23 amd kernel: [  369.338407] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001100/00006000
    Jun 17 17:37:23 amd kernel: [  369.338408] pcieport 0000:00:03.1:    [ 8] RELAY_NUM Rollover
    Jun 17 17:37:23 amd kernel: [  369.338409] pcieport 0000:00:03.1:    [12] Replay Timer Timeout
    Jun 17 17:37:23 amd kernel: [  369.338413] pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
    Jun 17 17:37:23 amd kernel: [  369.338432] pcieport 0000:00:03.1: can't find device of ID0000
    Jun 17 17:37:23 amd kernel: [  369.338433] pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
    Jun 17 17:37:23 amd kernel: [  369.338465] pcieport 0000:00:03.1: can't find device of ID0000
    Jun 17 17:37:23 amd kernel: [  369.338466] pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
    Jun 17 17:37:23 amd kernel: [  369.338501] pcieport 0000:00:03.1: can't find device of ID0000
    Jun 17 17:37:23 amd kernel: [  369.338502] pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
    Jun 17 17:37:23 amd kernel: [  369.338529] pcieport 0000:00:03.1: can't find device of ID0000
    Jun 17 17:37:23 amd kernel: [  369.338530] pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
    Jun 17 17:37:23 amd kernel: [  369.338566] pcieport 0000:00:03.1: can't find device of ID0000
    Jun 17 17:37:23 amd kernel: [  369.338568] pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
    Jun 17 17:37:23 amd kernel: [  369.338607] pcieport 0000:00:03.1: can't find device of ID0000
    Jun 17 17:37:23 amd kernel: [  369.338608] pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
    Jun 17 17:37:23 amd kernel: [  369.338640] pcieport 0000:00:03.1: can't find device of ID0000
    Jun 17 17:37:23 amd kernel: [  369.342310] NVRM: GPU at PCI:0000:0b:00: GPU-ee0a5b68-20ee-5367-7e79-eb82b3751291
    Jun 17 17:37:23 amd kernel: [  369.342313] NVRM: Xid (PCI:0000:0b:00): 62, 1927(16e4) 8501250d ffffffa4
    Jun 17 17:37:47 amd kernel: [  396.796535] NVRM: RmInitAdapter failed! (0x53:0xffff:1908)
    Jun 17 17:37:47 amd kernel: [  396.796598] NVRM: rm_init_adapter failed for device bearing minor number 0
    Jun 17 17:37:55 amd kernel: [  404.160090] NVRM: GPU at PCI:0000:0b:00: GPU-ee0a5b68-20ee-5367-7e79-eb82b3751291
    Jun 17 17:37:55 amd kernel: [  404.160095] NVRM: Xid (PCI:0000:0b:00): 79, GPU has fallen off the bus.
    Jun 17 17:37:55 amd kernel: [  404.160097] NVRM: GPU at 0000:0b:00.0 has fallen off the bus.
    Jun 17 17:37:55 amd kernel: [  404.160690] NVRM: A GPU crash dump has been created. If possible, please run
    Jun 17 17:37:55 amd kernel: [  404.160690] NVRM: nvidia-bug-report.sh as root to collect this data before
    Jun 17 17:37:55 amd kernel: [  404.160690] NVRM: the NVIDIA kernel module is unloaded.
    Jun 17 17:37:59 amd kernel: [  408.405746] NVRM: RmInitAdapter failed! (0x12:0x45:1871)
    Jun 17 17:37:59 amd kernel: [  408.405825] NVRM: rm_init_adapter failed for device bearing minor number 0
    Jun 17 17:37:59 amd kernel: [  408.406310] NVRM: request_irq() failed (-22)
    Jun 17 17:38:00 amd kernel: [  409.034162] NVRM: request_irq() failed (-22)
    Jun 17 17:38:00 amd kernel: [  409.034270] NVRM: request_irq() failed (-22)
    Jun 17 17:38:03 amd kernel: [  412.041377] NVRM: request_irq() failed (-22)
    Jun 17 17:38:03 amd kernel: [  412.041653] NVRM: request_irq() failed (-22)
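
For readers who want to interpret the "error status/mask" words in the log above, here is a minimal sketch in Python. It assumes the standard bit layout of the PCIe AER Correctable Error Status register; the function name decode_correctable is made up for illustration and is not part of any kernel tool.

```python
# Bit positions of the PCIe AER Correctable Error Status register.
# Bits 8 and 12 correspond to the "[ 8]" and "[12]" lines in the log above
# (the log spells bit 8 "RELAY_NUM"; the specification name is REPLAY_NUM).
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable(status_mask):
    """Split a 'status/mask' pair such as '00001100/00006000' into bit names."""
    status_hex, mask_hex = status_mask.split("/")
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)

    def names(value):
        # Collect the name of every known bit that is set in value.
        return [name for bit, name in sorted(CORRECTABLE_BITS.items())
                if value & (1 << bit)]

    # "reported" = errors flagged in the status word,
    # "masked"   = error types the port is configured not to report.
    return {"reported": names(status), "masked": names(mask)}


if __name__ == "__main__":
    for pair in ("00001100/00006000", "00001000/00006000"):
        print(pair, decode_correctable(pair))
```

Both errors are replay-related, meaning the root port at 0000:00:03.1 had to resend data over the link, which fits a link-level signalling problem rather than a driver bug.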

What could this mean? Is it an ASUS mainboard / UEFI issue? Are these two cards incompatible (is the 770 too old)? Or could this be an NVIDIA driver or Linux kernel issue? I’m not really sure where to look for help.

Any comments are highly appreciated!
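
For anyone triaging a similar log, a rough sketch for tallying the NVRM Xid events per device may help show which card is failing. The regular expression is an assumption based only on the Xid lines in the log above, and tally_xids is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines such as:
#   NVRM: Xid (PCI:0000:0b:00): 79, GPU has fallen off the bus.
# The pattern is inferred from the log in this thread, not from a format spec.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def tally_xids(lines):
    """Count (pci_address, xid_code) pairs found in kernel log lines."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[   39.855974] NVRM: Xid (PCI:0000:0b:00): 62, 1927(16e4) 00000000 00000000",
        "[  369.342313] NVRM: Xid (PCI:0000:0b:00): 62, 1927(16e4) 8501250d ffffffa4",
        "[  404.160095] NVRM: Xid (PCI:0000:0b:00): 79, GPU has fallen off the bus.",
        "[  404.160097] NVRM: GPU at 0000:0b:00.0 has fallen off the bus.",
    ]
    for (device, xid), count in sorted(tally_xids(sample).items()):
        print(f"{device} Xid {xid}: {count}")
```

In this log every Xid comes from PCI:0000:0b:00, while nvidia-modeset allocated the other GPU at PCI:0000:41:00.0 without complaint.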

It’s a general problem with Kepler GPUs on the X399 chipset; some vendors have BIOS updates for it. If yours doesn’t, see this:
https://devtalk.nvidia.com/default/topic/1024725/?comment=5214203
In short, turn off AER with the pci=noaer kernel parameter and set PCIe to Gen2 in the BIOS.
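
On Ubuntu-family systems the parameter is usually added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub and a reboot. To check afterwards that both changes took effect, something like the sketch below could be used. It assumes the kernel exposes current_link_speed in sysfs and that the card is still at 0000:0b:00.0 as in the log above; the helper names are hypothetical.

```python
import re
from pathlib import Path


def has_pci_option(cmdline, option="noaer"):
    """True if the kernel command line carries pci=<...,option,...>."""
    for token in cmdline.split():
        if token.startswith("pci="):
            # pci= accepts a comma-separated list, e.g. pci=nomsi,noaer
            if option in token[len("pci="):].split(","):
                return True
    return False


def link_speed_gts(text):
    """Parse sysfs text like '5 GT/s' or '5.0 GT/s PCIe' into a float."""
    match = re.match(r"\s*([0-9.]+)\s*GT/s", text)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    print("pci=noaer active:",
          has_pci_option(Path("/proc/cmdline").read_text()))
    # Address taken from the NVRM lines in the log; adjust for your system.
    speed_file = Path("/sys/bus/pci/devices/0000:0b:00.0/current_link_speed")
    if speed_file.exists():
        # Gen2 corresponds to 5 GT/s, Gen3 to 8 GT/s.
        print("link speed (GT/s):", link_speed_gts(speed_file.read_text()))
```

Note that GPUs often drop to a lower link speed when idle, so a reading of 2.5 GT/s does not by itself mean the BIOS setting was ignored.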

After selecting GEN2 in UEFI and adding the kernel parameter, my friend can now render with both cards, thanks!