GPU has fallen off the bus

Hello,

I am not sure if I am in the right forum.

We have a system with 4x Titan RTX cards, and we got the following error from one of the GPUs after running some tests. We have swapped the GPU into a known-good PCIe slot, but the error keeps following the GPU. We are running Ubuntu 18.04 LTS.

$ dmesg | grep GPU
[ 8.074268] [drm] [nvidia-drm] [GPU ID 0x00001800] Loading driver
[ 8.074336] [drm] [nvidia-drm] [GPU ID 0x00003b00] Loading driver
[ 8.074405] [drm] [nvidia-drm] [GPU ID 0x00008600] Loading driver
[ 8.074470] [drm] [nvidia-drm] [GPU ID 0x0000af00] Loading driver
[46082.164201] NVRM: GPU at PCI:0000:86:00: GPU-423058d0-0be5-89b4-cd0a-c9f03d793986
[46082.164206] NVRM: GPU Board Serial Number: 0422515065237
[46082.164209] NVRM: Xid (PCI:0000:86:00): 79, GPU has fallen off the bus.
[46082.164247] NVRM: GPU at 00000000:86:00.0 has fallen off the bus.
[46082.164248] NVRM: GPU is on Board 0422515065237.
[46083.477675] NVRM: A GPU crash dump has been created. If possible, please run

$ nvidia-smi
Unable to determine the device handle for GPU 0000:86:00.0: GPU is lost.
Reboot the system to recover this GPU
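
For anyone who wants to pull the Xid events out of a longer kernel log rather than reading it by eye, here is a minimal sketch. It assumes the log lines look like the `NVRM: Xid (PCI:...)` line above; the function name `parse_xid_events` and the `SAMPLE` text are only illustrative, and other driver versions may format these lines differently.

```python
import re

# Matches lines such as:
# [46082.164209] NVRM: Xid (PCI:0000:86:00): 79, GPU has fallen off the bus.
# Assumption: the timestamp/bus/Xid layout is the one shown in the dmesg output above.
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+),\s*(?P<msg>.*)"
)


def parse_xid_events(log_text):
    """Return one dict per Xid line found in log_text (hypothetical helper)."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append({
                "timestamp": float(match.group("ts")),
                "pci_bus": match.group("bus"),
                "xid": int(match.group("xid")),
                "message": match.group("msg").strip(),
            })
    return events


# Sample lines copied from the dmesg output in this post.
SAMPLE = """\
[46082.164206] NVRM: GPU Board Serial Number: 0422515065237
[46082.164209] NVRM: Xid (PCI:0000:86:00): 79, GPU has fallen off the bus.
[46082.164247] NVRM: GPU at 00000000:86:00.0 has fallen off the bus.
"""

if __name__ == "__main__":
    for event in parse_xid_events(SAMPLE):
        print(event)
```

In practice the text passed in would be the saved output of `dmesg` instead of `SAMPLE`. Only lines containing an Xid code are returned, so the serial-number line and the second "fallen off the bus" line are skipped.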

Here is some other relevant info:

root@coconut:~# lspci | grep -i nvidia
18:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
18:00.1 Audio device: NVIDIA Corporation GM200 High Definition Audio (rev a1)
3b:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
3b:00.1 Audio device: NVIDIA Corporation GM200 High Definition Audio (rev a1)
86:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
86:00.1 Audio device: NVIDIA Corporation GM200 High Definition Audio (rev a1)
af:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
af:00.1 Audio device: NVIDIA Corporation GM200 High Definition Audio (rev a1)

root@coconut:~# nvidia-smi
Fri Sep 27 12:49:44 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:18:00.0 Off |                  N/A |
| 22%   34C    P8    15W / 250W |     11MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 22%   37C    P8    15W / 250W |     11MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 00000000:86:00.0 Off |                  N/A |
| 22%   36C    P8    15W / 250W |     11MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 00000000:AF:00.0 Off |                  N/A |
| 22%   35C    P8    14W / 250W |     11MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

root@coconut:~# nvidia-smi --query-gpu=index,name,uuid,serial --format=csv
index, name, uuid, serial
0, GeForce GTX TITAN X, GPU-4f8191fb-db04-e688-83a8-18015858ab82, 0422515065014
1, GeForce GTX TITAN X, GPU-daf248d1-a87f-ebd2-52f4-0054da4d4204, 0422915020942
2, GeForce GTX TITAN X, GPU-423058d0-0be5-89b4-cd0a-c9f03d793986, 0422515065237
3, GeForce GTX TITAN X, GPU-599c60fc-ea54-1840-a724-f843cfe32054, 0422915012801
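
To tie the board serial number reported in the kernel log back to an nvidia-smi index, something like the following could be used. This is only a sketch: it assumes the CSV layout shown above (header row, comma plus space as separator), and `find_gpu_by_serial` is a made-up helper name, not part of any NVIDIA tool.

```python
import csv
import io

# CSV text copied from the nvidia-smi query output in this post.
CSV_OUTPUT = """\
index, name, uuid, serial
0, GeForce GTX TITAN X, GPU-4f8191fb-db04-e688-83a8-18015858ab82, 0422515065014
1, GeForce GTX TITAN X, GPU-daf248d1-a87f-ebd2-52f4-0054da4d4204, 0422915020942
2, GeForce GTX TITAN X, GPU-423058d0-0be5-89b4-cd0a-c9f03d793986, 0422515065237
3, GeForce GTX TITAN X, GPU-599c60fc-ea54-1840-a724-f843cfe32054, 0422915012801
"""


def find_gpu_by_serial(csv_text, serial):
    """Return the CSV row (as a dict) whose serial matches, or None (hypothetical helper)."""
    # skipinitialspace drops the blank that follows each comma in nvidia-smi's CSV.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if row["serial"].strip() == serial:
            return row
    return None


if __name__ == "__main__":
    # Serial taken from the "GPU Board Serial Number" line in dmesg.
    print(find_gpu_by_serial(CSV_OUTPUT, "0422515065237"))
```

With the data above, the serial from the dmesg output resolves to the row with index 2, which is the card at bus 86:00.0.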

root@coconut:~/sean# dmidecode --type 0
# dmidecode 3.1
Getting SMBIOS data from sysfs.
SMBIOS 3.2.1 present.
# SMBIOS implementations newer than version 3.1.1 are not
# fully supported by this version of dmidecode.

Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
    Vendor: American Megatrends Inc.
    Version: 3.0a
    Release Date: 12/21/2018
    Address: 0xF0000
    Runtime Size: 64 kB
    ROM Size: 32 MB
    Characteristics:
        PCI is supported
        BIOS is upgradeable
        BIOS shadowing is allowed
        Boot from CD is supported
        Selectable boot is supported
        BIOS ROM is socketed
        EDD is supported
        5.25"/1.2 MB floppy services are supported (int 13h)
        3.5"/720 kB floppy services are supported (int 13h)
        3.5"/2.88 MB floppy services are supported (int 13h)
        Print screen service is supported (int 5h)
        Serial services are supported (int 14h)
        Printer services are supported (int 17h)
        ACPI is supported
        USB legacy is supported
        BIOS boot specification is supported
        Targeted content distribution is supported
        UEFI is supported
    BIOS Revision: 5.14

Any help will be appreciated.