I’d say it’s a hardware failure: three cards seem to be fine and one is broken. On Feb 3rd, you ran into this:
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189659] NVRM: GPU at PCI:0000:67:00: GPU-c5069979-608a-2f17-f773-1865a9c092c6
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189665] NVRM: GPU Board Serial Number:
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189671] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 0): Out Of Range Register
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189687] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 0, SM 0): Multiple Warp Errors
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189700] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x504730=0xd000d 0x504734=0x4 0x504728=0x4c1eb72 0x50472c=0x174
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189812] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 1): Out Of Range Register
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189821] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 0, SM 1): Multiple Warp Errors
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189830] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x5047b0=0x1000d 0x5047b4=0x24 0x5047a8=0x4c1eb72 0x5047ac=0x174
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189959] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 1, SM 0): Out Of Range Register
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189974] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 1, SM 0): Multiple Warp Errors
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189985] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x504f30=0xc000d 0x504f34=0x24 0x504f28=0x4c1eb72 0x504f2c=0x174
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190090] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 1, SM 1): Out Of Range Register
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190101] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 1, SM 1): Multiple Warp Errors
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190109] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x504fb0=0x7000d 0x504fb4=0x24 0x504fa8=0x4c1eb72 0x504fac=0x174
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190233] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 2, SM 0): Out Of Range Register
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190245] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 2, SM 0): Multiple Warp Errors
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190256] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x505730=0xc000d 0x505734=0x24 0x505728=0x4c1eb72 0x50572c=0x174
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190360] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 2, SM 1): Out Of Range Register
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190371] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 2, SM 1): Multiple Warp Errors
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190390] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x5057b0=0xd 0x5057b4=0x24 0x5057a8=0x4c1eb72 0x5057ac=0x174
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190524] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 3, SM 0): Out Of Range Register
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190536] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 3, SM 0): Multiple Warp Errors
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190545] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x505f30=0xe000d 0x505f34=0x24 0x505f28=0x4c1eb72 0x505f2c=0x174
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190651] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 3, SM 1): Out Of Range Register
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190660] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 3, SM 1): Multiple Warp Errors
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190668] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x505fb0=0x3000d 0x505fb4=0x24 0x505fa8=0x4c1eb72 0x505fac=0x174
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190781] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 4, SM 0): Out Of Range Register
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190791] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 4, SM 0): Multiple Warp Errors
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190801] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x506730=0xe000d 0x506734=0x24 0x506728=0x4c1eb72 0x50672c=0x174
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190895] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 4, SM 1): Out Of Range Register
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190907] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 4, SM 1): Multiple Warp Errors
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190918] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x5067b0=0x3000d 0x5067b4=0x24 0x5067a8=0x4c1eb72 0x5067ac=0x174
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.191042] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0, SM 0): Out Of Range Register
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.191055] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 1, TPC 0, SM 0): Multiple Warp Errors
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.191066] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x50c730=0xe000d 0x50c734=0x24 0x50c728=0x4c1eb72 0x50c72c=0x174
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.191162] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0, SM 1): Out Of Range Register
Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.191173] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 1, TPC 0, SM 1): Multiple Warp Errors
After that, the same gpu would error out 20 seconds after every boot:
Feb 3 20:05:49 DK-Analytics-Serveur1 kernel: [ 9.969666] NVRM: GPU at PCI:0000:67:00: GPU-c5069979-608a-2f17-f773-1865a9c092c6
Feb 3 20:05:49 DK-Analytics-Serveur1 kernel: [ 9.969668] NVRM: GPU Board Serial Number:
Feb 3 20:05:49 DK-Analytics-Serveur1 kernel: [ 9.969670] NVRM: Xid (PCI:0000:67:00): 38, 0001 0000902d 00000000 00000000 00000000
You have 4 GPUs, at PCI bus numbers 19, 1a, 67 and 68 (hexadecimal).
You are currently using the one at 68. The one at 67 is erroring out.
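To double-check which card is producing the errors, something like the following rough sketch can tally the Xid codes per PCI address from the kernel log. The function name `tally_xids` and the regular expression are my own assumptions based on the line format shown above, not an NVIDIA tool; the sample lines are copied from your log.

```python
import re
from collections import Counter

# Matches NVRM Xid lines such as:
#   NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception ...
# The pattern is an assumption derived from the log excerpt in this thread.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")

def tally_xids(lines):
    """Count (pci_address, xid_code) pairs found in kernel log lines."""
    counts = Counter()
    for line in lines:
        m = XID_RE.search(line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts

# Three lines taken from the log above; in practice, read the syslog file instead.
sample = [
    "Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189671] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 0): Out Of Range Register",
    "Feb 3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189687] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 0, SM 0): Multiple Warp Errors",
    "Feb 3 20:05:49 DK-Analytics-Serveur1 kernel: [ 9.969670] NVRM: Xid (PCI:0000:67:00): 38, 0001 0000902d 00000000 00000000 00000000",
]

for (pci, xid), n in sorted(tally_xids(sample).items()):
    print(f"{pci} Xid {xid}: {n}")
```

If every Xid entry in the full log points at the same PCI address, that supports the idea that only that one card (or its slot) is affected.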
This is not necessarily the GPU itself; it can also be a bad PCIe connection. If possible, remove and reseat the cards in their slots.
Please use this as /etc/xorg.conf:
Section "Device"
    Identifier "nvidia"
    Driver "nvidia"
    VendorName "NVIDIA Corporation"
    Option "ProbeAllGpus" "false"
    BusID "PCI:104:0:0"
    Option "AllowEmptyInitialConfiguration"
EndSection
This makes the X server use only the GPU you’re currently using. Then reconnect all GPUs and check whether you can boot.
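Note that the BusID in xorg.conf is written in decimal, while the kernel log and lspci show the bus number in hexadecimal, which is why the card at 68 becomes "PCI:104:0:0". A small sketch of that conversion, in case you need it for another card: the helper name `xorg_bus_id` is hypothetical, the device/function part ".0" is an assumption (the log only shows bus and device), and the PCI domain is simply ignored.

```python
def xorg_bus_id(pci_address):
    """Convert an lspci-style address ("[domain:]bus:device.function", hex)
    to the decimal "PCI:bus:device:function" form used by BusID.
    The domain, if present, is dropped."""
    parts = pci_address.split(":")
    bus_hex, dev_func = parts[-2], parts[-1]
    dev_hex, func_hex = dev_func.split(".")
    return f"PCI:{int(bus_hex, 16)}:{int(dev_hex, 16)}:{int(func_hex, 16)}"

# The four bus numbers from this thread, assuming device 00 and function 0.
for addr in ("19:00.0", "1a:00.0", "67:00.0", "68:00.0"):
    print(addr, "->", xorg_bus_id(addr))
```

For 68:00.0 this gives PCI:104:0:0, matching the BusID in the config above.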