Ubuntu 18.04 with 4 RTX 2080 Ti boot issue & freeze & CUDA errors

Dear all,
For two and a half months now I have been having issues with my RTX 2080 Ti cards (from Palit) under Ubuntu 18.04. I would really appreciate your help as soon as possible.
Here are the details of my configuration:

  • CPU : Intel 7920X
  • MB : ASUS SAGE X299
  • RAM : 64 GB
  • HDD : SSD and Disk
  • Cooling : Liquid Cooling

Here is what I experienced:

  • With driver 410.78: Ubuntu 18.04 booted fine for 3 or 4 boots, then would no longer boot with 4 cards. Booting was still possible with only 3 GPUs connected to power, but not with 4.
  • With driver 410.93: same as 410.78.
  • With driver 415: Ubuntu 18.04 booted fine for 3 or 4 boots, then would no longer boot with 4 cards, and not with only 3 GPUs connected to power either.
  • Ubuntu freezes during AI computations.
  • CUDA errors occur during computations (not all of them).

At this point, I can no longer boot my system.
Thank you for your quick help.

Try to ssh into the box from another system.
Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

I was not able to connect to my DevBox over ssh. Please find attached the output of nvidia-bug-report.sh, generated from a "Try Ubuntu" live session.
I don’t know if it will help you.

nvidia-bug-report.log.gz (60.3 KB)

Hello,
could you tell me whether the file I attached yesterday is OK for you? Did you find anything that could explain my issues?
Thanks in advance

Nothing, besides that there are 4 GPUs in the system.
To get your original system up and running, press ‘e’ in the grub menu on boot to edit the kernel command line and append

systemd.unit=multi-user.target

to it. The system should then boot into command line, skipping X. You can then also run

systemctl disable display-manager

to disable X by default.
Then please use gpu-burn to test your gpus. Please create and provide an nvidia-bug-report.log afterwards.
If you have internet connection, you can use pastebinit to upload it from console.

  • install pastebinit (sudo apt install pastebinit)
  • unzip logfile (gunzip nvidia-bug-report.log.gz)
  • upload logfile (pastebinit -i nvidia-bug-report.log)
  • note down and post the url you’re given

Please also provide a log right after boot, maybe there are already errors visible.

Unfortunately, I am not able to do that. My computer freezes a few seconds after rebooting with systemd.unit=multi-user.target.
I get the following message:
[   26.366014] NVRM: Xid (PCI:0000:67:00): 38, 0001 0000902d 00000000 00000000 00000000
Any advice?
Thanks in advance

That sounds like the (primary) card can’t even keep up a VGA console anymore, which points to a hardware failure. You can only test each GPU one by one, as the single card in the box, using gpu-burn, and then RMA any defective one.

Yes, but of course nothing is simple: with water cooling, it will be difficult to test card by card.
A week ago I sent my devbox to the store, thinking it was a defective card.
But under Windows, everything was fine!
Right after that I updated the drivers, and it worked for 3 or 4 boots, about a week.
It really seems to be a driver issue.

Unlikely. Something that works, then degrades and stops working is not a driver issue but a hardware one.
Try installing Windows and stress-testing the GPUs there.

With the power unplugged from 3 of the 4 cards, I am able to start my devbox.
Please find the nvidia-bug-report file attached.
Could you please tell me as soon as possible whether it is a driver or a hardware issue?
Thanks in advance
nvidia-bug-report.log.gz (1.95 MB)

I’d say it’s a hardware failure: 3 cards seem to be fine, one is broken. On Feb 3rd, you ran into this:

Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189659] NVRM: GPU at PCI:0000:67:00: GPU-c5069979-608a-2f17-f773-1865a9c092c6
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189665] NVRM: GPU Board Serial Number: 
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189671] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 0): Out Of Range Register
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189687] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 0, SM 0): Multiple Warp Errors
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189700] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x504730=0xd000d 0x504734=0x4 0x504728=0x4c1eb72 0x50472c=0x174
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189812] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 1): Out Of Range Register
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189821] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 0, SM 1): Multiple Warp Errors
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189830] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x5047b0=0x1000d 0x5047b4=0x24 0x5047a8=0x4c1eb72 0x5047ac=0x174
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189959] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 1, SM 0): Out Of Range Register
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189974] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 1, SM 0): Multiple Warp Errors
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.189985] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x504f30=0xc000d 0x504f34=0x24 0x504f28=0x4c1eb72 0x504f2c=0x174
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190090] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 1, SM 1): Out Of Range Register
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190101] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 1, SM 1): Multiple Warp Errors
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190109] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x504fb0=0x7000d 0x504fb4=0x24 0x504fa8=0x4c1eb72 0x504fac=0x174
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190233] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 2, SM 0): Out Of Range Register
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190245] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 2, SM 0): Multiple Warp Errors
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190256] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x505730=0xc000d 0x505734=0x24 0x505728=0x4c1eb72 0x50572c=0x174
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190360] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 2, SM 1): Out Of Range Register
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190371] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 2, SM 1): Multiple Warp Errors
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190390] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x5057b0=0xd 0x5057b4=0x24 0x5057a8=0x4c1eb72 0x5057ac=0x174
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190524] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 3, SM 0): Out Of Range Register
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190536] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 3, SM 0): Multiple Warp Errors
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190545] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x505f30=0xe000d 0x505f34=0x24 0x505f28=0x4c1eb72 0x505f2c=0x174
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190651] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 3, SM 1): Out Of Range Register
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190660] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 3, SM 1): Multiple Warp Errors
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190668] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x505fb0=0x3000d 0x505fb4=0x24 0x505fa8=0x4c1eb72 0x505fac=0x174
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190781] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 4, SM 0): Out Of Range Register
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190791] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 4, SM 0): Multiple Warp Errors
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190801] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x506730=0xe000d 0x506734=0x24 0x506728=0x4c1eb72 0x50672c=0x174
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190895] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 4, SM 1): Out Of Range Register
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190907] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 0, TPC 4, SM 1): Multiple Warp Errors
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.190918] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x5067b0=0x3000d 0x5067b4=0x24 0x5067a8=0x4c1eb72 0x5067ac=0x174
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.191042] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0, SM 0): Out Of Range Register
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.191055] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 1, TPC 0, SM 0): Multiple Warp Errors
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.191066] NVRM: Xid (PCI:0000:67:00): 13, Graphics Exception: ESR 0x50c730=0xe000d 0x50c734=0x24 0x50c728=0x4c1eb72 0x50c72c=0x174
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.191162] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0, SM 1): Out Of Range Register
Feb  3 11:34:41 DK-Analytics-Serveur1 kernel: [ 1820.191173] NVRM: Xid (PCI:0000:67:00): 13, Graphics SM Global Exception on (GPC 1, TPC 0, SM 1): Multiple Warp Errors

After that, the same gpu would error out 20 seconds after every boot:

Feb  3 20:05:49 DK-Analytics-Serveur1 kernel: [    9.969666] NVRM: GPU at PCI:0000:67:00: GPU-c5069979-608a-2f17-f773-1865a9c092c6
Feb  3 20:05:49 DK-Analytics-Serveur1 kernel: [    9.969668] NVRM: GPU Board Serial Number: 
Feb  3 20:05:49 DK-Analytics-Serveur1 kernel: [    9.969670] NVRM: Xid (PCI:0000:67:00): 38, 0001 0000902d 00000000 00000000 00000000
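
A quick way to see which card is logging errors is to tally the Xid lines per PCI address. The sketch below is only an illustration: the function name xid_summary is made up here, and it assumes the log lines have the "NVRM: Xid (PCI:...): <code>," shape shown in the excerpts above.

```sh
#!/bin/sh
# xid_summary (hypothetical helper): count NVRM Xid events per PCI address
# and Xid code. Reads the files given as arguments, or stdin if none.
xid_summary() {
    # Reduce every Xid line to "<pci-address> <xid-code>", drop all other
    # lines, then count identical pairs.
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9]*\),.*/\1 \2/p' "$@" |
        sort | uniq -c |
        awk '{ printf "%s xid=%s count=%s\n", $2, $3, $1 }'
}

# Possible usage (not executed here):
#   sudo dmesg | xid_summary
#   gunzip nvidia-bug-report.log.gz && xid_summary nvidia-bug-report.log
```

Fed the two excerpts above, this would list Xid 13 and Xid 38, both against 0000:67:00, with no entries for the other three cards.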

You have 4 GPUs, at PCI bus 19, 1a, 67 and 68.
You are now using the one at 68. The one at 67 is erroring out.
This is not necessarily the GPU; it can also be a bad PCIe connection. If possible, remove and reseat the cards in their sockets.
Please use this as /etc/xorg.conf:

Section "Device"
    Identifier     "nvidia"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    Option         "ProbeAllGpus" "false"
    BusID          "PCI:104:0:0"
    Option         "AllowEmptyInitialConfiguration"
EndSection

This just uses the gpu you’re currently using for the Xserver. Then reconnect all gpus and check if you can boot.
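
The BusID above is the decimal form of the bus number: the kernel log and lspci print PCI addresses in hexadecimal (68), while xorg.conf expects decimal (104). Here is a small sketch of the conversion, in case another card needs to be selected. The helper name pci_to_xorg_busid is hypothetical, and the function number is assumed to be 0 when the address does not include one (as in the NVRM messages).

```sh
#!/bin/sh
# pci_to_xorg_busid (hypothetical helper): turn a hexadecimal PCI address
# such as "0000:68:00.0", "68:00.0" or "PCI:0000:67:00" into the decimal
# "PCI:bus:device:function" string used by the BusID option in xorg.conf.
pci_to_xorg_busid() {
    addr=${1#PCI:}                  # tolerate the "PCI:" prefix of NVRM messages
    case $addr in
        *:*:*) addr=${addr#*:} ;;   # drop the leading domain ("0000:") if present
    esac
    bus=${addr%%:*}
    rest=${addr#*:}
    dev=${rest%%.*}
    case $rest in
        *.*) fn=${rest#*.} ;;
        *)   fn=0 ;;                # assumption: function 0 when not given
    esac
    # printf interprets the 0x-prefixed arguments as hexadecimal numbers.
    printf 'PCI:%d:%d:%d\n' "0x$bus" "0x$dev" "0x$fn"
}

pci_to_xorg_busid 0000:68:00.0      # prints PCI:104:0:0
```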

Hello,
the failed hardware has been replaced and everything was great until tonight.
Could you help me read the nvidia-bug-report.log.gz file and give me any advice?
Thanks in advance
nvidia-bug-report.log.gz (1.83 MB)

I couldn’t find any errors except:

[    7.808858] GPU does not have the necessary power cables connected.
[    7.809332] NVRM: RmInitAdapter failed! (0x25:0x1c:1076)

If you didn’t remove it on purpose, a power cable came loose.
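
To check a kernel log or an unpacked nvidia-bug-report.log for these two messages without reading it end to end, a simple filter is enough. This is a sketch only: the function name nvrm_init_failures is made up for illustration, and the patterns are just the two message texts quoted above.

```sh
#!/bin/sh
# nvrm_init_failures (hypothetical helper): print the lines that report a
# missing GPU power connection or a failed adapter initialisation.
# Reads the files given as arguments, or stdin if none.
nvrm_init_failures() {
    grep -E 'does not have the necessary power cables|RmInitAdapter failed' "$@"
}

# Possible usage (not executed here):
#   sudo dmesg | nvrm_init_failures
#   nvrm_init_failures nvidia-bug-report.log
```

No output (and a non-zero exit status from grep) means neither message was found in the input.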

I had to remove one cable to boot.
Thank you for your feedback

Do you have any advice on how to identify the cause if it is not the GPU?
Thanks

What symptoms did you observe?

Also, try to extract the Xorg logs from the journal:
sudo journalctl -b -1 --no-pager _COMM=gdm-x-session >xorg.log
-1 refers to the previous boot, so if the last failed boot was 3 boots back, replace it with -3.

I see the same behavior as when the GPU had failed.
I get a black screen just before the login screen, with the following message:
/dev/sdb2: clean, xxxx/yyy files, zzzz/ppppp blocks

Thanks in advance

Could it be an issue with the NVIDIA drivers?