Ubuntu 18.04 completely freezes after a few minutes of being booted

Hello,

I have a problem with Nvidia drivers on my ZBook Studio G5 x360 laptop running an Nvidia Quadro P1000 card.

I dual boot Windows and Ubuntu 18.04. Windows drivers work perfectly, but:

After installing the Nvidia drivers, I first got stuck in a login loop, but I solved that by adding the “nomodeset” option to the Linux boot options in GRUB.

Now, every time I install ANY Nvidia driver (tried 435, 430, 390) by any means (through Software & Updates -> Additional drivers, or manually by adding the graphics PPA repository), the system completely freezes after a few minutes of being booted, requiring a hard restart. This happens consistently after every boot.

The way to counteract this is to boot up and quickly go to Software & Updates -> Additional drivers and revert back to default Nouveau graphics drivers. When reverted, the system does not freeze anymore.

I am not sure what log files I could provide, please tell me if I can provide anything to help me solve this.

Thank you!

Please reinstall the nvidia driver, reboot, then run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

Not OP, but I'm having the same issue with the NVIDIA drivers (again, any of them, including 435, 430, 390). I’m running Ubuntu 18.04 on a server with 2 GTX 1080 Ti GPUs used as a deep learning dev box. Purging with

sudo apt purge 'nvidia*'

and rebooting solves the problem. I’m posting my output

nvidia-bug-report.sh

here. Thank you in advance for any and all help you can offer!

nvidia-bug-report.log.gz (2.7 MB)

Without the driver being loaded finding the error is not possible.

Thanks for your response! I installed nvidia-430, then ran nvidia-bug-report.sh to generate the attached output. I purged the driver afterwards to keep the server up, but it was loaded at the time the bug report script was run.

The driver is only loading when nvidia-smi is run at the end of the log. No errors visible. Please reboot with the driver in place and create a new nvidia-bug-report.log afterwards.
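
For anyone following along, a quick way to confirm the kernel module is actually loaded before generating the report. This is only a minimal sketch: the helper name `nvidia_loaded` is made up for illustration, and it assumes nothing more than that loaded modules are listed in /proc/modules (the same data lsmod prints).

```shell
# Succeeds if a module named exactly "nvidia" is in the module list.
# $1 = module list to read (defaults to /proc/modules).
nvidia_loaded() {
    # Each line starts with the module name followed by a space, so
    # nvidia_modeset, nvidia_uvm and nvidia_drm do not match on their own.
    grep -q '^nvidia ' "${1:-/proc/modules}"
}

# Typical use after rebooting with the driver installed:
#   nvidia_loaded && sudo nvidia-bug-report.sh
```

If the check fails right after boot, the log will most likely show the same thing as before: nothing from the driver until nvidia-smi loads it at the very end.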

We have been having similar issues on a number of preliminary systems that we are building for a staged program.

Hardware:
ADLINK COM Express Type 6 w/ 9th Gen Intel Xeon E-2276ME and CM246 Chipset
ADLINK COM Express BASE-6 Carrier Board (ATX form factor carrier for module)
GPU: RTX 4000
RAM: 2x 8GB DDR4 2400MHz
OS: Ubuntu 16.04 AND/OR 18.04

We have built 4 identical systems and, whether we are running Ubuntu 16.04 or 18.04, we can consistently get the system to lock up. It is not load related (we have run Unigine Valley on multiple systems for days on end w/ zero issues). But the systems lock up after simple basic actions, like opening Chrome or Firefox and scrolling around.
Note: Interestingly, we loaded Windows 10 Pro onto one of the systems and, after several days, have had zero issues.

Here’s a snippet of syslog after the freeze (we are able to ssh into the systems after they lock up):

Oct 24 14:58:39 pimento2-ctrl03 systemd[1]: Started ACPI event daemon.
Oct 24 14:58:39 pimento2-ctrl03 systemd[1]: Started CUPS Scheduler.
Oct 24 15:17:01 pimento2-ctrl03 CRON[7244]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Oct 24 16:04:38 pimento2-ctrl03 kernel: [ 9127.647696] NVRM: GPU at PCI:0000:01:00: GPU-36bcf1ad-9320-692d-451f-506bd6434c2a
Oct 24 16:04:38 pimento2-ctrl03 kernel: [ 9127.647783] NVRM: GPU Board Serial Number: ���������������������������������������������������������������������������������������0{žw��ŀ
Oct 24 16:04:38 pimento2-ctrl03 kernel: [ 9127.647786] NVRM: Xid (PCI:0000:01:00): 79, pid=1003, GPU has fallen off the bus.
Oct 24 16:04:38 pimento2-ctrl03 kernel: [ 9127.647789] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Oct 24 16:04:38 pimento2-ctrl03 kernel: [ 9127.647831] NVRM: GPU 0000:01:00.0: GPU is on Board ���������������������������������������������������������������������������������������.
Oct 24 16:04:38 pimento2-ctrl03 kernel: [ 9127.647851] NVRM: A GPU crash dump has been created. If possible, please run
Oct 24 16:04:38 pimento2-ctrl03 kernel: [ 9127.647851] NVRM: nvidia-bug-report.sh as root to collect this data before
Oct 24 16:04:38 pimento2-ctrl03 kernel: [ 9127.647851] NVRM: the NVIDIA kernel module is unloaded.
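
To check other machines for the same signature, something like the following can pull just the Xid events out of a log. It is a rough sketch: `xid_events` is a hypothetical helper name, and the pattern assumes the `NVRM: Xid (PCI:...): <code>,` layout seen in the snippet above.

```shell
# Print "Xid <code> on <PCI address>" for every NVRM Xid line in a log file.
# $1 = log file to scan (defaults to /var/log/syslog).
xid_events() {
    # Example of a matching line:
    #   NVRM: Xid (PCI:0000:01:00): 79, pid=1003, GPU has fallen off the bus.
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9]*\),.*/Xid \2 on \1/p' \
        "${1:-/var/log/syslog}"
}

# Usage: xid_events /var/log/syslog
```

Since the systems stay reachable over ssh after the lockup, this can be run remotely right after a freeze.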

Once I figure out how to add attachments, I’ll append 2 recent dumps from one of the systems :)

nvidia-bug-report.log.gz (960 KB)
nvidia-bug-report.log.gz (971 KB)

Normally, Xid 79 would point towards overheating or an insufficient/broken PSU, but since this doesn’t happen under load, I’d rather suspect a BIOS power-management issue. Please check for a BIOS update. You could also try the kernel parameter
intel_idle.max_cstate=1
and check if that resolves the issue.
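
For reference, on Ubuntu a kernel parameter like this is usually added through /etc/default/grub. The sketch below assumes the existing default line is the stock `quiet splash`; `has_kernel_param` is a hypothetical helper for checking after the reboot that the kernel really received the parameter.

```shell
# /etc/default/grub (excerpt). "quiet splash" is assumed to be the existing value:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=1"
# Apply the change and reboot:
#   sudo update-grub && sudo reboot

# Succeeds if the given parameter is on the kernel command line.
# $1 = parameter, $2 = command-line file (defaults to /proc/cmdline).
has_kernel_param() {
    # Split the command line into one parameter per line, then match exactly.
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Usage: has_kernel_param intel_idle.max_cstate=1 && echo "parameter active"
```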

Thanks for the quick reply.

I totally forgot to mention that I had actually tried limiting/disabling cstates also :-(

  • Test 1: Disabled C-states in BIOS. Still able to reproduce the freeze.
  • Test 2: Added intel_idle.max_cstate=1. Still able to reproduce the freeze.

The logs that I had attached were from the system with cstates disabled in the BIOS.

Oh, also, we had a 1050 Ti in the office, so I installed that and was unable to reproduce the failure after 2 days of testing. Today, I’m swapping the 1050 Ti back in again to attempt to reproduce the failure.

Another team we are working with was using a 1080 Ti and didn’t see the issue (testing their algorithms for a half a day or so), but when they swapped in the RTX 4000, the system locked up in the first couple of minutes of use, before they were able to start testing their algorithms.

We do have ADLINK looking at the issue as well. In every instance there is a GPU crash, but so far only when we are using the RTX 4000 combined with Ubuntu 16.04 LTS or 18.04 LTS.

Then this rather sounds like a power problem: not the prolonged maximum wattage under load, but the peaks when clocking up, which can get quite high with RTX cards:
https://devtalk.nvidia.com/default/topic/1049249/linux/quadro-rtx-6000-causes-hpe-server-to-power-off-peaks-way-over-power-limit-/post/5325050/#5325050

Thanks generix…

My coworker was wondering about that, though we never actually monitored the power states before/during failure.
We were using Seasonic Focus Plus 550W Platinum (SSR-550PX) supplies in our builds. He recommended we try something a little beefier, so we replaced the supply in one system with a Seasonic PRIME 850W (Prime PX-850), but were still able to get the same freeze under the same ‘general use’ conditions. (Note: we are only using a single GPU)

But, let me take a look at capping the clock and/or power limits, per that thread, and give an update later today.
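
A rough sketch of what capping power and clocks could look like with nvidia-smi. The 125 W limit and the 300-1200 MHz window are placeholder values, not recommendations, and whether this particular card allows changing them is an assumption. `run` and `cap_gpu` are hypothetical helpers that only print the commands unless DRY_RUN=0 is set.

```shell
# Print each command instead of executing it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# $1 = GPU index, $2 = power limit in watts, $3/$4 = min/max GPU clock in MHz.
cap_gpu() {
    gpu=$1
    watts=$2
    min_mhz=$3
    max_mhz=$4
    run sudo nvidia-smi -i "$gpu" -pm 1                     # enable persistence mode
    run sudo nvidia-smi -i "$gpu" -pl "$watts"              # set the power limit
    run sudo nvidia-smi -i "$gpu" -lgc "$min_mhz,$max_mhz"  # lock the GPU clock range
}

# Dry run (prints the three commands):  cap_gpu 0 125 300 1200
# For real:                             DRY_RUN=0 cap_gpu 0 125 300 1200
# To watch power states while reproducing the freeze:
#   nvidia-smi --query-gpu=timestamp,pstate,power.draw,clocks.sm --format=csv -l 1
```

Logging the query output to a file over ssh should show which performance state the GPU was in just before it drops off the bus.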

I have the same problem on integrated GF8300 graphics. Ubuntu 18 with kernel 5.0 freezes when the Nvidia drivers are used on the GF8300: after X loads, any change on the screen freezes the system. The motherboard is an ASUS M3N78 Pro with the latest BIOS.
nvidia-bug-report.log.gz (95.3 KB)

I had the same problem, but after updating the BIOS the freezes have disappeared so far.

I am getting the same error.
I just got an HP ZBook 360 with an Nvidia Quadro P2000 and a 9th Gen Intel Core i9.
The screen freezes after some time and I cannot do anything.
Also, if I lock the laptop and leave it to go into sleep mode, it will not wake up. The only key that works is the keyboard backlight key: it can turn the backlight on or off, but no other keys work at all (the Caps Lock light does not respond).
I have installed the Nvidia drivers three times so far with no luck: the first time through Ubuntu’s Additional drivers, the second time by adding the graphics repository, and the third time by downloading the driver from Nvidia.
I have edited the GRUB configuration to include nomodeset.
I have also changed 2 files in /usr/share/X11/ to disable the touchscreen and the pen.
None of this works.
I do not know what else to try. Attached is the output from the script.
nvidia-bug-report.log (1.5 MB)

Please try using kernel parameter
intel_idle.max_cstate=1

I have the same issue on an HP 450 G7 with an NVIDIA MX250 GPU running Ubuntu 20.04. What I have found is this: the freeze occurs when the GPU changes clock states. When idle, the GPU drops to its lowest clock state, and after it stays there for about half a minute the system locks up. A temporary workaround is to open NVIDIA X Server Settings and, under PowerMizer, select Maximum Performance instead of Automatic/Adaptive. There is even a way to set this at startup, but the disadvantage is that you must log in for the command to run automatically; if you do not, the system locks up at the login screen within a few minutes or less, once the GPU reaches its lowest clock rate.

I have tried Ubuntu 18.04, which uses kernel 5.3 by default, and everything works fine. I have also tried Ubuntu 20.04 with kernel 5.3 and still saw no issues. So I concluded that the combination of kernel 5.4 and the NVIDIA drivers I tested (440 and 435) causes the system to lock up when idle. Under heavy use the GPU never reaches its lowest state and everything works fine, which explains why the lockup only occurs during non-demanding tasks.
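
The startup option mentioned above could be set up roughly like this. It is a sketch under a few assumptions: the GPU is `gpu:0`, the desktop honors autostart entries in ~/.config/autostart, and the file name and the `install_powermizer_autostart` helper are made up for illustration. `GpuPowerMizerMode=1` is the nvidia-settings value for preferring maximum performance.

```shell
# Write an autostart entry that switches PowerMizer to maximum performance
# at login. $1 = autostart directory (defaults to ~/.config/autostart).
install_powermizer_autostart() {
    dir=${1:-"$HOME/.config/autostart"}
    mkdir -p "$dir"
    # Quoted heredoc delimiter: the file content is written literally.
    cat > "$dir/nvidia-powermizer.desktop" <<'EOF'
[Desktop Entry]
Type=Application
Name=NVIDIA PowerMizer max performance
Exec=nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"
EOF
}

# Usage: install_powermizer_autostart
# One-off test from a terminal in the running X session:
#   nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"
```

As noted, this only takes effect after login, so it does not help with the lockup at the login screen.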