Driver issue on Ubuntu 19.10

Context: my laptop had been running perfectly for months. Yesterday, while I was playing a game on Steam, it suddenly rebooted, and afterwards it didn’t even recognize the NVIDIA drivers: I couldn’t see the graphics card drivers in the Software & Updates settings or in the output of ubuntu-drivers devices. I tried reinstalling the drivers, and it definitely wasn’t smooth. Now I do see them, but I still have issues.

Laptop: Dell XPS 9570, NVIDIA GTX 1050 Ti

Some commands for more info:

nvidia-bug-report.log (811.7 KB)
alex@alex:~$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:1c.0/0000:3b:00.0 ==
modalias : pci:v00008086d00002526sv00008086sd00000014bc02sc80i00
vendor   : Intel Corporation
model    : Wireless-AC 9260
manual_install: True
driver   : backport-iwlwifi-dkms - distro free

== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001C8Csv00001028sd0000087Cbc03sc02i00
vendor   : NVIDIA Corporation
model    : GP107M [GeForce GTX 1050 Ti Mobile]
driver   : nvidia-driver-435 - distro non-free
driver   : nvidia-driver-440 - third-party free recommended
driver   : nvidia-driver-390 - third-party free
driver   : xserver-xorg-video-nouveau - distro free builtin


alex@alex:~$ sudo apt update && sudo apt upgrade -y
[sudo] password for alex: 
Hit:1 http://security.ubuntu.com/ubuntu eoan-security InRelease
Hit:2 http://ca.archive.ubuntu.com/ubuntu eoan InRelease                                                               
Ign:3 http://dl.google.com/linux/chrome/deb stable InRelease                                                           
Hit:4 https://download.docker.com/linux/ubuntu disco InRelease                                                         
Hit:5 http://ca.archive.ubuntu.com/ubuntu eoan-updates InRelease                                                       
Hit:6 http://ca.archive.ubuntu.com/ubuntu eoan-backports InRelease                                              
Hit:7 http://archive.canonical.com/ubuntu eoan InRelease                                                               
Hit:8 http://ppa.launchpad.net/gns3/ppa/ubuntu eoan InRelease                                                          
Hit:9 http://dl.google.com/linux/chrome/deb stable Release                                                             
Hit:10 http://repo.steampowered.com/steam precise InRelease                                                            
Hit:11 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu eoan InRelease                                  
Hit:12 https://deb.torproject.org/torproject.org eoan InRelease
Reading package lists... Done
Building dependency tree       
Reading state information... Done
1 package can be upgraded. Run 'apt list --upgradable' to see it.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Calculating upgrade... Done
The following packages were automatically installed and are no longer required:
  libnvidia-common-435 linux-headers-5.3.0-40 linux-headers-5.3.0-40-generic linux-image-5.3.0-40-generic
  linux-modules-5.3.0-40-generic linux-modules-extra-5.3.0-40-generic
Use 'sudo apt autoremove' to remove them.
The following packages will be upgraded:
  python3-keyring
1 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
2 not fully installed or removed.
Need to get 28.6 kB of archives.
After this operation, 2,048 B of additional disk space will be used.
Get:1 http://ca.archive.ubuntu.com/ubuntu eoan-updates/main amd64 python3-keyring all 18.0.1-1ubuntu1 [28.6 kB]
Fetched 28.6 kB in 0s (131 kB/s)         
(Reading database ... 289354 files and directories currently installed.)
Preparing to unpack .../python3-keyring_18.0.1-1ubuntu1_all.deb ...
Unpacking python3-keyring (18.0.1-1ubuntu1) over (18.0.1-1) ...
Setting up nvidia-dkms-440 (440.64-0ubuntu0~0.19.10.2) ...
update-initramfs: deferring update (trigger activated)
INFO:Enable nvidia
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/lenovo_thinkpad
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/put_your_quirks_here
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/dell_latitude
Removing old nvidia-440.64 DKMS files...

------------------------------
Deleting module version: 440.64
completely from the DKMS tree.
------------------------------
Done.
Loading new nvidia-440.64 DKMS files...
Building for 5.3.0-45-generic
Building for architecture x86_64
Building initial module for 5.3.0-45-generic
ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/nvidia-dkms-440.0.crash'
Error! Application of patch disable_fstack-clash-protection_fcf-protection.patch failed.
Check /var/lib/dkms/nvidia/440.64/build/ for more information.
dpkg: error processing package nvidia-dkms-440 (--configure):
 installed nvidia-dkms-440 package post-installation script subprocess returned error exit status 6
Setting up python3-keyring (18.0.1-1ubuntu1) ...
dpkg: dependency problems prevent configuration of nvidia-driver-440:
 nvidia-driver-440 depends on nvidia-dkms-440 (= 440.64-0ubuntu0~0.19.10.2); however:
  Package nvidia-dkms-440 is not configured yet.

dpkg: error processing package nvidia-driver-440 (--configure):
 dependency problems - leaving unconfigured
No apport report written because the error message indicates its a followup error from a previous failure.
Processing triggers for man-db (2.8.7-3) ...
Processing triggers for initramfs-tools (0.133ubuntu10) ...
update-initramfs: Generating /boot/initrd.img-5.3.0-45-generic
I: The initramfs will attempt to resume from /dev/nvme0n1p4
I: (UUID=4ef37681-250c-4d75-954b-c8cf78fd5e65)
I: Set the RESUME variable to override this.
Errors were encountered while processing:
 nvidia-dkms-440
 nvidia-driver-440
E: Sub-process /usr/bin/dpkg returned an error code (1)
alex@alex:~$ 

alex@alex:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post. You will have to rename the file ending to something else since the forum software doesn’t accept .gz files (nifty!).

Just uploaded it in the post, thanks.

You have a mixed-up driver installation from running a .run installer over the packaged driver. Please change to an empty directory and run
sudo apt purge "nvidia*"
Afterwards, run the .run installer again with the --uninstall option.
Then reinstall the driver using the Software & Updates application.

Everything seemed to work without any errors, but I don’t think it actually did:

  • My game still only runs at ~10 FPS, whereas it used to run at 60 FPS with no issues.

  • NVIDIA X Server Settings still doesn’t give me any options when NVIDIA (Performance Mode) is selected, even after a reboot, and prime-select also confirms that I’m on the NVIDIA profile.

  • I get the following result:

    alex@alex:~$ nvidia-smi
    Failed to initialize NVML: Driver/library version mismatch

Also, should I use the 435 or the 440 driver? 440 shows as the recommended version, so that’s what I installed.

In case this can help:

alex@alex:~$ grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*
/etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb
/etc/modprobe.d/nvidia-drm.conf:options nvidia_drm modeset=1
/etc/modprobe.d/nvidia-drm-nomodeset.conf:options nvidia-drm modeset=1
/etc/modprobe.d/zz-nvidia-modeset.conf:options nvidia_drm modeset=1
/lib/modprobe.d/blacklist-nvidia.conf:# This file was generated by nvidia-prime
/lib/modprobe.d/blacklist-nvidia.conf:blacklist nvidia
/lib/modprobe.d/blacklist-nvidia.conf:blacklist nvidia-drm
/lib/modprobe.d/blacklist-nvidia.conf:blacklist nvidia-modeset
/lib/modprobe.d/blacklist-nvidia.conf:alias nvidia off
/lib/modprobe.d/blacklist-nvidia.conf:alias nvidia-drm off
/lib/modprobe.d/blacklist-nvidia.conf:alias nvidia-modeset off
/lib/modprobe.d/nvidia-kms.conf:# This file was generated by nvidia-prime
/lib/modprobe.d/nvidia-kms.conf:options nvidia-drm modeset=0
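
A listing like the one above is easier to read once it is grouped by module. The following is a minimal Python sketch, not part of any NVIDIA or Ubuntu tooling: the function name is made up for illustration, and it only assumes the `path:directive` format that grep prints.

```python
from collections import defaultdict

def summarize_modprobe(grep_lines):
    """Summarize `grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*` output.

    Returns (blacklisted, modeset): the set of modules that are blacklisted
    or aliased to "off", and a mapping module -> {modeset value: [files]}.
    """
    blacklisted = set()
    modeset = defaultdict(lambda: defaultdict(list))
    for line in grep_lines:
        path, _, directive = line.partition(":")
        words = directive.split()
        if not words or words[0].startswith("#"):
            continue  # skip comments and empty matches
        # modprobe treats '-' and '_' in module names as equivalent
        if words[0] == "blacklist" and len(words) >= 2:
            blacklisted.add(words[1].replace("-", "_"))
        elif words[0] == "alias" and len(words) >= 3 and words[2] == "off":
            blacklisted.add(words[1].replace("-", "_"))
        elif words[0] == "options" and len(words) >= 3:
            module = words[1].replace("-", "_")
            for opt in words[2:]:
                key, _, value = opt.partition("=")
                if key == "modeset":
                    modeset[module][value].append(path)
    return blacklisted, {m: dict(v) for m, v in modeset.items()}
```

Fed the lines above, this should report nvidia, nvidia_drm and nvidia_modeset (plus nvidiafb) as blacklisted, and show nvidia_drm being given both modeset=1 and modeset=0 by different files.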

nvidia-bug-report.log (1.7 MB)

So I fixed everything by deleting the file:

/lib/modprobe.d/blacklist-nvidia.conf

and then running sudo update-initramfs -u

However, when I tested the game again, it ran perfectly at 60 FPS but then did the same thing: it just rebooted after 5 minutes, and now it’s as if I don’t have any NVIDIA drivers again:

alex@alex:~$ grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*
/etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb
/etc/modprobe.d/nvidia-drm.conf:options nvidia_drm modeset=1
/etc/modprobe.d/nvidia-drm-nomodeset.conf:options nvidia-drm modeset=1
/etc/modprobe.d/zz-nvidia-modeset.conf:options nvidia_drm modeset=1
/lib/modprobe.d/nvidia-kms.conf:# This file was generated by nvidia-prime
/lib/modprobe.d/nvidia-kms.conf:options nvidia-drm modeset=0
alex@alex:~$ prime-select query
nvidia
alex@alex:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

alex@alex:~$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:1c.0/0000:3a:00.0 ==
modalias : pci:v00008086d00002526sv00008086sd00000014bc02sc80i00
vendor   : Intel Corporation
model    : Wireless-AC 9260
manual_install: True
driver   : backport-iwlwifi-dkms - distro free

alex@alex:~$ 

No clue what’s happening here. I played that game for hours before and never had an issue; in fact, I haven’t had any issue with my system in months, so I’m at a loss.

Latest logs:
nvidia-bug-report.log (1.7 MB)

There’s something wrong with your hardware: the whole PCI bridge, including the NVIDIA GPU, went missing. Maybe it’s a BIOS safety measure in case of overheating, or maybe your hardware is simply broken. Try removing the battery and power cord, hold down the power button for 20 seconds to discharge the mainboard, then reattach the power and check whether the NVIDIA GPU is visible again:
sudo lspci -d 10de:*
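
The same check can be scripted without lspci by reading sysfs, where each PCI device directory exposes a `vendor` file; `0x10de` is the NVIDIA vendor ID used in the command above. This is only a minimal sketch, and the function name and the `sysfs_root` parameter are illustrative choices:

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # same vendor ID as in `lspci -d 10de:*`

def find_nvidia_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return the PCI addresses of all devices whose vendor ID is NVIDIA's."""
    found = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return found  # no sysfs PCI tree here (e.g. not running on Linux)
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip().lower()
        except OSError:
            continue  # entry without a readable vendor file
        if vendor == NVIDIA_VENDOR_ID:
            found.append(dev.name)
    return found

if __name__ == "__main__":
    devices = find_nvidia_devices()
    print("NVIDIA devices:", ", ".join(devices) if devices else "none found")
```

An empty result would correspond to lspci printing nothing, i.e. the GPU not being visible on the bus at all.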

Looks like it made everything go back to normal:

alex@alex:~$ sudo lspci -d 10de:*
[sudo] password for alex: 
01:00.0 3D controller: NVIDIA Corporation GP107M [GeForce GTX 1050 Ti Mobile] (rev a1)
alex@alex:~$ prime-select query
nvidia
alex@alex:~$ nvidia-smi
Wed Apr  1 09:00:21 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   39C    P0    N/A /  N/A |    464MiB /  4042MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1651      G   /usr/lib/xorg/Xorg                            72MiB |
|    0      2805      G   /usr/lib/xorg/Xorg                           135MiB |
|    0      3072      G   /usr/bin/gnome-shell                         155MiB |
|    0      3524      G   ...AAAAAAAAAAAAAAgAAAAAAAAA --shared-files    50MiB |
+-----------------------------------------------------------------------------+
alex@alex:~$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:1c.0/0000:3b:00.0 ==
modalias : pci:v00008086d00002526sv00008086sd00000014bc02sc80i00
vendor   : Intel Corporation
model    : Wireless-AC 9260
manual_install: True
driver   : backport-iwlwifi-dkms - distro free

== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001C8Csv00001028sd0000087Cbc03sc02i00
vendor   : NVIDIA Corporation
model    : GP107M [GeForce GTX 1050 Ti Mobile]
driver   : nvidia-driver-390 - third-party free
driver   : nvidia-driver-435 - distro non-free
driver   : nvidia-driver-440 - third-party free recommended
driver   : xserver-xorg-video-nouveau - distro free builtin

alex@alex:~$ 

I also cleaned it inside, but honestly it wasn’t really dusty.

I’m kind of clueless here, because I never had overheating issues before, even though I know that the XPS 9570 is known for temperature issues and thermal throttling.

Is there any way to adjust the thermal settings for the GPU? Or is it something that is done from the BIOS?

On notebooks, everything is done by the system BIOS, so there’s not really anything that can be tweaked driver-wise.
You could run nvidia-smi with the -l (loop) and -f (log to file) options to record the temperature and check whether that’s really the issue.
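
One way to make such a log easy to inspect is to record it as CSV, for example with `nvidia-smi --query-gpu=timestamp,temperature.gpu --format=csv -l 5 -f gpu_temp.csv`. The query fields, the 5-second interval and the file name are assumptions here; only -l and -f come from the suggestion above. A minimal Python sketch for pulling the hottest sample out of such a file could then look like this:

```python
import csv

def peak_temperature(csv_lines):
    """Return (timestamp, temp_c) of the hottest sample, or None if no samples.

    Expects rows of the form "timestamp, temperature"; the header row and any
    row whose second column is not an integer are skipped.
    """
    peak = None
    for row in csv.reader(csv_lines, skipinitialspace=True):
        if len(row) < 2:
            continue  # blank or truncated line
        try:
            temp = int(row[1])
        except ValueError:
            continue  # header or malformed line
        if peak is None or temp > peak[1]:
            peak = (row[0], temp)
    return peak

# Usage (the file name is hypothetical):
# with open("gpu_temp.csv", newline="") as f:
#     print(peak_temperature(f))
```

If the last samples before a crash show the temperature climbing steadily, that would support the overheating theory; a flat, moderate temperature would point elsewhere.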

Is there anything weird in the logs?

gpu_logs.log (116 KB)

I feel like the GPU temp isn’t even that high to cause an issue…

83°C isn’t exactly cool, but it’s also no reason to shut down completely. Are there any thresholds that can be set in the system BIOS?

Not that I can see, which actually surprises me. However, I found these in the BIOS events. I tried to look in the Ubuntu logs, but I’m not sure I found anything related to the reboot I experienced. To be fair, I saw some CPU-related thermal events but no GPU thermal events, and in any case the Ubuntu and BIOS events didn’t have the same timestamps, so I’m even less sure of what I saw.

The BIOS uses UTC, while Ubuntu displays local time: 18:28 UTC = 14:28 EDT. So the timestamps do match, and it was a shutdown due to overheating.
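
The conversion can be double-checked with a few lines of Python. This is a minimal sketch: the calendar date is a placeholder (only the time of day is given here), and EDT is modeled as a fixed UTC-4 offset rather than looked up from a time zone database.

```python
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4), "EDT")  # Eastern Daylight Time = UTC-4

def utc_to_edt(utc_dt):
    """Convert a timezone-aware UTC datetime to EDT."""
    return utc_dt.astimezone(EDT)

# Placeholder date; only the time of day (18:28 UTC) is taken from the post.
bios_event = datetime(2020, 4, 1, 18, 28, tzinfo=timezone.utc)
print(utc_to_edt(bios_event).strftime("%H:%M %Z"))  # 14:28 EDT
```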

I checked the hardware of your notebook and it’s a two-fan design. Those often need specific management software; otherwise the fan curve falls back to the (usually crappy) BIOS defaults. I suspect the GPU fan doesn’t spin fast enough. Please see this on how to control your fans:
https://askubuntu.com/questions/1094485/dell-xps-15-9570-how-to-control-the-fans

Tried it, but it didn’t go smoothly. I had issues with the undervolt using msr until I tried iuvolt; then it worked pretty well for the rest, but I feel like it’s worse than before.

I’ll reset the undervolt to 0 and see what happens.

So the laptop doesn’t reboot all of a sudden anymore, but the game just stops and closes after a minute. I’ve played with the configs quite a bit and done a lot of installs and uninstalls in the last few months, so maybe the best option at this point is a fresh OS install. Thankfully, Ubuntu 20.04 is out very soon.

Looks like you played with undervolting your CPU, which doesn’t really help with your situation. Did you check whether a BIOS update is available? Did you manage to control your fans and, e.g., set both to 100% to check whether that keeps everything running?