Complete freeze with nvidia-prime

I run an XPS 15 9560 (i7-7700HQ) with Ubuntu 16.10 and the driver installed via “apt-get install nvidia-378”.
After an update of nvidia-378 on 2017-02-14, X stopped working. After debugging for a while without success I reverted with “apt-get purge nvidia-378; apt-get install nvidia-375; reboot” and it works again.

Some dumb questions:

  • Do you have DKMS installed?
  • Did you reboot before starting X again?

I have a 9560 and it’s running 378 on 16.04 just fine.

Here are my logs → http://htx.webfactional.com/nvidia-logs.zip, with the output of:
cat /var/log/gpu-manager.log
cat /var/log/Xorg.0.log
lspci -v
cat /proc/acpi/bbswitch
cat /usr/lib/nvidia-378-prime/ld.so.conf
lsmod
dmesg

I’ve dropped the zip file on a personal space as I don’t seem to have an option to attach files here. Edit: it is now also attached here.

These are snapshots taken after a normal boot on the Intel GPU, with the file /lib/systemd/system/nvidia-persistenced.service moved to my home directory.

The problem I’m seeing is with shutdowns and reboots. I’m apparently logged out of X, but the laptop doesn’t turn off by itself. I get a black screen showing the filesystem status:

/dev/nvme0n1p7: recovering journal
/dev/nvme0n1p7: clean 500287/21094400 files, 21493115/84362496 blocks

That seems to be a message left over from the last boot. I always have to force the laptop off with the power button, so the filesystem is not cleanly unmounted and the journal has to be recovered on the next boot.
At that point I can’t get a terminal with CTRL+ALT+F1; the system is frozen.

nvidia-logs.zip (31.9 KB)

Downgraded to nvidia-375, same problem: while on the Intel GPU I must always force the laptop off with the power button because it freezes.

Thanks for the logs.
The upgrade to 378.13 added another problem (from dmesg):

[    5.590376] NVRM: API mismatch: the client has the version 378.13, but
               NVRM: this kernel module has the version 378.09.  Please
               NVRM: make sure that this kernel module and all NVIDIA driver
               NVRM: components have the same version.

The kernel modules didn’t get updated. But that’s a separate issue. The bigger question is why a client is connecting at that time. It must be the X server, but then it’s starting too early. The nvidia modules get unloaded after that, and bbswitch fails to turn off the NVIDIA GPU.
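
A small helper can pull the two version numbers out of that NVRM message so the mismatch is obvious at a glance. This is only a sketch that assumes the message wording shown above; the function name nvrm_versions is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read kernel log text on stdin and print the client and
# kernel-module versions from an "NVRM: API mismatch" message, if present.
nvrm_versions() {
    # The message spans several lines, so join them before matching.
    tr '\n' ' ' |
        sed -n 's/.*client has the version \([0-9.]*\), but.*kernel module has the version \([0-9.]*\)\..*/client=\1 module=\2/p'
}

# Usage on a live system (not run here):
#   dmesg | nvrm_versions
#   dkms status    # lists which module versions are built for which kernels
```

If the helper prints nothing, the log contains no such mismatch message.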
cat /proc/acpi/bbswitch
gave you
0000:01:00.0 ON
while it should be OFF
Can you try to turn it off then,
echo OFF > /proc/acpi/bbswitch
and then check again with
cat /proc/acpi/bbswitch
If that works, try to turn it on again.
If turning off/on doesn’t work, this might be either a problem with bbswitch or ACPI of your computer.
Ever tried to suspend/resume when on nvidia?
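
The state check above can be wrapped in a tiny helper, which makes it easier to compare what the card reports with what you expect after each write. A minimal sketch: the function names are invented here, and the optional file argument exists only so it can be tried on a saved copy of the output.

```shell
#!/usr/bin/env bash
# Hypothetical helpers around the bbswitch status file.
# The file holds one line such as "0000:01:00.0 ON".

# Print just the power state (ON or OFF).
bbswitch_state() {
    awk '{ print $2; exit }' "${1:-/proc/acpi/bbswitch}"
}

# Succeed only if the card reports the expected state.
bbswitch_expect() {
    local want=$1 file=${2:-/proc/acpi/bbswitch} got
    got=$(bbswitch_state "$file")
    if [ "$got" = "$want" ]; then
        echo "ok: card is $got"
    else
        echo "mismatch: wanted $want, card reports $got" >&2
        return 1
    fi
}
```

For example, after writing OFF as root, bbswitch_expect OFF should succeed on a machine where the switch works.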

Forgot: after every nvidia driver install/upgrade/downgrade you will have to remove the file /lib/systemd/system/nvidia-persistenced.service again. Having the persistence daemon started on module load will make debugging problems harder.

I’m trying with nvidia-375 and removed the /lib/systemd/system/nvidia-persistenced.service again.

htrex@OrionXPS:~$ cat /proc/acpi/bbswitch
0000:01:00.0 ON
htrex@OrionXPS:~$ sudo echo OFF > /proc/acpi/bbswitch
bash: /proc/acpi/bbswitch: Permission denied

edit: that’s on nvidia profile

That fails because of the ‘>’: the redirection is performed by your own unprivileged shell, not by sudo. Open a root shell first:
sudo -s
then try turning it off and on again. All this of course while on intel.
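
An alternative to a root shell is to let a privileged tee do the write, since tee opens the file itself. A minimal sketch: the BBSWITCH and SUDO variables and the function name are assumptions introduced here so the snippet can be tried on a scratch file first.

```shell
#!/usr/bin/env bash
# Path of the bbswitch control file; point BBSWITCH at a scratch file for a dry run.
BBSWITCH="${BBSWITCH:-/proc/acpi/bbswitch}"
# Command used to gain privileges; set SUDO="" when already root.
SUDO="${SUDO-sudo}"

# Hypothetical helper: write ON or OFF to the control file.
bbswitch_write() {
    # tee runs under sudo and opens the file itself, so the write is privileged,
    # unlike "sudo echo OFF > file" where the calling shell opens the file.
    printf '%s\n' "$1" | $SUDO tee "$BBSWITCH" > /dev/null
}

# Usage (not run here):
#   bbswitch_write OFF && cat /proc/acpi/bbswitch
```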

While on Intel, nvidia-375 drivers, /lib/systemd/system/nvidia-persistenced.service removed

htrex@OrionXPS:~$ sudo -s
[sudo] password for htrex:
root@OrionXPS:~# echo OFF > /proc/acpi/bbswitch
root@OrionXPS:~# cat /proc/acpi/bbswitch
0000:01:00.0 ON
root@OrionXPS:~# echo ON > /proc/acpi/bbswitch
root@OrionXPS:~# cat /proc/acpi/bbswitch
0000:01:00.0 ON

That seems to be the problem. It looks like the GPU enters some undefined state, either when bbswitch is used to turn it off or when the nvidia modules are unloaded. So when you shut down, the kernel hangs because it can’t power off the GPU.
Three (likely) possible bugs:

  1. bug in bbswitch
  2. bug in acpi
  3. bug in nvidia driver on unload

To rule out the third possibility, please:

  1. switch to nvidia using prime-select nvidia
  2. disable the display manager using systemctl disable display-manager
     (I don’t know whether 16.04/16.10 uses display-manager or lightdm as the service name)
  3. reboot; afterwards you should be on a text console
  4. there, unload the nvidia drivers:
     rmmod nvidia-uvm
     rmmod nvidia-drm
     rmmod nvidia-modeset
     rmmod nvidia
     (Edit: make sure the driver is unloaded: lsmod | grep nvidia)
  5. then reboot using systemctl reboot

If it hangs then, this is a bug in the driver.
If it reboots cleanly, the bug is in bbswitch or ACPI.
You can get your desktop back using systemctl enable display-manager
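
The unload-and-verify part of that test could be scripted roughly as follows. This is a sketch, not a tested procedure for any particular machine: the function names are invented, and it assumes the four module names listed above (lsmod shows them with underscores).

```shell
#!/usr/bin/env bash
# Read `lsmod` output on stdin and print the nvidia modules still loaded.
# lsmod prints a header line, then one module per line with the name first.
nvidia_modules_loaded() {
    awk 'NR > 1 && $1 ~ /^nvidia/ { print $1 }'
}

# Unload the driver stack, dependents first (run as root on a text console).
unload_nvidia() {
    local mod
    for mod in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
        # Only try modules that are actually loaded.
        if lsmod | awk 'NR > 1 { print $1 }' | grep -qx "$mod"; then
            rmmod "$mod" || return 1
        fi
    done
    # Any nvidia module left over means the unload did not fully succeed.
    if [ -n "$(lsmod | nvidia_modules_loaded)" ]; then
        echo "nvidia modules still loaded" >&2
        return 1
    fi
}

# Usage (not run here):
#   unload_nvidia && systemctl reboot
```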

I have the same problem.

Notebook ASUS ROG GL703VD

This notebook has intel/nvidia combo (both active).
Without nouveau and nvidia drivers everything works.
If I install nouveau (which was the default) or any nvidia driver so far, the notebook freezes on shutdown.
I agree that there might be a timing problem because this notebook is very fast… it takes a few SECONDS (like less than 10) to boot (from SSD and a default ubuntu 17.10 installation).

No solutions since February??

The problem seems related to NVIDIA driver unloading.
If I run prime-select intel and then reboot, it hangs (because nvidia was still selected).
After forcing a power-off with the power button and booting again, the system works and shuts down correctly.
If nvidia (or nouveau) is selected, Linux hangs at shutdown with no error.
My notebook has the latest bios by the way.
I didn’t try with other bioses because there is only one on asus website.

Try the kernel parameters
acpi_osi=! acpi_osi="Windows 2009"
Then run nvidia-bug-report.sh and attach the output when you report back.
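
On Ubuntu such parameters normally go into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub and a reboot. After rebooting it is worth confirming that they actually reached the kernel. A minimal sketch of such a check; the function name is made up, and it accepts either quoting form that may show up on the command line.

```shell
#!/usr/bin/env bash
# Hypothetical check: are both acpi_osi settings on the kernel command line?
# The optional argument is the file to read (default: the running kernel's).
check_acpi_osi() {
    local file=${1:-/proc/cmdline}
    # Matches acpi_osi="Windows 2009" as well as "acpi_osi=Windows 2009".
    if grep -q 'acpi_osi=!' "$file" &&
       grep -q 'acpi_osi="\{0,1\}Windows 2009' "$file"; then
        echo "acpi_osi override active"
    else
        echo "acpi_osi override missing"
        return 1
    fi
}
```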

The setting does not change the shutdown lockup, but I found that the problem appears only with GDM3 (which is installed by default).
Everything works fine with lightdm.
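
For anyone who wants to compare, this is a rough way to see which display manager is configured on Debian/Ubuntu systems, which record the default in /etc/X11/default-display-manager. The function name is invented for this sketch, and other distributions may not have that file at all.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the configured display manager (e.g. gdm3, lightdm).
# The optional argument is the file to read.
current_dm() {
    local file=${1:-/etc/X11/default-display-manager}
    [ -r "$file" ] || { echo "unknown"; return 1; }
    # The file contains the full path of the binary, e.g. /usr/sbin/gdm3.
    basename "$(head -n 1 "$file")"
}

# Usage (not run here):
#   current_dm
#   sudo dpkg-reconfigure lightdm    # prompts for the default display manager
```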

I’m curious if anyone here runs Arch Linux, or better yet if the file /usr/lib/xorg/modules/input/mouse_drv.so exists on their system.

I came across something I never would’ve found otherwise by running bumblebee in the foreground with debugging enabled:

[ 3450.473647] [DEBUG][XORG] (II) Using input driver 'mouse' for '<default pointer>'
[ 3450.473652] [DEBUG][XORG] (**) Option "CorePointer" "on"
[ 3450.473657] [DEBUG][XORG] (**) <default pointer>: always reports core events
[ 3450.473663] [DEBUG][XORG] /usr/bin/X: symbol lookup error: /usr/lib/xorg/modules/input/mouse_drv.so: undefined symbol: xf86GetOS
[ 3450.484047] [DEBUG]Process with PID 1753 returned code 127
[ 3450.484084] [ERROR]X did not start properly
[ 3450.484202] [DEBUG]Socket closed.
^C[ 3454.082978] [WARN]Received Interrupt signal.
[ 3454.083027] [DEBUG]Socket closed.
[ 3454.083439] [DEBUG]Killing all remaining processes.

In my case I had the impression that my laptop locked up whenever I tried to run optirun, but it was actually due to the bumblebee service spamming my system logs through its default --use-syslog flag:

May 18 03:12:36 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:36 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:36 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:37 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:37 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:37 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:37 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:37 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:37 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:38 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:38 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:38 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:38 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:38 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:38 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:39 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:39 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:39 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:39 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:39 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:39 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:40 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:40 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:40 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:40 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:40 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:40 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:41 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:41 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:41 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:41 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:41 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:41 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:42 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:42 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:42 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.
May 18 03:12:42 c1-linuxdev bumblebeed[18682]: X did not start properly
May 18 03:12:42 c1-linuxdev bumblebeed[18682]: [XORG] (WW) NVIDIA(0): Unable to get display device for DPI computation.

It was a tricky one to catch, since the service file gives the impression that it backs off for 60 seconds after a failure, but that only applies if the bumblebee daemon itself dies, not when the Xorg binary it forks fails.
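
To see how fast the retries really come, the failures can be tallied per second. A minimal sketch, assuming syslog-style lines like the ones above; the function name is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read syslog-style lines on stdin and count
# "X did not start properly" messages from bumblebeed per timestamp.
x_failures_per_second() {
    awk '/bumblebeed\[[0-9]+\]: X did not start properly/ {
             # Fields 1-3 are the timestamp, e.g. "May 18 03:12:36".
             n[$1 " " $2 " " $3]++
         }
         END { for (t in n) print t, n[t] }' | sort
}

# Usage (not run here):
#   journalctl -u bumblebeed | x_failures_per_second
```

Several failures within the same second, second after second, point to a retry loop inside the daemon rather than to systemd restarting the service.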

Anyway, removing the package that owned /usr/lib/xorg/modules/input/mouse_drv.so solved my issues – as a test perhaps you can temporarily move it out of the way to debug?

sudo mv -fv /usr/lib/xorg/modules/input/mouse_drv.so /usr/lib/xorg/modules/input/mouse_drv.so.bak

Hi zibri_,

Is there anything interesting from the failed shutdown in your system log after a reboot? Alternatively, if you can SSH into the system from a remote machine and watch the output of “dmesg -w” while it tries to shut down, it might catch something interesting.

I think generix is on the right track: if the problem reproduces with both nouveau and nvidia, then it’s pretty likely to be a platform problem rather than a driver problem.

@thelambeers: The xf86GetOS function was removed in xserver 1.20, which Arch Linux just upgraded to recently. Whoever maintains the mouse_drv package needs to rebuild it against the new X server. That’s unlikely to be related to zibri_'s problem since xserver 1.20 just came out.