Laptop locks up starting X server with any driver after 313.26

Starting with driver version 313.30, my laptop locks up completely when starting the X server. The lockup happens right when the NVIDIA splash screen is shown. Everything works without any problem on 313.26. I’ve tried every driver version released since 313.26, and all of them lock up. I’ll attach the nvidia bug report log from 313.26 and what I was able to collect from the 313.30 install.

The lock-up appears to be complete. The only way I can recover is to hold down the power button for 10 seconds to shut the laptop off. The system log doesn’t contain any useful information (just garbage since the log appears to not get flushed to the disk successfully).

Distribution: Fedora 17
Desktop: KDE
Kernel: Several recent ones, the latest tested being 3.8.13-100.fc17.x86_64
Laptop: Samsung 700G7C
NVIDIA chip: 675M

This laptop is somewhat unusual in that it has the 675M chip but does not use Optimus. The NVIDIA chip is always on and wired directly to the display. That was one of the main reasons I bought this particular laptop, since I bought it before Optimus support was available in the NVIDIA driver. The laptop has worked with every driver from the 295 series onward until 313.30.
[This file was removed because it was flagged as potentially malicious] (103 KB)
nvidia-bug-report-313.30.log.gz (62 KB)

Could I please get a confirmation from someone from NVIDIA that this is going to be looked into?

Once again, could someone from NVIDIA comment on this issue?

And again, NVIDIA people, please respond to this. If you need more information, I’ll do my best to provide it.

Tested with the 325.15 driver. The lockup still exists with that driver. So back to 313.26 again.

I have noticed one unusual thing in the system logs. With the working 313.26 drivers, I’m seeing these messages when the kernel module is inserted:

kernel: [ 14.300523] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=io+mem,decodes=none:owns=io+mem
kernel: [ 14.300697] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 313.26 Wed Feb 27 13:04:31 PST 2013

Which seems fairly normal. With 325.15 this is showing up:

kernel: [ 11.783279] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=io+mem,decodes=none:owns=io+mem
kernel: [ 11.783502] [drm] Initialized nvidia-drm 0.0.0 20130102 for 0000:01:00.0 on minor 0
kernel: [ 11.783508] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 325.15 Wed Jul 31 18:50:56 PDT 2013

Does anyone know why nvidia-drm shows up? Note that inserting the kernel module doesn’t lock up the machine. It locks up when attempting to start the X server.

I’m afraid there’s nothing in your log file that indicates a problem. Does the system still lock up if you run a console-only application that initializes the GPU, for example, “nvidia-xconfig -o /tmp/xorg.conf --enable-all-gpus”? If so, maybe you can get a kernel crash message. You could also try setting up netconsole to log messages to a different machine.
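For anyone trying the netconsole route, here is a minimal sketch of how the module parameter is put together. The IP addresses, interface name, and MAC address are placeholders I made up, not values from this thread, and the helper name `netconsole_param` is just for illustration. The script only prints the modprobe command so it can be reviewed before being run as root on the machine that locks up.

```shell
#!/bin/sh
# Build the netconsole module parameter, which has the form:
#   <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
# netconsole_param is a hypothetical helper name.
# $1 = sender IP, $2 = sender interface, $3 = receiver IP, $4 = receiver MAC
netconsole_param() {
    printf '6665@%s/%s,6666@%s/%s\n' "$1" "$2" "$3" "$4"
}

# Placeholder addresses: replace them with the values for your own network.
# This prints the command instead of running it; run the result as root.
echo "modprobe netconsole netconsole=$(netconsole_param 192.168.1.10 eth0 192.168.1.20 00:11:22:33:44:55)"
```

On the receiving machine something like `nc -u -l 6666` should show the messages, although the exact flags differ between netcat variants. Raising the console log level with `dmesg -n 8` on the sender before starting X helps make sure lower-priority messages are forwarded too.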

The nvidia-drm line is expected: on sufficiently new kernels that’s how the driver communicates with other drivers for RandR 1.4 display offload support.

Running nvidia-xconfig as indicated did lock up the machine.

I tried getting netconsole working but didn’t have any luck getting it to actually send logs to the remote machine. I did remotely log in and “tail -f” the system log but it stopped updating when the machine locked up.

While I was trying to capture something in the kernel logs, the X server actually started successfully once with release 325.15 instead of locking up. After it started, I captured the attached log. Unfortunately, I hadn’t started X with the logverbose option since I was expecting it to just lock up. When the X server started successfully, the only messages that showed up in the system log at the exact time I started it were these:

Aug 16 19:50:01 dogbert dbus-daemon[912]: ** Message: No devices in use, exit
Aug 16 19:50:01 dogbert acpid: client connected from 2011[0:1000]
Aug 16 19:50:01 dogbert acpid: 1 client rule loaded
Aug 16 19:50:12 dogbert dbus-daemon[912]: dbus[912]: [system] Activating via systemd: service name=‘org.freedesktop.UPower’ unit=‘upower.service’
Aug 16 19:50:12 dogbert dbus[912]: [system] Activating via systemd: service name=‘org.freedesktop.UPower’ unit=‘upower.service’
Aug 16 19:50:12 dogbert systemd[1]: Cannot add dependency job for unit mdmonitor-takeover.service, ignoring: Unit mdmonitor-takeover.service failed to load: No such file or directory. See system logs and ‘systemctl status mdmonitor-takeover.service’ for details.

The acpid messages were the messages that popped up right when it usually locks up. Unfortunately, I couldn’t repeat getting it to start X without locking up.

I’m curious whether you regularly test with a laptop that has a 675M directly driving the display instead of using Optimus?

Thanks
nvidia-bug-report-325.15.log.gz (81.9 KB)

That’s too bad. I still don’t see anything going wrong in your bug report. One last-ditch effort might be to log in via SSH, stop the system logger, and then run “cat /proc/kmsg”. That cuts out one of the middlemen between the kernel and the network stack, but it’s still not ideal. If that doesn’t work, maybe you can figure out how to get the kdump / crashkernel thing to work. I don’t have any experience with that, though.
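The /proc/kmsg suggestion could be sketched roughly like this, run from the second machine. The function name `capture_kmsg`, the host name, and the log file name are made up for illustration, and `rsyslog.service` is an assumption about which logger is running; substitute whatever your distribution uses.

```shell
#!/bin/sh
# capture_kmsg HOST LOGFILE  (hypothetical helper name)
# Stops the system logger on HOST, then streams raw kernel messages back
# over SSH and saves them locally as they arrive.
# Assumes root SSH access and that the logger unit is rsyslog.service.
capture_kmsg() {
    ssh "root@$1" 'systemctl stop rsyslog.service; cat /proc/kmsg' | tee "$2"
}

# Example with a placeholder host name, then start X on the laptop:
#   capture_kmsg laptop.example kmsg-325.15.log
```

Because tee writes the file on the second machine, whatever arrived before the freeze is kept even though the laptop never gets to flush its own logs to disk.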

I don’t think I have a laptop with a GeForce 675M in it.

I’ll try the /proc/kmsg method when I get some time.

I did modify the xorg.conf file to add the following option to the screen section in an attempt to rule out acpi issues:
Option "ConnectToAcpid" "0"
It still locked up.

I would hope that someone at NVIDIA has a laptop with a 675M that could be used to test with. This is definitely one of those times when having source code available to do a binary search for the change that introduced the bug would be useful.

I wonder if any other users can confirm whether they have a 675M that works with the newer drivers and whether it is an optimus or non-optimus setup?

A bit more information after some additional debugging. It does appear to be related to ACPI. I discovered that passing either of these parameters to the kernel allows 325.15 to start the X server reliably:
acpi=off
pci=noacpi

Also tried the following and it continued to lock up:
acpi=noirq
pnpacpi=off
noapic

So, something in the driver’s ACPI handling changed between 313.26 and 313.30. Is that enough info to help track down the fix needed in the driver?
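When testing several parameters one after another it is easy to lose track of which ones the running kernel actually booted with. A small sketch for checking that follows; `has_param` is a helper name made up for illustration.

```shell
#!/bin/sh
# has_param CMDLINE PARAM  (hypothetical helper name)
# Succeeds only if PARAM appears as a complete, space-separated word
# in CMDLINE, so "acpi=off" does not match "acpi=offline".
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report which of the two working parameters the running kernel was given.
cmdline=$(cat /proc/cmdline)
for p in acpi=off pci=noacpi; do
    if has_param "$cmdline" "$p"; then
        echo "$p is active"
    else
        echo "$p is not set"
    fi
done
```

For a one-off test the parameter can be added by pressing `e` at the GRUB menu and appending it to the kernel line. To keep it across reboots, assuming a GRUB 2 setup like Fedora’s, add it to GRUB_CMDLINE_LINUX in /etc/default/grub and regenerate the config with `grub2-mkconfig -o /boot/grub2/grub.cfg`.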

We have internal Bug “1287434:system hard freeze after X start on SamSung 700G7C system GeForce GTX 675M” to track this issue.

Thanks Sandip

Just to chime in: I’m experiencing the same issue with 319.49 on Linux Mint 15 x64, on the same model laptop. It hard froze the laptop during boot, to the point that I had to boot off a live CD and edit the GRUB configuration to turn ACPI off.

Meanwhile, this issue no longer reproduces with the latest Ubuntu 15.04 + R352 driver. Please test with the latest OS releases.