Xid "Internal micro-controller halt" and device not found with Mobile GTX 1050

Dear all,

I have a Dell XPS 9560 laptop running an up-to-date Ubuntu 18.10. For no apparent reason, I seem to have lost access to the discrete GPU. I had been using the 396 driver successfully until now, and upgraded to 415 to see whether that would help me diagnose the problem, to no avail.

nvidia-smi does not see the device. I managed to get it to show up again for a little while by using PRIME to switch back and forth between the Intel GPU and the mobile GTX 1050, but with the Nvidia GPU the screen was unresponsive and full of red dots and chequered black squares.

$ lspci | grep 3D
01:00.0 3D controller: NVIDIA Corporation GP107M [GeForce GTX 1050 Mobile] (rev a1)
$ nvidia-smi 
No devices were found

The relevant parts of the kernel messages are below:

$ dmesg | grep -iE 'nvi|NVRM'
[    8.546940] nvidia: loading out-of-tree module taints kernel.
[    8.546950] nvidia: module license 'NVIDIA' taints kernel.
[    8.558484] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    8.568429] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[    8.690810] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  415.27  Thu Dec 20 17:25:03 CST 2018 (using threaded interrupts)
[    8.695220] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  415.27  Thu Dec 20 17:06:08 CST 2018
[    8.702188] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[   10.049082] NVRM: GPU at PCI:0000:01:00: GPU-aef5b373-444a-6e8d-f987-95d562a0e13a
[   10.049085] NVRM: Xid (PCI:0000:01:00): 62, 0a7c(2ab0) 00000000 00000000
[   30.604010] NVRM: RmInitAdapter failed! (0x53:0x65:1914)
[   30.604029] NVRM: rm_init_adapter failed for device bearing minor number 0
[   30.604073] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[   30.604245] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
[   30.618388] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 511
[   30.678619] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[   30.678648] NVRM: rm_init_adapter failed for device bearing minor number 0
[   30.972572] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[   30.972606] NVRM: rm_init_adapter failed for device bearing minor number 0
[   34.356478] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[   34.356528] NVRM: rm_init_adapter failed for device bearing minor number 0
[   36.389830] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[   36.389855] NVRM: rm_init_adapter failed for device bearing minor number 0
[   36.551046] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[   36.551070] NVRM: rm_init_adapter failed for device bearing minor number 0
[   36.713788] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[   36.713807] NVRM: rm_init_adapter failed for device bearing minor number 0
[   36.869363] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[   36.869396] NVRM: rm_init_adapter failed for device bearing minor number 0
[   37.031913] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[   37.031940] NVRM: rm_init_adapter failed for device bearing minor number 0
[   37.186996] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[   37.187016] NVRM: rm_init_adapter failed for device bearing minor number 0
[   37.337333] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[   37.337351] NVRM: rm_init_adapter failed for device bearing minor number 0
[   37.490194] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[   37.490216] NVRM: rm_init_adapter failed for device bearing minor number 0
[   37.665751] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[   37.665772] NVRM: rm_init_adapter failed for device bearing minor number 0

Each time I run nvidia-smi, a new pair of lines appears in dmesg:

NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
NVRM: rm_init_adapter failed for device bearing minor number 0

exactly like the ones someone posted here earlier for a damaged 2080 Ti that needed an RMA. I do not fully understand whether this NVRM message is also meaningful for mobile GPUs.

Xid 62 can apparently indicate either a hardware or a driver failure.
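
For anyone triaging a similar log, here is a minimal Python sketch that pulls the Xid number and the RmInitAdapter error codes out of dmesg text. The regular expressions are an assumption based only on the lines quoted above, and summarize_nvrm is a hypothetical helper name, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Patterns inferred from the NVRM lines quoted in this post (assumption:
# other driver versions may format these messages differently).
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")
INIT_RE = re.compile(r"NVRM: RmInitAdapter failed! \(([^)]+)\)")


def summarize_nvrm(dmesg_text):
    """Return ([(pci_address, xid_number), ...], Counter of RmInitAdapter codes)."""
    xids = [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(dmesg_text)]
    failures = Counter(m.group(1) for m in INIT_RE.finditer(dmesg_text))
    return xids, failures


# A few lines copied from the dmesg output above.
SAMPLE = """\
[   10.049085] NVRM: Xid (PCI:0000:01:00): 62, 0a7c(2ab0) 00000000 00000000
[   30.604010] NVRM: RmInitAdapter failed! (0x53:0x65:1914)
[   30.678619] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[   30.972572] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
"""

if __name__ == "__main__":
    xids, failures = summarize_nvrm(SAMPLE)
    print("Xid events:", xids)
    print("RmInitAdapter failure codes:", dict(failures))
```

Feeding it the full output of dmesg (for example via a saved text file) should make it easier to see whether the same failure code repeats on every nvidia-smi attempt.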

nvidia-bug-report.log.gz (543 KB)

That really sounds and looks like a hardware failure that happened on Feb 6th after a reboot. Did a kernel update occur around then? If so, you could try downgrading the kernel, though there is not much hope.

Dear generix,

Thank you so much! I was looking forward to reading your insight.

Based on the dpkg log, it looks like there was a minor kernel upgrade from 4.18.0-13 to 4.18.0-14, but on 5th February.

dpkg log (continuous extract, also attached):

...
2019-02-05 06:23:28 status installed linux-image-4.18.0-14-generic:amd64 4.18.0-14.15 
2019-02-06 20:25:16 startup archives unpack 
2019-02-06 20:25:16 upgrade e2fsprogs-l10n:all 1.44.4-2ubuntu0.1 1.44.4-2ubuntu0
...
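
A quick way to pull the kernel installs out of a dpkg log is sketched below in Python. It assumes the whitespace-separated "date time status installed package version" layout seen in the extract above; kernel_installs is a hypothetical helper name.

```python
from datetime import datetime


def kernel_installs(dpkg_log_text):
    """Yield (timestamp, package, version) for 'status installed linux-image-*' lines."""
    for line in dpkg_log_text.splitlines():
        parts = line.split()
        # Expected fields: date, time, "status", "installed", package, version
        if (
            len(parts) >= 6
            and parts[2:4] == ["status", "installed"]
            and parts[4].startswith("linux-image-")
        ):
            stamp = datetime.strptime(parts[0] + " " + parts[1], "%Y-%m-%d %H:%M:%S")
            yield stamp, parts[4], parts[5]


# Lines copied from the dpkg log extract above.
SAMPLE = """\
2019-02-05 06:23:28 status installed linux-image-4.18.0-14-generic:amd64 4.18.0-14.15
2019-02-06 20:25:16 startup archives unpack
2019-02-06 20:25:16 upgrade e2fsprogs-l10n:all 1.44.4-2ubuntu0.1 1.44.4-2ubuntu0
"""

if __name__ == "__main__":
    for stamp, package, version in kernel_installs(SAMPLE):
        print(stamp.isoformat(sep=" "), package, version)
```

On Ubuntu the log normally lives at /var/log/dpkg.log, so the same function can be applied to the contents of that file.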

On the other hand, the nvidia log shows that nvidia-drm still initialised successfully on 6th February at 8:46am, after that 5th February minor kernel upgrade:

nvidia log (continuous extract):

Feb  6 08:46:34 xps9560 kernel: [    3.292908] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Feb  6 08:46:34 xps9560 kernel: [    4.273974] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
Feb  6 08:46:34 xps9560 kernel: [    4.281053] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 235
Feb  6 14:03:30 xps9560 kernel: [    3.005851] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
Feb  6 14:03:30 xps9560 kernel: [    3.006156] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  396.54  Tue Aug 14 19:02:34 PDT 2018 (using threaded interrupts)
Feb  6 14:03:30 xps9560 kernel: [    3.056444] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  396.54  Tue Aug 14 23:08:44 PDT 2018
Feb  6 14:03:30 xps9560 kernel: [    3.071691] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Feb  6 14:03:30 xps9560 kernel: [    7.096433] NVRM: RmInitAdapter failed! (0x31:0xffff:842)

The “NVRM: RmInitAdapter failed” error only appeared at 2:03pm, several hours later. What does “RmInitAdapter” stand for, by the way?
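
The gap between the last good initialisation and the first failure can be read off the kernel log with a small Python sketch like the one below. It assumes the syslog line layout shown in the extract above; because those lines carry no year, the year is passed in explicitly (2019 here, taken from the dpkg log). The function names are hypothetical.

```python
import re
from datetime import datetime

# Syslog-style kernel line: "Feb  6 08:46:34 host kernel: [  4.273974] message"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: \[\s*[\d.]+\] (.*)$"
)


def parse_events(log_text, year):
    """Return [(datetime, message), ...]; 'year' is supplied by the caller."""
    events = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((stamp, m.group(2)))
    return events


def last_ok_before_first_failure(events):
    """Return (last successful nvidia-drm init before the first failure, first failure)."""
    failures = [t for t, msg in events if "RmInitAdapter failed" in msg]
    if not failures:
        return None, None
    first_bad = min(failures)
    oks = [t for t, msg in events
           if "Initialized nvidia-drm" in msg and t < first_bad]
    return (max(oks) if oks else None), first_bad


# Two lines copied from the nvidia log extract above.
SAMPLE = """\
Feb  6 08:46:34 xps9560 kernel: [    4.273974] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
Feb  6 14:03:30 xps9560 kernel: [    7.096433] NVRM: RmInitAdapter failed! (0x31:0xffff:842)
"""

if __name__ == "__main__":
    ok, bad = last_ok_before_first_failure(parse_events(SAMPLE, 2019))
    print("last good init:", ok)
    print("first failure: ", bad)
    if ok and bad:
        print("gap:", bad - ok)
```

Comparing that window against the dpkg timestamps is one way to check whether any package change falls between the working and the failing boot.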
dpkg.log.gz (9.69 KB)

So nothing changed between the working and non-working states; broken hardware, then.
"RmInitAdapter failed" is an internal nvidia driver message that just says initializing the hardware/firmware failed at some stage.

Yes, no change. I have tried booting kernel 4.18.0-13, and as expected, it made no difference; the GPU must be broken indeed…

I assume I would have to replace the Dell XPS 9560 motherboard, which will cost a lot… The hardware has failed two years and a month after purchase.

Thank you again for your support!

You should talk to Dell anyway; maybe they will repair it free of charge, or at least at reduced cost, if you ask nicely enough.

In case anyone is interested: it is indeed a good idea to chat with Dell.

I was offered a motherboard replacement for the XPS 15 9560 (i7-7700HQ & GTX 1050) for $320, including service. After we talked a while longer, they offered to simply send the motherboard for $171, which is quite good (although I will need to send them back the motherboard with the broken GPU).

Thanks again for the support, generix!

Update: I have replaced the motherboard. After being frightened a bit by the BIOS update and the Dell service tag request, I managed to boot again by switching the new BIOS to AHCI and enabling legacy ROMs (I had not installed Linux in UEFI mode).

The new GPU is working perfectly, and I have no driver errors of any kind! Good diagnosis!