Segfault makes system unusable after a few moments; segfaults in libmutter-clutter and libnvidia-glcore.so.390.77?

Since I built a new system, it has crashed about every 2 weeks or so. The crashing seems NVIDIA-specific and unrelated to Windows. I’ve been dual-booting Ubuntu 18.04 and Windows and haven’t observed any issues in Windows, only in Linux. After the crashes I look in /var/log/kern.log and see messages about a segfault. See below:

esalina@2018comp:~/07252018$ grep -C 5 -PHin 'segf'  kern.log  |grep -Pi 'jul.25'
kern.log-1068-Jul 25 19:16:22 2018comp kernel: [  948.860864] raid6: .... xor() 21793 MB/s, rmw enabled
kern.log-1069-Jul 25 19:16:22 2018comp kernel: [  948.860865] raid6: using avx512x2 recovery algorithm
kern.log-1070-Jul 25 19:16:22 2018comp kernel: [  948.861773] xor: automatically using best checksumming function   avx       
kern.log-1071-Jul 25 19:16:22 2018comp kernel: [  948.876150] Btrfs loaded, crc32c=crc32c-intel
kern.log-1072-Jul 25 19:48:42 2018comp kernel: [ 2895.638579] show_signal_msg: 31 callbacks suppressed
kern.log:1073:Jul 25 19:48:42 2018comp kernel: [ 2895.638581] gnome-shell[2188]: segfault at e0 ip 00007f06cc0d4de0 sp 00007fffc238b3f8 error 4 in libmutter-clutter-2.so[7f06cc043000+161000]
kern.log-1074-Jul 25 19:48:52 2018comp kernel: [ 2906.083150] rfkill: input handler enabled
kern.log-1075-Jul 25 19:48:59 2018comp kernel: [ 2912.980816] rfkill: input handler disabled
kern.log-1076-Jul 25 19:49:03 2018comp kernel: [ 2916.919708] rfkill: input handler enabled
kern.log:1077:Jul 25 19:49:09 2018comp kernel: [ 2922.664609] gnome-shell[6936]: segfault at 0 ip 00007f66ab3e7d5b sp 00007fff8c030940 error 6 in libnvidia-glcore.so.390.77[7f66aa1f8000+141f000]
kern.log-1078-Jul 25 19:50:16 2018comp kernel: [ 2990.201657] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD data byte 94
kern.log-1079-Jul 25 19:50:21 2018comp kernel: [ 2995.589211] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD data byte 26
kern.log-1080-Jul 25 19:50:27 2018comp kernel: [ 3001.368704] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD data byte 79
kern.log-1081-Jul 25 19:50:27 2018comp kernel: [ 3001.528954] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD data byte 4
kern.log-1082-Jul 25 19:50:27 2018comp kernel: [ 3001.638088] rfkill: input handler disabled
esalina@2018comp:~/07252018$
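
For anyone reading segfault lines like the two above, here is a minimal Python sketch of how the fields can be pulled apart. It assumes the usual x86 kernel message layout (fault address, ip, sp, page-fault error code, then library[base+size]), with all numbers in hex. The names `parse_segfault` and `decode_error` are made up for this illustration and are not part of any tool.

```python
import re

# Layout of the kernel's segfault report as it appears in the log above.
SEGFAULT_RE = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_error(code):
    """Decode the low three bits of the x86 page-fault error code."""
    kind = "protection violation" if code & 1 else "page not present"
    access = "write" if code & 2 else "read"
    mode = "user" if code & 4 else "kernel"
    return f"{mode}-mode {access}, {kind}"


def parse_segfault(line):
    """Return the parsed fields of one segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "process": m["proc"],
        "pid": int(m["pid"]),
        "fault_address": int(m["addr"], 16),
        "library": m["lib"],
        # Offset of the faulting instruction inside the mapped library.
        "offset": ip - base,
        "error": decode_error(int(m["err"], 16)),
    }


if __name__ == "__main__":
    line = (
        "gnome-shell[2188]: segfault at e0 ip 00007f06cc0d4de0 "
        "sp 00007fffc238b3f8 error 4 in "
        "libmutter-clutter-2.so[7f06cc043000+161000]"
    )
    info = parse_segfault(line)
    print(info["library"], hex(info["offset"]), info["error"])
```

Read this way, both crashes are user-mode accesses to unmapped pages at very low addresses (e0 and 0), which is what a NULL-pointer dereference typically looks like. If debug symbols for the library are installed, the computed offset is the kind of value that can be handed to a tool such as addr2line.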

My concern is that I could be in the middle of doing something important when the system crashes! Some immediate questions related to this forum/post:
1) I don’t see a way to attach a file (such as a bug report file). Does a button to do so appear after I make this post?
2) After looking here https://devtalk.nvidia.com/default/topic/522835/linux/if-you-have-a-problem-please-read-this-first/ I want to run “startx” with “-- -logverbose 6”. Can anyone advise which file I should edit (Ubuntu 18.04) to change the startx command line to use verbose logging level 6?

nvidia-bug-report.log.gz (117 KB)

I say “half-usable” because I tried clicking, for example, a web browser (from “favorites” on the left panel in GNOME) but the browser didn’t start. Then I tried to switch to a terminal (by hitting Ctrl+Alt+F3), and that didn’t work either, so maybe I should have written 0-percent usable. I ended up rebooting last night after that instance of the issue. I edited the post title to reflect the system becoming unusable after a few moments. After posting, I found the attach-file button.

Why are you using acpi=off?
Tried a 396 driver?

Hi Generix,

The short answer is that I’ve used “acpi=off” because without it Linux will not boot. It hangs with an error message not unlike the one referenced in the link below.

This link and similar links gave me the idea to try “acpi=off” https://unix.stackexchange.com/questions/348806/acpi-exception-ae-not-found-infinitely-on-startup

Yes, I have tried the 396 driver, and I observed similar crashes. In fact, I have observed similar crashes with every one of the few recent NVIDIA drivers I have tested.

I’ve briefly used the driver from the Ubuntu package repository, but I don’t remember its name. Does anyone know whether using that driver instead of the official NVIDIA driver means that any functionality is lost or inaccessible, such as CUDA programs? I only used that package briefly, though, and I don’t think I’ve used it long enough to see whether it has crashes similar to the ones I attempted to describe in my first post in this thread.

-Eddie

Running the nvidia driver with ACPI turned off is not really a supported configuration. So instead of turning it off completely, just disable the GPE that triggers the initial issue, using the kernel parameter acpi_mask_gpe.
If the error message were

[    0.922778] ACPI Exception: AE_NOT_FOUND, while evaluating GPE method [_L6F] (20150619/evgpe-592)

note the _L6F, the kernel parameter to use would be

acpi_mask_gpe=0x6f
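
The mapping from the error message to the parameter can be sketched in a few lines of Python. This is only an illustration, assuming the GPE method name has the form _Lxx or _Exx with xx a two-digit hex GPE number; the function name `acpi_mask_gpe_param` is hypothetical.

```python
import re

# GPE handler methods are named _Lxx (level-triggered) or _Exx
# (edge-triggered), where xx is the GPE number in hex.
GPE_METHOD_RE = re.compile(r"GPE method \[_([LE])([0-9A-Fa-f]{2})\]")


def acpi_mask_gpe_param(message):
    """Build the acpi_mask_gpe kernel parameter for the GPE named in an
    ACPI exception message, or return None if no GPE method is found."""
    m = GPE_METHOD_RE.search(message)
    if m is None:
        return None
    return f"acpi_mask_gpe=0x{m.group(2).lower()}"


if __name__ == "__main__":
    msg = (
        "[    0.922778] ACPI Exception: AE_NOT_FOUND, while evaluating "
        "GPE method [_L6F] (20150619/evgpe-592)"
    )
    print(acpi_mask_gpe_param(msg))
```

The GPE number on a given machine may differ from the 6F in the example message, so the value should be taken from that machine's own boot log.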

Hello Generix,

So, this afternoon I replied to you from my phone because I was at work and unable to sit at my home computer. That’s where I read your question about “why acpi=off”. To answer your question in more detail, I provided a link because I couldn’t recall the exact error. It was, however, from that link or a very similar one that I got the idea to try “acpi=off”, as I mentioned.

So, I got home from work keen to boot without “acpi=off”. I edited the kernel parameters to remove “acpi=off” by hitting “e” after selecting “Ubuntu, with Linux 4.15.0-29-generic (recovery mode)”. I wanted to boot without “acpi=off” so that I could capture an error message like the one I linked to.

What actually happened is that after booting without acpi=off I did not get an error message!!! It seems that there has been a change in a recent kernel so that I no longer have to have “acpi=off” to successfully boot!!! This is great for me! I feel computationally alive!

esalina@2018comp:~$ cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-4.15.0-29-generic root=UUID=80ca43d3-f0d3-4a14-a08a-3b23d55fd682 ro quiet splash vt.handoff=1
esalina@2018comp:~$ ls -alht /boot/vmlinuz-4.15.0-2*
-rw------- 1 root root 7.9M Jul 17 11:26 /boot/vmlinuz-4.15.0-29-generic
-rw------- 1 root root 7.9M Jun 13 04:33 /boot/vmlinuz-4.15.0-24-generic
-rw------- 1 root root 7.9M May 23 13:49 /boot/vmlinuz-4.15.0-23-generic
esalina@2018comp:~$
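
The same check of the command line can be scripted. Below is a minimal Python sketch that splits a kernel command line into parameters; the helper name `parse_cmdline` is hypothetical, and quoted values containing spaces are not handled, which is a simplifying assumption.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (such as 'quiet') map to None. Only the first '='
    splits a token, so 'root=UUID=...' keeps its full value.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        params = parse_cmdline(f.read())
    if params.get("acpi") == "off":
        print("booted with acpi=off")
    else:
        print("acpi=off is not on the kernel command line")
```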

Generix, I wonder what you think about this…!

THANK YOU Generix for making this happen!!!

-eddie

I didn’t do anything.
Keep in mind that the problem you were having is not necessarily connected with this, but finding a bug is easier while running in a standard fashion.

Hello @Generix,

Well, I think you are right that the problem I had is not necessarily connected with this. I definitely cannot disagree that bug-finding is easier when running in a standard fashion.

You are right that nothing was stopping me from taking away “acpi=off”; I could have done so at any time. But your replies helped prompt me to take acpi=off out. Thank you!!!

As I recall, the unmodified kernel parameters didn’t let me boot, while acpi=off did. I think I tried without “acpi=off” after one kernel update, but apparently not with the most recent one I have (4.15.0-29-generic); I had assumed that the one I’m running now would still have the issue! Happily, I was wrong!!

I suspect, and hope, that running without acpi=off (i.e. the standard configuration, as you say) will fix the issue, but I suppose I will find out within the next few weeks. In the past I have observed the issue about once every two weeks or so!

With the kernel I’m running now and without acpi=off, I observe a few other things that make me more hopeful that things will work without issues from here on (at least on this current kernel!), and hopefully on subsequent kernel updates too:

  1. I get sound output from my NVIDIA card now. Previously I had to use the sound output on the motherboard, because trying to get sound from the NVIDIA card resulted in stuttering and horrible skipping that made any speech unintelligible.
  2. Without acpi=off, hyperthreading is now on (it seems to be disabled when acpi=off is set), so "System Monitor" now shows double the cores.
  3. Powering off using ACPI works now. Previously only restarting worked, and it was slow. Now powering off works (without me having to hit the power button on the case), and restarting still works but is faster (no delay of 5-10 seconds).
  4. Also without acpi=off, I noticed that my resolution setting does not "fall out". When GNOME starts up and I log in, my resolution defaults to 1920x1080. In my opinion, this setting makes my screen too bright, and my wallpaper doesn't look right. To address this, I open "nvidia-settings" and set the resolution to 1680x1050, which looks better in my opinion. With acpi=off, after doing this I would observe that the screen blacks out and reverts to the previous resolution, and that this would sometimes be accompanied by a segfault and the system becoming unusable (as I tried to describe in the first post in this thread). So far I have not observed this happening!
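
The hyperthreading observation in item 2 can also be checked from /proc/cpuinfo rather than System Monitor. Here is a rough Python sketch, assuming the x86 layout where each processor entry carries "siblings" (logical CPUs per package) and "cpu cores" (physical cores per package). The function name `smt_active` is made up for this illustration.

```python
def smt_active(cpuinfo_text):
    """Return True if the first package reports more logical CPUs
    ('siblings') than physical cores ('cpu cores'), False if not, and
    None if either field is missing."""
    siblings = cores = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "siblings" and siblings is None:
            siblings = int(value)
        elif key == "cpu cores" and cores is None:
            cores = int(value)
    if siblings is None or cores is None:
        return None
    return siblings > cores


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        print("hyperthreading active:", smt_active(f.read()))
```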

A point I want to make explicit with this enumerated list is that everything listed that was not working before, and that I wanted to work, is now working, and working the way I want it to!

However, additional time (perhaps 2-4 weeks) will give me more assurance that things will work this way from here on (I hope!).

-eddie