Titan V, Ubuntu 16.04LTS and 387.34 driver crashes badly

Got it. Hate doing .run installs, but did it anyway. Exact same behavior. I’m torn between giving up and sticking with Windows, which at least works, or giving up and returning the board as unfit for use.

Can you check whether Spread Spectrum is enabled in the BIOS and disable it?
If that doesn’t help, try booting with these kernel parameters:
clocksource=hpet lapic=notscdeadline
Edit: https://communities.intel.com/thread/119716

Basically the same problem here with Fedora 27, Titan V, and the 387.34 driver. It just hangs randomly, and when I SSH to my desktop from my laptop and check top, Xorg is using 100% CPU. I attached to the process with gdb as root and got the following backtrace.

(gdb) bt
#0 0x00007f9bfc116877 in ioctl () from /lib64/libc.so.6
#1 0x00007f9bf691b7d1 in ?? () from /usr/lib64/xorg/modules/drivers/nvidia_drv.so
#2 0x00007f9bf691710a in ?? () from /usr/lib64/xorg/modules/drivers/nvidia_drv.so
#3 0x00007f9bf6919e79 in ?? () from /usr/lib64/xorg/modules/drivers/nvidia_drv.so
#4 0x00007f9bf68ad26b in ?? () from /usr/lib64/xorg/modules/drivers/nvidia_drv.so
#5 0x00007f9bf6e2ab44 in ?? () from /usr/lib64/xorg/modules/drivers/nvidia_drv.so
#6 0x00000000020e9660 in ?? ()
#7 0x00000000020f9278 in ?? ()
#8 0x00000000028bb720 in ?? ()
#9 0x00000020f6919e85 in ?? ()
#10 0x0000000000000080 in ?? ()
#11 0x000000000219f650 in ?? ()
#12 0x0000000000000553 in ?? ()
#13 0x000000000000049d in ?? ()
#14 0x000000000219b4c0 in ?? ()
#15 0x00000000020d85a0 in ?? ()
#16 0x00000000027219d0 in ?? ()
#17 0x00000000004ba187 in xf86ScreenSetCursor ()
#18 0x00000000004ba484 in xf86SetCursor ()
#19 0x00000000004b8ec0 in xf86CursorSetCursor ()
#20 0x000000000058580a in miPointerUpdateSprite ()
#21 0x0000000000585a5a in miPointerDisplayCursor ()
#22 0x00000000004c73f0 in CursorDisplayCursor ()
#23 0x0000000000516c06 in AnimCurTimerNotify ()

Beyond that it’s just more Xorg and timer stuff and probably not too useful. Now that I think about it, it seems like more than a coincidence that this sometimes happens right when the cursor starts to spin because of some activity in Chrome. The cursor actually keeps spinning (animating) even though everything else seems frozen.

@ework
What kind of system setup are you using?

It’s a custom built system using an X99 chipset. I’ll attach the lshw output.
lshw.txt (46.3 KB)

It just happened again and for the exact same reason. Nautilus was waiting to load a directory and the mouse cursor started to spin. The strange thing is that when I attach with gdb and then quit, which detaches, everything comes back. So it’s waiting for something, and gdb halting the process for a second seems to unblock it.

Not overly familiar (yet) with the ASRock BIOS. Spread spectrum clock was set to “AUTO”, which doesn’t tell me much. Was able to set it to no spread spectrum. Tried that. No difference. Added the kernel parameters - no difference.

This hang is an X server bug. There’s a thread about it from November but it looks like the patches never made it into the codebase. I’ll ping Keith to try to get them merged. https://lists.x.org/archives/xorg-devel/2017-November/055144.html

As you noted, pausing the X server for a moment breaks it out of this infinite recursion loop. Attaching GDB is one way to do that. You could probably also do something like “pkill -STOP Xorg; sleep 0.5; pkill -CONT Xorg” from a script. I realize this isn’t ideal, sorry.

The reason this shows up more on Titan V is simply because cursor updates take a hair longer than they do on earlier GPU architectures.

I had my doubts, but after testing, that does seem to be it. The script method isn’t going to work for me - constantly running the script from another computer every time the thing hangs isn’t practical - but it does give me hope for an eventual solution. This was a good catch!

Thanks Aaron. If I get a chance I’ll try applying that patch from the mailing list and rebuilding the xorg rpm for Fedora. If I get around to doing that and I go a few days without any hangs I’ll report back.

Now that I know what I’m looking for, I can reproduce this problem very easily by opening my home directory from the Places menu in gnome-shell (with that add-on enabled, of course) a number of times. If I repeat this a few times quickly it always freezes up, usually by the second or third attempt. The delay in the command you provided must be too short in some cases; if I increase it, as in “pkill -STOP Xorg; sleep 1; pkill -CONT Xorg”, it seems to work more often. Otherwise it unfreezes a little bit and then freezes again, or doesn’t unfreeze at all. I have the latest Xorg package (1.19.6) building now on Fedora 27 using mock, so next I can attempt to add in the patches and test again.

It appears the maintainer of the package on Fedora is also the author of the patch you mentioned. Assuming I can confirm it fixes the problem, I’ll look into filing a Red Hat bug to have the patch included in the xorg-x11-server package for Fedora, at least until a new release is out.

Nope, that didn’t work. I rebuilt xorg-x11-server with the patch series from the email and even fixed the free() call that a reviewer from NVIDIA said caused a double-free when shutting down the server. Now the X server just crashes when the cursor changes. I believe the problem might be that when the cursor is already animated and another scenario occurs where it should be animated, the second instance freezes it. Pausing probably works because it allows the first occurrence to finish up, so the busy cursor goes away and then the second animation can begin and complete. That’s at least my guess.

This patch is supposed to fix an infinite recursion case, which is not what is happening here. If I look at the xserver git repo I see another fix, related to an infinite recursion with the cursor, that is already in the tree.

https://cgit.freedesktop.org/xorg/xserver/commit/?h=server-1.19-branch&id=4ef1aef0fbbf47c937cf421f0180cc18fc23a03e

I tried 1.19.6 and it behaves the same as 1.19.5, and it doesn’t crash like it does with the patch applied. I didn’t try the SWcursor option again with 1.19.6, but it doesn’t work on 1.19.5: the cursor just disappears randomly most of the time.

Filed a Fedora bug to get more exposure.

https://bugzilla.redhat.com/show_bug.cgi?id=1531845

Determined to find a workaround besides connecting with my phone through SSH over and over and rerunning a command to unfreeze Xorg, I went looking a bit deeper after the suggested patch failed. I can confirm the issue has to do with animated cursors. In fact it appears that once the animation makes a complete cycle you run into the problem. The cursor continues to animate until unfrozen, but everything else is hung. For the Adwaita icon theme this takes about 1 second because the cursor has 60 frames with a 16 ms delay each. So basically if you start gimp or nautilus or anything else that causes the “wait” icon to show up for more than a second, you freeze. I spent quite a bit of time trying to figure out if there was a way to disable animated cursors in Xorg, without any luck. I did end up figuring out how to modify the cursor so it’s only the first frame (i.e. static). Once I did that I could no longer reproduce the issue using the previously reliable method.
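The one-second figure follows directly from the frame count and per-frame delay quoted above. A minimal sketch of that arithmetic (the function name cycle_ms is made up for illustration; the 60 and 16 are the values reported for the Adwaita “wait” cursor):

```shell
#!/bin/sh
# Length of one full animation cycle in milliseconds.
# $1 = number of frames, $2 = delay per frame in ms.
# cycle_ms is a hypothetical helper name, not part of any tool.
cycle_ms() {
    echo $(( $1 * $2 ))
}

cycle_ms 60 16   # prints 960, i.e. just under one second
```

So any busy state lasting longer than roughly 960 ms lets the animation wrap around, which matches the point at which the hang was observed.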

Modifying the cursor was tedious and I didn’t want to start again extracting the images, getting the hot spot, rebuilding the cursor file, etc. for the second one. Since the animation now just pauses in place and kind of looks stuck, I decided to just “disable” animated cursors by replacing them with the “default” pointer. The “wait” cursor is still usable (you can click) in a normal scenario to interact with the desktop, so it’s not really as awkward as you might think. Until this gets fixed, here is how I modified the Adwaita cursor theme.

$ cd /usr/share/icons/Adwaita/cursors
$ sudo rm watch
$ sudo ln -s default watch
$ sudo rm left_ptr_watch
$ sudo ln -s default left_ptr_watch

To revert the change later I can just reinstall the cursor theme (sudo dnf reinstall adwaita-cursor-theme). This change just makes the two animated cursor types the same as the normal pointer and hence not animated. After doing this and then rebooting for good measure, I can’t reproduce the problem anymore. I think this workaround may be more acceptable for the time being than running a command over SSH to recover from the frequent hangs. You won’t know when it’s in a busy state anymore, but normally that’s very short anyway.
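For anyone who would rather not delete the original cursor files, the same workaround can be sketched as a small script that keeps a backup next to each cursor. This is only an illustration under assumptions: the function names and the .orig suffix are made up here, and the directory and cursor names are the ones from the commands above.

```shell
#!/bin/sh
# Sketch only: disable_animated, restore_animated and the ".orig" suffix
# are hypothetical choices, not part of the Adwaita theme or any package.
CURSOR_DIR=${CURSOR_DIR:-/usr/share/icons/Adwaita/cursors}
ANIMATED="watch left_ptr_watch"

# Replace each animated cursor in directory $1 with a symlink to "default",
# keeping the original file as NAME.orig.
disable_animated() {
    for name in $ANIMATED; do
        if [ -L "$1/$name" ] || [ ! -e "$1/$name" ]; then
            continue    # already a symlink, or nothing to replace
        fi
        mv "$1/$name" "$1/$name.orig"
        ln -s default "$1/$name"
    done
}

# Put the saved originals back in directory $1.
restore_animated() {
    for name in $ANIMATED; do
        if [ -e "$1/$name.orig" ]; then
            rm -f "$1/$name"
            mv "$1/$name.orig" "$1/$name"
        fi
    done
}

# Only touch the system theme when explicitly asked to (run as root).
case "${1:-}" in
    disable) disable_animated "$CURSOR_DIR" ;;
    restore) restore_animated "$CURSOR_DIR" ;;
esac
```

Saved under some name of your choosing and run with sudo and the argument disable or restore, this should have the same effect as the manual steps. A package update or reinstall of the cursor theme may still overwrite the symlinks, so the change might need to be reapplied.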

You could write a cron script which would either 1) check X.org CPU usage and run this command only when it is pegged, or 2) run the command unconditionally every 3 minutes. Alternatively, you could reassign your power button to run the command instead of shutting down your PC.
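A rough sketch of the first option is below. It samples the X server's CPU time from /proc over a short window and, if the process looks pegged, applies the same STOP/CONT pause suggested earlier in the thread. The function names, the 90% threshold, the sampling window, and the --run switch are all assumptions made for illustration; it also assumes the server process is named Xorg, as in the pkill command above.

```shell
#!/bin/sh
# Sketch only: names, threshold and timings are assumptions, not from the thread.
THRESHOLD=${THRESHOLD:-90}   # percent of one core treated as "stuck"
WINDOW=${WINDOW:-2}          # whole seconds to sample CPU usage over
PAUSE=${PAUSE:-1}            # seconds to keep Xorg stopped

# Sum utime and stime (in clock ticks) from one line of /proc/PID/stat.
ticks_from_stat() {
    rest=${1##*) }           # drop "pid (comm) " so field positions are fixed
    set -- $rest
    echo $(( ${12} + ${13} ))
}

# Percent of one core used by PID $1 over $2 seconds.
busy_percent() {
    before=$(ticks_from_stat "$(cat "/proc/$1/stat")") || return 1
    sleep "$2"
    after=$(ticks_from_stat "$(cat "/proc/$1/stat")") || return 1
    hz=$(getconf CLK_TCK)
    echo $(( (after - before) * 100 / (hz * $2) ))
}

# Succeeds when the given percentage meets the threshold.
is_spinning() {
    [ "$1" -ge "$THRESHOLD" ]
}

# Pause the X server briefly, then let it continue.
unfreeze() {
    pkill -STOP -x Xorg
    sleep "$PAUSE"
    pkill -CONT -x Xorg
}

check_once() {
    pid=$(pgrep -o -x Xorg) || return 0      # no X server running
    pct=$(busy_percent "$pid" "$WINDOW") || return 0
    if is_spinning "$pct"; then
        unfreeze
    fi
}

# Only act when asked to, so the file can also be sourced safely.
if [ "${1:-}" = "--run" ]; then
    check_once
fi
```

Run with --run from a crontab entry of a user allowed to signal the X server (root, if Xorg runs as root), this would check once per invocation. Cron cannot fire more often than once a minute, so a hang could still last up to a minute before being cleared, and a legitimately busy X server could occasionally be paused for no reason.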

I think I figured out this crash problem and sent a follow-up fix to the X.Org mailing list: https://lists.x.org/archives/xorg-devel/2018-January/055543.html

Could you please try that on top of Adam’s patches?

Sorry Aaron, I didn’t get a chance to test your patch tonight, but I will pull Adam’s patches from git, apply them with yours on top of 1.19.6, and report back.

Adam provided a pair of patches as a counter-proposal, so those would be better to test:
https://lists.x.org/archives/xorg-devel/2018-January/055548.html
https://patchwork.freedesktop.org/series/36204/

I see he’s also committed them to the freedesktop.org xserver repo. I’ll format-patch those out along with the previous animcur stuff and give it a try.

Combining the 4 patches Adam originally created with the 2 new patches he created based on your work, I no longer experience the issue. Before applying any patches, 1.19.6 still had the issue. After applying these 6 patches to 1.19.6, the problem has gone away and I can now use animated cursors again. Thanks so much for looking into this and getting it resolved so quickly. I suppose we’ll have to wait for 1.19.7 for the fix to be generally available.