Random Xid 61 and Xorg lock-up

I have benchmarked the following two Nvidia environments. In both, I enabled GPU memory growth to adapt memory allocation during DNN training.

# Allocate GPU memory on demand instead of reserving it all up front
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

1st Scenario

CUDA Driver 440.100/CUDA Toolkit 10.2/cuDNN 7.6.5

In this environment, the GPU performance state is stuck at P8 for typical DNN sample models such as LeNet, AlexNet, and Inception v3, and training is very slow: a typical AlexNet run needs 270 minutes. Because of the long training time, the temperature climbs from 33 to 60+ degrees Celsius (with air conditioning) or 80 degrees (without air conditioning).

CUDA Driver 440.33 (originally for Nvidia Tesla GPUs) behaves similarly, with the same training duration. Sometimes CUDA Driver 440.33 also shows a quite odd incompatibility with CUDA Toolkit 10.2.

2nd Scenario

CUDA Driver 450.57/CUDA Toolkit 10.2/cuDNN 8.0.1

In the newer environment, training takes 22 minutes, more than 10 times faster than with CUDA Driver 440.100. The performance state flexibly moves from P8 to P2 while a DNN training task is running, and the temperature rises from 33 to 60+ degrees Celsius with air conditioning. It is just like a hungry cat meeting a poor mouse: an order-of-magnitude improvement. I know that CUDA Driver 450.57 introduces MIG (Multi-Instance GPU). Given the faster training speed, I guess the latest driver version also has more flexible clock management.

However, the newer system has a CUPTI permissions issue, so I need to run the sample AlexNet with the following command.

Stand-alone Start in the Ubuntu Terminal:

$ python dnn.py --cap-add=CAP_SYS_ADMIN

or

Start in Jupyter Notebook:

Insert the following lines into the last cell of the Jupyter Notebook to release the GPU process upon completing the training. Otherwise, it reliably hits the NVRM Xid 61 / GPU Fan ERR! & Pwr:Usage/Cap ERR issue.

from numba import cuda

cuda.select_device(0)  # choose the GPU that ran the training
cuda.close()           # release the CUDA context so the GPU process exits

Because I have several Nvidia GPUs, I can observe the huge difference between the two CUDA environments, so I have migrated all my systems to CUDA Driver 450.57. This is just my observation; I hope it is useful for your thinking.

Yes, there’s perhaps no guarantee those numbers are right during the freeze; we don’t know. I tested quite a few tweaks (snd_hda_intel, power_save_controller, SMT, “Typical Idle Usage” in BIOS, “options nvidia NVreg_RegisterForACPIEvents=1 NVreg_EnableMSI=1” in modprobe), each of which still produced a freeze with the power range set. So far, reducing the memory clock by just 133 MHz, from 3600 to 3466 (which, interestingly, measured about twice as fast on min_fps metrics, presumably thanks to an ideal 266 MHz divider), has stopped the frequent madness. Let’s see if it’s some sort of luck. Update: I briefly tried disabling the nvidia-smi tweak and got Xid 61 very quickly, so perhaps the memory setting does help.

CUDA Driver 450.57 seems to be mismatched with Ubuntu 18.04 LTS. Generally speaking, the system behaves well when the CUDA driver matches Ubuntu 18.04 LTS. I hit emergency mode while trying to get back to a normal state. The sequence of operations follows.

1. Power on with the power button

$ nvidia-smi
GPU FAN ERR! & Pwr: Usage/CAP ERR

$ dmesg -l err
[ 16.815765] pid=803, 0d02(31c4) 00000000 00000000

2. Restart

$ sudo shutdown -r now

$ nvidia-smi
GPU FAN ERR! & Pwr: Usage/CAP ERR

$ dmesg -l err
[ 16.815765] pid=841, 0d02(31c4) 00000000 00000000

3. Restart again

During booting, the following message appears:

You are in the emergency mode. After logging in, type “journalctl -xb” to view systemlogs, “systemctl reboot” to reboot, “systemctl default” or “exit” to boot into default mode.
Press Enter for maintenance
(or press control-D to continue):

4. Shutdown and Restart

I could not operate the system according to those directions, so I ignored the warning and power-cycled the machine.

Turn off the power button
Turn on the power button

Afterwards, it goes back to the normal status.


Notes:

Because of the emergency-mode issue triggered by “$ sudo shutdown -r now”, I sometimes use the following combination of commands instead. However, they do not always bring the system back to a normal state.

$ sudo rmmod nvidia_uvm
$ sudo modprobe nvidia_uvm
$ sudo reboot

If the problem recurs, I run these commands and reboot again; the system most often returns to a normal state.

I have reached an uptime of 20 days with minimum frequency of 1400 MHz without problems. I will need to reboot this weekend for some software updates, so this uptime will reset.


Some news here. Both good and bad.

I tried setting performance mode, but that didn’t help much at all; the lockups continued. As time went on, they happened more and more often, from a few times a week to a couple of times a day.

That said, I recently updated to driver version 450.57 and got an Xid error yesterday that did NOT cause a gpu slowdown/hang.

[273103.316484] NVRM: GPU at PCI:0000:0a:00: GPU-0818c2e5-6753-144d-0451-72136a530c72
[273103.316486] NVRM: GPU Board Serial Number: 
[273103.316489] NVRM: Xid (PCI:0000:0a:00): 61, pid=1003, 0d02(31c4) 00000000 00000000

It did, however, cause stuttering and hitching in Factorio, not the most graphically demanding of games, so I can only imagine what it would do with heavier games. I have not rebooted since it happened, so I can provide more info if requested.
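For anyone scanning dmesg for these events, the Xid number and owning pid can be pulled out of NVRM lines like the one above with a short script. This is just a sketch of my own, not official tooling; the function name is mine:

```python
import re

# Matches NVRM Xid kernel log lines such as:
#   NVRM: Xid (PCI:0000:0a:00): 61, pid=1003, 0d02(31c4) 00000000 00000000
XID_RE = re.compile(
    r"NVRM: Xid \((?P<bus>PCI:[0-9a-fA-F:]+)\): (?P<xid>\d+), pid=(?P<pid>\d+)"
)

def parse_xid(line):
    """Return (bus_id, xid_number, pid) from an NVRM Xid log line, or None."""
    m = XID_RE.search(line)
    if not m:
        return None
    return m.group("bus"), int(m.group("xid")), int(m.group("pid"))
```

Feeding it the output of `dmesg` or `journalctl -k` line by line makes it easy to count how often Xid 61 (vs. e.g. Xid 79) is actually firing.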

Not sure it helps, but I did notice the GPU is running at PCIe x8, due to a PCIe-to-NVMe card installed in the second PCIe x16 slot. The board automatically bifurcates the main x16 lanes into two x8 connections if both slots are in use.

Same for me:
https://forums.developer.nvidia.com/t/new-ryzen-3950x-xid-errors-segfault/

Slight update: the system has been getting slower and slower over the past few days, so the problem isn’t gone, just no longer an immediate hard lock. Going to reboot.

Living an Xid-free life for 3 weeks, thanks to the nvidia-smi patch AND lowering the main memory speed, with a generous lowest frequency of just 600 MHz, the best possible at performance level 1, adding only 3 W.
Now I have time to solve tons of other Linux problems: disappearing Bluetooth, HDMI audio, resume issues, USB 3 acting like USB 2, and many more. Since switching to Linux, it has been nothing but troubleshooting. I should have done myself a favor and gone for an Intel platform at least.

Exactly two days ago I started to get Xid 61 errors, only I seem to have a completely different system.

My motherboard is ASUS TUF Gaming X570-Plus (Wi-Fi) which means it’s PCI-E 4.0.
My GPU is GTX 1660 Ti.

Once the error occurs, nvidia-smi starts malfunctioning:

nvidia-smi
Mon Sep  7 02:07:50 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  Off  | 00000000:07:00.0  On |                  N/A |
|ERR!   50C    P5   ERR! / 130W |    837MiB /  5941MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3744      G   /usr/libexec/Xorg                 306MiB |
|    0   N/A  N/A      9377      G   ...AAAAAAAAA= --shared-files       60MiB |
|    0   N/A  N/A     44983      G   firefox                           468MiB |
+-----------------------------------------------------------------------------+

At this point I cannot set the fan speed or check the GPU temperature, but the system keeps on working as if everything is OK. There are no errors logged to the X.org log file.
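To catch this state from a monitoring script instead of eyeballing the table, one option is to grep the nvidia-smi output for ERR! fields. A minimal sketch; the helper names are my own, and it assumes nvidia-smi is on PATH:

```python
import subprocess

def smi_error_rows(smi_output):
    """Return the lines of nvidia-smi table output that contain ERR! fields."""
    return [ln for ln in smi_output.splitlines() if "ERR!" in ln]

def gpu_is_sick():
    """Run nvidia-smi and report whether any field shows ERR!.

    Returns False if nvidia-smi cannot be run at all.
    """
    try:
        out = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=10
        ).stdout
    except (OSError, subprocess.TimeoutExpired):
        return False
    return bool(smi_error_rows(out))
```

A cron job could call `gpu_is_sick()` periodically and log or alert before the desktop fully locks up.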

I’m running Fedora 32 with Linux 5.8.7 (vanilla). I haven’t changed anything in my system for the past year - it’s been rock solid so far, except for the past two days.

Hi birdie,

Please try with 450.66 driver and share results.

Hello, same on an HP Omen 15-en0004AX.
Specs: Ryzen 7 4800H, 2x 512 GB NVMe drives, Nvidia GTX 1650 Ti, 16 GB RAM
OS: Arch, kernel 5.8
Nvidia driver: 450.66
Session: KDE Plasma, Xorg
None of the solutions has worked so far. I get this error both when idling and under regular usage (more frequently when idling). Suddenly the CPU may hit 100%, the mouse works for a few seconds, none of the commands or log inspections mentioned here work in this state, and then the entire system crashes: no SSH, nothing. I need to restart to recover.
The frequency hack mentioned here can delay it somewhat, but it is very frequent for me; the longest stretch I can use the desktop is about 2.5 hours.

Some logs from crash captured using kdump

There’s nothing in the release notes which could indicate that the bug has been solved but I’ll try anyways. Thank you.

Again, unlike most people here I have a motherboard based on the X570 chipset and my GPU runs via a PCI-E 3.0 interface.

Are the system crashes still being looked into? I keep getting Xid 79 (“GPU has fallen off the bus”) now, usually after messages like:

pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:00.0
pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00000040/00006000
pcieport 0000:00:03.1: AER: [ 6] BadTLP

I’m on first-gen Ryzen with an X370 Taichi.

Hello Amrits,
Thanks for the 450.66 update. I have not encountered the xid61 error since I updated to this driver version 3-4 days ago.
I am on Mobo: ASUSTeK model: ROG STRIX X570 -F GAMING
AMD Ryzen 9 3950X
GeForce RTX 2080 SUPER
Regards

Just as an update: I never had a single Xid 61 lockup on Ubuntu with the clock hack we settled on, since I wrote about it in message 214 on June 9.
I just installed the newer driver (450.66) and removed the hack. I can confirm that the GPU clock now idles at much saner speeds, which is nice, to be honest. I will report back on whether the new driver has fixed anything.

I also noticed from here:

that the additional information tab mentions some other workarounds for bugs I hadn’t heard of:


- Disable flipping in nvidia-settings (uncheck "Allow Flipping" in the "OpenGL Settings" panel)
- Disable UBB (run 'nvidia-xconfig --no-ubb')
- Use a composited desktop

are these relevant at all?

Update
Switching to the xanmod kernel and updating GRUB with the kernel parameter “pci=nommconf” fixed this, I think; xanmod itself doesn’t seem to have any effect, though.
For me, the PCIe bus fails on this issue, rendering my system unusable (no SSH, display, sound, hard disk mounts, Wi-Fi, etc.). It seems that everything on the PCIe bus goes missing when this happens, but enabling “pci=nommconf” has done something that more or less fixes it. Does anybody have any idea why this configuration fixes it for me?

Hi elialbert,
thanks for coming back and reporting! I also installed 450.66 now and disabled our clock-fix workaround. I will let it run now for some time to see if the Xid-61 occurs again.

Thanks vinuvnair and elialbert for the update.

@Uli1234
Will await for your test results, thanks.

I am having the same issue on a B450 board too. I was having 6 to 8 crashes every day, with no issues on the other OS. For me, this:

nvidia-smi -pm ENABLED; nvidia-smi -lgc 1000,1815

made it so I have not had a crash in the last 3 days. Thank you very, very much… it was beginning to drive me crazy.

OS: Pop!_OS 20.04 LTS
MB: B450 GAMING PRO CARBON AC
CPU: R5 3600
GPU: RTX 2080 FE
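Note that the nvidia-smi one-liner above does not survive a reboot. One way to reapply it automatically at boot is a small systemd unit; this is only a sketch, the unit name and clock range are my own assumptions, so adjust the -lgc values to your card:

```
# /etc/systemd/system/nvidia-clock-limit.service (hypothetical unit name)
[Unit]
Description=Pin NVIDIA GPU clocks to work around Xid 61 lockups

[Service]
Type=oneshot
# Enable persistence mode, then lock graphics clocks to 1000-1815 MHz
ExecStart=/usr/bin/nvidia-smi -pm ENABLED
ExecStart=/usr/bin/nvidia-smi -lgc 1000,1815
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now nvidia-clock-limit.service`; `sudo nvidia-smi -rgc` undoes the lock if needed.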


Any confirmation on whether this “fix” was pushed to the Windows driver?

I have not had the issue myself on Windows since locking the card to Prefer Maximum Performance, but that also locks the card at maximum frequency 24/7, which means constant heat.