Random Xid 61 and Xorg lock-up

I certainly hope it’s not a placebo.

For months my machine was rarely stable for more than about 3 days. I've applied the fix (locked clocks between 800 and 2130 MHz) and have been crash-free for a couple of weeks now. Knocks on wood.

The card is consuming a considerable amount of additional power while the lowest 3 power stages are inaccessible, though.

I noticed around 660 MHz in your copy-paste; I went back up to an 800 MHz minimum because I saw occasional dips down to PCIe Gen 1 below that.
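For anyone trying the same thing, this is roughly the one-off form of the lock; the 800/2130 values are just what suits my card, so check your own card's supported range first:

$ nvidia-smi -q -d SUPPORTED_CLOCKS        (list the clock ranges the card supports)
$ sudo nvidia-smi -pm 1                    (enable persistence mode)
$ sudo nvidia-smi -lgc 800,2130            (lock GPU clocks to the 800-2130 MHz range)
$ sudo nvidia-smi -rgc                     (reset the lock if needed)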

The saga is happening on:
Ubuntu 20.04, AMD 3960X, ASRock Taichi, MSI RTX 2070 Super Gaming X Trio

Hi Guys:

I have the same Xid 61 issue with an RTX 2070 Super on an ASUS motherboard.

After I bought the RTX 2070 Super on June 18, it worked for about 20 days with no problem. Frankly speaking, I rarely used the GPU during that time.

After I ran a deep learning model for about 20 minutes on CUDA 10.1 and cuDNN 7.3, the RTX 2070 Super reached a temperature of 80 °C. Since then, it has had the Xid 61 issue.

Because of the problem, I upgraded to CUDA driver 450.57 in the middle of July. Since then, I have been regularly testing the application in the following environment. The Xid 61 error has appeared quite randomly over those 20 days.

1. Environment

Nvidia RTX 2070 Super
Ubuntu 18.04 LTS
CUDA Driver 450.57
cuDNN 8.0.1
nvidia-persistenced.service

2. Issues

$ dmesg -l err

[ 17.523739] NVRM: Xid (PCI:0000:01:00): 61, pid=819, 0d02(31c4) 00000000 00000000

$ nvidia-smi

| 0 GeForce RTX 207… On | 00000000:01:00.0 On | N/A |
|ERR! 32C P8 ERR! / 215W | 248MiB / 7981MiB | 0% Default

I checked NVIDIA's documentation, which says Xid 61 means "Internal micro-controller breakpoint/warning (newer drivers)":

https://docs.nvidia.com/deploy/xid-errors/

The error is quite annoying. I think it is caused by a combination of factors, including the CUDA driver, thermals, and the GPU itself. I hope NVIDIA can give a clear indication of whether it is a hardware fault or a system error.

With nvidia-persistenced running, I set lock-gpu-clocks (lgc) to 1200,2000 (min and max). The minimum clock corresponds to the P5 level in the nvidia-smi interface. I then tested 20 system boots, of which 3 produced NVRM: Xid 61. One of those three followed a previous boot that showed the message "Recovering journal".

If the Linux system failed to start, I used the following command to recover it to a normal state; otherwise it would hit the NVRM Xid 61 issue again.

$ sudo shutdown -r now

In conclusion, the failure rate in this test is 15%. However, that is much lower than the failure rate I previously saw at the P8 level; in other words, the P8 setting in the nvidia-smi interface has a much higher failure rate. The observed pattern is: the higher the minimum clock frequency, the lower the failure rate. The randomness still exists but stays within a small range. Therefore it seems somewhat more likely to be related to a CUDA driver failure or even a GPU hardware failure.

Before running the above test, I permanently set the lgc in the nvidia-persistenced service as follows.

1. Open nvidia-persistenced.service

$ sudo gedit /lib/systemd/system/nvidia-persistenced.service

2. Edit the content

Filename:

nvidia-persistenced.service

Content:

[Unit]
Description=NVIDIA Persistence Daemon (locks minimum GPU clock to 1200 MHz)
Wants=syslog.target
StopWhenUnneeded=true

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose
ExecStart=/usr/bin/nvidia-smi -lgc 1200,2000
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target
RequiredBy=nvidia.service

3. Reboot

$ sudo reboot
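After the reboot, the lock can be sanity-checked with standard systemd and nvidia-smi queries; nothing here is specific to my setup:

$ systemctl status nvidia-persistenced     (daemon should be active)
$ nvidia-smi -q -d CLOCK                   (current clock readings)
$ nvidia-smi --query-gpu=pstate,clocks.sm,clocks.mem --format=csv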

Cheers.

Good input. Since switching SMT, changing power_save_controller, turning off the competing internal audio, and underclocking the CPU (besides the nvidia-smi power management tweak) didn't do anything, I can now say the Xid 61 comes even in the P0 state:

GeForce RTX 2070 SUPER, 00000000:07:00.0, 440.100, P0, 3, 65, 1935 MHz, 7000 MHz, 173.89 W
GeForce RTX 2070 SUPER, 00000000:07:00.0, 440.100, P0, 3, 60, 1935 MHz, 7000 MHz, 84.64 W
GeForce RTX 2070 SUPER, 00000000:07:00.0, 440.100, P0, 3, 58, [Unknown Error], [Unknown Error], [Unknown Error]

It happened right at the end of the benchmark. Perhaps the fix is to keep the card busy nonstop, e.g. by running glxgears on the desktop?
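For reference, output like the one above comes from polling nvidia-smi; a query roughly like the following reproduces that format, though the exact field list here is my reconstruction rather than the original command:

$ nvidia-smi --query-gpu=name,pci.bus_id,driver_version,pstate,pcie.link.gen.current,temperature.gpu,clocks.sm,clocks.mem,power.draw --format=csv,noheader -l 5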

Thanks for your SMT explanation. However, my test is not related to SMT; SMT might not be important for the Xid 61 error.

I have run a second round of tests after tuning the Ubuntu 18.04 system. The previous test had a keyboard/mouse login issue caused by a kernel update. After tuning the system, the second round of tests is much more accurate.

1. Issues

After running the deep learning project for 20+ minutes at a high temperature of 80 °C, the card keeps showing both the NVRM: Xid 61 error and the GPU Fan and Pwr: Usage/Cap errors.

$ nvidia-smi
GPU Fan(Percentage) ERR! & Pwr:Usage/Cap ERR!

$ dmesg -l err
NVRM: Xid 61…

My system shows only the NVRM: Xid 61 error and no other error messages after entering "$ dmesg -l err". In other words, the Ubuntu system itself keeps operating normally.

2. Test result

Test counts: 20
Success times: 16
Times of NVRM: Xid 61 Err: 3
System Boot Failure: 1
Failure Rate with NVRM Xid 61: 15%

3. Details of Xid 61 Error

One of the three Xid 61 occurrences: the boot showed the Xid 61 message and the system could not enter the desktop.
The other two Xid 61 occurrences were consecutive, i.e. one failure followed by another. I use the commands listed in the Notes below to recover the system after a failure. For the Xid 61 error, it is necessary to reboot and then run a DNN project again.

4. Environment

ASUS Motherboard
Nvidia RTX 2070 Super
Ubuntu 18.04 LTS
CUDA Driver 450.57
CUDA Toolkit 11.0
cuDNN 8.0.1
nvidia-persistenced.service
GPU Min/Max Clock Setting: 1200 ~ 2000 MHz
GPU Perf Level: P5 (default)

5. Conclusion

The test results show an Xid 61 failure rate of 15% (3 out of 20 boots) under this configuration. I estimate it is a fault of either the CUDA driver or the GPU hardware. My GPU has trouble dealing with high temperatures even for a fairly short DNN run, and it has no mechanism to flexibly adjust its power level to suit the DNN workload. As a result, it is necessary to configure nvidia-persistenced.service with explicit minimum and maximum clock frequencies of 1200 and 2000 MHz. A higher minimum clock such as 1400 MHz might perform better still, but that is subject to actual testing.

Notes:

The following composite commands might be more effective than the single command "$ sudo shutdown -r now" when hitting the Xid 61 problem. However, the single command saves test time.

$ sudo shutdown -r now
$ sudo rmmod nvidia_uvm
$ sudo modprobe nvidia_uvm

What we don't know here is whether, at the moment the error occurred, the card was changing PCIe link speed or power modes, but because of the freeze it was not able to print the next line in the terminal window. Your terminal just shows that it was in state P0 before Xid 61 occurred. You know what I mean?

Thanks for the input, quite interesting. Does the error occur in your case when the graphics card is getting very hot? That might be a hint that there is some internal power switching going on. All cards should have a safety mechanism to prevent overheating.

I'm sorry, I shouldn't have written that we have a root cause - we do not. We know that the PCIe Gen3->Gen1 switch is what triggers the problem, but we do not yet understand why. This is thought to be a platform bug and is being investigated as such. I'll update the thread as soon as I have more.
There will be some software changes to make the issue less likely to happen, but a real fix still depends on finding the root cause and that involves multiple vendors and a lot more investigation.

Maybe that's why I'm able to reproduce the problem 100% of the time: my motherboard is PCIe 1.0, so I'm not able to boot with NVIDIA's driver, only with Nouveau. Also, an AMD 5700 XT works fine in that machine while running Folding@home, so it shouldn't be a hardware issue. And I do realize that I can't take full advantage of the 1660 Ti with an old computer like mine, but it should still be usable.

The problem on PCIe Gen 1 motherboards is very similar indeed, but that one is likely a driver bug. We are also working on it.


I have benchmarked the following two NVIDIA environments. In both of them I used the GPU memory growth setting to adapt to DNN training:

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all at startup
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

1st Scenario

CUDA Driver 440.100/CUDA Toolkit 10.2/cuDNN 7.6.5

In this environment, the GPU perf level stays pinned at P8 for typical DNN sample models such as LeNet, AlexNet, and Inception v3, and training is very slow: a typical AlexNet run takes 270 minutes. Because of the long training time, the main changing factor is temperature, which climbs from 33 to 60+ °C (in an air-conditioned room) or 80 °C (without air conditioning).

The classic 440.33 driver (originally for NVIDIA Tesla GPUs) behaves similarly, with the same training duration. Sometimes driver 440.33 also shows a quite odd incompatibility with CUDA Toolkit 10.2.

2nd Scenario

CUDA Driver 450.57/CUDA Toolkit 10.2/cuDNN 8.0.1

In the newer environment, training takes 22 minutes, more than 10 times faster than with driver 440.100. The GPU perf level moves flexibly from P8 up to P2 when a DNN training task is running, and the temperature grows from 33 to 60+ °C in an air-conditioned room. It is just like a hungry cat meeting a poor mouse: an order-of-magnitude improvement. I understand that 450.57 adds MIG (Multi-Instance GPU); given the much faster training speed above, I guess the latest driver also has more flexible clock management.

However, the newer environment has a CUPTI issue, so I need to run the sample AlexNet with the following command (see also the side note after the Jupyter snippet below).

Stand-alone Start in the Ubuntu Terminal:

$ python dnn.py --cap-add=CAP_SYS_ADMIN

or

Start in Jupyter Notebook:

Insert the following lines into the last cell of the Jupyter notebook to release the GPU process once training completes. Otherwise it reliably runs into the NVRM Xid 61 / GPU Fan ERR & Pwr: Usage/Cap ERR issue.

from numba import cuda

# Release the CUDA context held by this process so the GPU is freed after training
cuda.select_device(0)
cuda.close()
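A side note on the CUPTI issue mentioned above: in my understanding it is usually the driver restricting GPU performance counters to admin users, which is why adding CAP_SYS_ADMIN helps in containers. Outside of containers, the commonly documented workaround is a module option; whether it applies to this exact setup is an assumption on my part, and the file name below is arbitrary:

$ echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' | sudo tee /etc/modprobe.d/nvidia-profiling.conf
$ sudo reboot

(Depending on the setup, the initramfs may also need to be regenerated before the reboot.)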

I can observe the huge difference between the two CUDA environments because I have several NVIDIA GPUs, so I am migrating all my systems to the latest driver, 450.57. That is my observation; I hope it is useful for your thinking.

Yes, there's perhaps no guarantee those numbers are right during the freeze; we don't know. I've tested quite a few tweaks (snd_hda_intel, power_save_controller, SMT, "Typical Idle Usage" in the BIOS, "options nvidia NVreg_RegisterForACPIEvents=1 NVreg_EnableMSI=1" in modprobe), each of which still produced a freeze with the power range set. So far, reducing the memory clock by just 133 MHz, from 3600 to 3466 (which, interestingly, measured as twice as performant on min-fps metrics, presumably thanks to the clean 266 MHz divider), has stopped the frequent madness. Let's see if it's some sort of luck. Update: I briefly tried disabling the nvidia-smi tweak and got Xid 61 very quickly, so perhaps the memory setting does help.

CUDA driver 450.57 seems to be a mismatch with Ubuntu 18.04 LTS. Generally speaking, the system behaves quite well when the CUDA driver matches Ubuntu 18.04 LTS. I hit emergency mode while trying to get back to a normal state. The sequence of operations was as follows.

1. Start with the power button

$ nvidia-smi
GPU FAN ERR! & Pwr: Usage/CAP ERR

$ dmesg -l err
[ 16.815765] pid=803, 0d02(31c4) 00000000 00000000

2. Restart

$ sudo shutdown -r now

$ nvidia-smi
GPU FAN ERR! & Pwr: Usage/CAP ERR

$ dmesg -l err
[ 16.815765] pid=841, 0d02(31c4) 00000000 00000000

3. Restart again

During boot, the following message appears:

You are in emergency mode. After logging in, type "journalctl -xb" to view system logs, "systemctl reboot" to reboot, "systemctl default" or "exit" to boot into default mode.
Press Enter for maintenance
(or press control-D to continue):

4. Shutdown and Restart

I could not get the system working by following the directions above, so I ignored the warning and power-cycled the machine:

Turn off the power button
Turn on the power button

Afterwards, it goes back to the normal status.

Again, CUDA driver 450.57 seems to be a mismatch with Ubuntu 18.04 LTS; generally speaking, the system behaves quite well when the CUDA driver matches Ubuntu 18.04 LTS.

Notes:

Because of the emergency-mode issue triggered by "$ sudo shutdown -r now", I sometimes use the following composite commands instead. However, even these commands sometimes fail to bring the system back to a normal state.

$ sudo rmmod nvidia_uvm
$ sudo modprobe nvidia_uvm
$ sudo reboot

If that happens, I run the commands again; the system most probably comes back to a normal state.

I have reached an uptime of 20 days with minimum frequency of 1400 MHz without problems. I will need to reboot this weekend for some software updates, so this uptime will reset.


Some news here. Both good and bad.

I tried setting performance mode, but that didn't help much at all; lockups didn't seem to be reduced. As time went on, they were happening more and more often, from a few times a week to a couple of times a day.

That said, I recently updated to driver version 450.57 and got an Xid error yesterday that did NOT cause a gpu slowdown/hang.

[273103.316484] NVRM: GPU at PCI:0000:0a:00: GPU-0818c2e5-6753-144d-0451-72136a530c72
[273103.316486] NVRM: GPU Board Serial Number: 
[273103.316489] NVRM: Xid (PCI:0000:0a:00): 61, pid=1003, 0d02(31c4) 00000000 00000000

It did, however, cause stuttering and hitching in Factorio, which is not the most demanding of games graphically, so I can only imagine what it would do with heavier games. I have not rebooted since it happened, so I can provide more info if requested.

Not sure if it helps, but I did notice the GPU is running at PCIe x8, due to a PCIe-to-NVMe card installed in the second PCIe x16 slot. The board automatically bifurcates the main x16 lanes into two x8 connections when both slots are in use.
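In case it helps with the PCIe angle discussed earlier in the thread, the current link generation and width can be checked at runtime with standard tools; the bus address below is taken from the dmesg lines above:

$ nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv
$ sudo lspci -vv -s 0a:00.0 | grep LnkSta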

Same for me:
https://forums.developer.nvidia.com/t/new-ryzen-3950x-xid-errors-segfault/

Slight update: the system has been getting slower and slower over the past few days, so the problem isn't gone, just not an immediate hard lock. Going to reboot.

Living an Xid-free life for 3 weeks, thanks to the nvidia-smi tweak AND lowering the main memory speed, with a generous minimum frequency of just 600 MHz, the best possible at performance level 1, adding only 3 W.
Now I have time to solve tons of other Linux problems: disappearing Bluetooth, HDMI audio, resume issues, USB 3 acting like USB 2, and many more. Since switching to Linux it has been nothing but troubleshooting. I should have done myself a favor and gone for an Intel platform at least.

Exactly two days ago I started getting Xid 61 errors, only I seem to have a completely different system.

My motherboard is ASUS TUF Gaming X570-Plus (Wi-Fi) which means it’s PCI-E 4.0.
My GPU is GTX 1660 Ti.

Once the error occurs, nvidia-smi starts malfunctioning:

nvidia-smi
Mon Sep  7 02:07:50 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  Off  | 00000000:07:00.0  On |                  N/A |
|ERR!   50C    P5   ERR! / 130W |    837MiB /  5941MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3744      G   /usr/libexec/Xorg                 306MiB |
|    0   N/A  N/A      9377      G   ...AAAAAAAAA= --shared-files       60MiB |
|    0   N/A  N/A     44983      G   firefox                           468MiB |
+-----------------------------------------------------------------------------+

At this point I cannot set the fan speed or check the GPU temperature, but the system keeps working as if everything is OK. There are no errors logged to the X.org log file.

I’m running Fedora 32 with Linux 5.8.7 (vanilla). I haven’t changed anything in my system for the past year - it’s been rock solid so far, except for the past two days.

Hi birdie,

Please try with 450.66 driver and share results.