Random Xid 61 and Xorg lock-up

Thanks for your SMT explanation. However, my test is not related to SMT, and SMT might not be important for the Xid 61 error.

I ran a second round of tests after tuning the Ubuntu 18.04 system. In the previous round, a kernel update caused a login issue (keyboard/mouse). After tuning the system, the second round of tests is much more accurate.

1. Issues

After the deep learning project has run for 20+ minutes at a high temperature of 80 degrees Celsius, the system keeps showing two faults: the NVRM: Xid 61 error, and ERR! in the GPU Fan and Pwr:Usage/Cap fields.

$ nvidia-smi
GPU Fan(Percentage) ERR! & Pwr:Usage/Cap ERR!

$ dmesg -l err
NVRM: Xid 61…

After entering the command “$ dmesg -l err”, my system shows only the NVRM: Xid 61 error and no other error message. In other words, the rest of the Ubuntu system keeps operating normally.
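
The two checks above can be sketched as a small script that works on saved command output. This is only an illustration: the function names are hypothetical, and the pattern assumes the kernel line looks like either "NVRM: Xid 61" (as abbreviated above) or "NVRM: Xid (PCI:...): 61, ...", which may not cover every driver version.

```shell
#!/bin/sh
# Sketch: detect the two symptoms from captured output.
# Both helpers read text on stdin, so they can be tried on saved logs.

# Succeeds if a kernel error line reports Xid 61.
# Assumption: "NVRM: Xid", an optional "(PCI:...):" part, then the number 61.
has_xid61() {
    grep -Eq 'NVRM: Xid( \(PCI:[^)]*\):)? 61([^0-9]|$)'
}

# Succeeds if the nvidia-smi table contains an ERR! field
# (Fan or Pwr:Usage/Cap, as seen in this report).
has_smi_err() {
    grep -q 'ERR!'
}

# Prints a one-line summary for a pair of saved log files.
check_gpu_logs() {
    dmesg_file=$1
    smi_file=$2
    xid=no
    smi=no
    if has_xid61 < "$dmesg_file"; then xid=yes; fi
    if has_smi_err < "$smi_file"; then smi=yes; fi
    echo "xid61=$xid smi_err=$smi"
}
```

A possible way to use it after a DNN run: save the output with "dmesg -l err > dmesg.txt" and "nvidia-smi > smi.txt", then call "check_gpu_logs dmesg.txt smi.txt".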

2. Test result

Total tests: 20
Successful runs: 16
Runs with NVRM: Xid 61 error: 3
System boot failures: 1
Failure rate with NVRM: Xid 61: 15% (3 of 20)
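
For reference, the rates follow directly from the counts above. A minimal sketch of the arithmetic (the helper name percent and the variable names are only illustrative):

```shell
#!/bin/sh
# percent COUNT TOTAL: prints COUNT/TOTAL as a percentage with one decimal.
# Returns non-zero if TOTAL is not positive.
percent() {
    awk -v c="$1" -v t="$2" 'BEGIN {
        if (t <= 0) exit 1
        printf "%.1f%%\n", 100 * c / t
    }'
}

# Counts taken from the test result above.
total_tests=20
success_runs=16
xid61_runs=3
boot_failures=1

echo "Success rate:     $(percent "$success_runs" "$total_tests")"
echo "Xid 61 rate:      $(percent "$xid61_runs" "$total_tests")"
echo "Any failure rate: $(percent "$((xid61_runs + boot_failures))" "$total_tests")"
```

With these counts the Xid 61 rate is 15.0%, and the overall failure rate including the boot failure is 20.0%.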

3. Details of Xid 61 Error

One of the three Xid 61 failures: the system booted showing the Xid 61 message and could not enter the desktop.
The other two Xid 61 failures were consecutive, i.e., one failure followed directly by another. I use the commands in the Notes at the end to recover the system after a failure occurs. After an Xid 61 error, it is necessary to reboot before running a DNN project again.

4. Environment

ASUS Motherboard
Nvidia RTX 2070 Super
Ubuntu 18.04 LTS
NVIDIA Driver 450.57
CUDA Toolkit 11.0
cuDNN 8.0.1
nvidia-persistenced.service
GPU Min/Max Clock Setting: 1200 ~ 2000 MHz
GPU Perf Level: P5 (default)

5. Conclusion

The test results show an Xid 61 failure rate of 15% (3 of 20 runs); with only 20 runs, this figure is a rough estimate. I suspect the fault lies in either the GPU driver or the GPU hardware. My GPU has trouble coping with high temperature even during a fairly short DNN run, and it has no mechanism to flexibly adjust its power level to suit the DNN workload. As a result, it is necessary to reset nvidia-persistenced.service with explicit min and max clock frequencies of 1200 and 2000 MHz. A higher min clock frequency such as 1400 MHz might give much better performance, but that is subject to practical testing.
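
As a rough illustration of pinning the clock range, the sketch below uses the nvidia-smi options --lock-gpu-clocks and --reset-gpu-clocks. This is an assumption about one possible way to apply the 1200 ~ 2000 MHz range, not necessarily how my service is configured; the function names and the EXECUTE variable are hypothetical, and the options need root and a GPU/driver that supports clock locking.

```shell
#!/bin/sh
# By default the commands are only printed. Set EXECUTE=1 to really run them.
run() {
    if [ "${EXECUTE:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# set_gpu_clock_range MIN_MHZ MAX_MHZ
# Asks the driver to keep the GPU clock between MIN_MHZ and MAX_MHZ.
set_gpu_clock_range() {
    min=$1
    max=$2
    if [ "$min" -gt "$max" ]; then
        echo "min clock ($min) exceeds max clock ($max)" >&2
        return 1
    fi
    run sudo nvidia-smi --lock-gpu-clocks="$min,$max"
}

# Returns the GPU clocks to the driver defaults.
reset_gpu_clock_range() {
    run sudo nvidia-smi --reset-gpu-clocks
}
```

For example, "set_gpu_clock_range 1200 2000" prints the command for the current setting, and "set_gpu_clock_range 1400 2000" would be the variant to try for the higher min clock.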

Notes:

When the Xid 61 problem occurs, the following combination of commands (reloading the nvidia_uvm module in addition to rebooting) might be more effective than the single command “$ sudo shutdown -r now” alone. However, the single command saves test time.

$ sudo shutdown -r now
$ sudo rmmod nvidia_uvm
$ sudo modprobe nvidia_uvm
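
The commands above could be wrapped into one helper, sketched below. The order used here (reload nvidia_uvm first, then reboot) is an assumption, since a reboot ends the session before anything after it can run; the function name and the EXECUTE variable are hypothetical. By default the helper only prints the commands, so it does not reboot by accident.

```shell
#!/bin/sh
# By default the commands are only printed. Set EXECUTE=1 to really run them.
run() {
    if [ "${EXECUTE:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Reload the nvidia_uvm kernel module, then reboot.
# A failed reload (e.g. module still in use) is reported but does not
# block the reboot, because the reboot is needed after Xid 61 anyway.
recover_after_xid61() {
    if ! { run sudo rmmod nvidia_uvm && run sudo modprobe nvidia_uvm; }; then
        echo "nvidia_uvm reload failed; rebooting anyway" >&2
    fi
    run sudo shutdown -r now
}
```

Calling "recover_after_xid61" shows the three commands in order; "EXECUTE=1 recover_after_xid61" runs them.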