Hi tdb,
It would be great to have remote access to your system, as I am still not able to replicate the issue locally.
All right, I will set it up and let you know the next time it happens. Is a private message here a good way to communicate the details or would you prefer email?
Yes, we can communicate via private message
FYI, I had another repro of this, this time after only 4 days. Not sure if it helps, but Xorg and Chrome go spinning at 100% when this occurs. If I restart the lightdm service, I get warnings in the Linux journal saying that nvidia-modeset lost display notifications for GPU:0.
Hi All,
Sysadmin at a small 3D animation studio here. We recently purchased 30 new workstations with these specs.
System details:
Asus Prime X570-P
Ryzen 3900X 3.8 GHz
Asus GeForce RTX 2070 Super
64GB DDR4 (4X16)
CentOS 7.5.1804
Kernel 5.4.0
NVIDIA Driver 430.26
LightDM + Mate Desktop
The issues we've been getting are exactly the same ones mentioned in this post (station freezes / Xid 61 lockup). The issue happens randomly. There doesn't seem to be a correlation between heavy GPU/CPU usage and the issue occurring. Often our users would freeze while doing basic email or browsing in Chrome. Do note we have one of those workstations set up as a headless server (runlevel 3) and it hasn't frozen at all. Since it's not interacting much with the GPU, I guess it's just not triggering a freeze. We can only reproduce the issue on stations in runlevel 5 with user interaction.
The troubleshooting steps we took:
With this kind of isolation done, it's pretty safe to say the X570 boards used with Turing cards are the cause of the issue. Since we have the budget, we went ahead and replaced all our X570 boards with MSI B450-Pros.
The uncertainties:
It may not be an option for some of you, as changing boards can be costly, but at least know that it is a working solution. We're happy to share more or answer questions if you have any.
sysadminfm9rx: I'm experiencing this issue on a ROG Strix B450-F board with a Ryzen 3900X and an RTX 2080 Ti.
@collinvandyck: The XID 61 freeze? With a B450 board? If so, maybe it's an ASUS thing on Gen 4 and 5 with Turing cards :( Our MSI B450s have been rock solid.
Yup, the same XID 61 freeze with a b450.
In my last post I changed the power settings to maximum performance. I hit XID 61 after only a few hours (I've been on vacation for the last two weeks). It looked like Chrome triggered it this time.
I'm out of ideas to try, so I've essentially "disabled" hardware acceleration by starting a vncserver on a virtual display and running only vncviewer on the physical desktop. If this also triggers the issue, I will report back here.
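For anyone who wants to check when and how often the Xid shows up between freezes, here is a minimal Python sketch that scans kernel log text for NVRM Xid lines. It assumes the usual `NVRM: Xid (PCI:...): <number>, ...` line shape; the sample line inside the code is made up for illustration and is not taken from anyone's log in this thread, and `find_xid_events` is just a name picked for this example.

```python
import re

# Assumed line shape (illustrative, not copied from a log in this thread):
#   ... kernel: NVRM: Xid (PCI:0000:01:00): 61, pid=1234, ...
XID_RE = re.compile(r"NVRM: Xid \((?P<dev>[^)]+)\): (?P<xid>\d+)")


def find_xid_events(log_text):
    """Return a list of (line_number, device, xid) for each Xid line found."""
    events = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        match = XID_RE.search(line)
        if match:
            events.append((lineno, match.group("dev"), int(match.group("xid"))))
    return events


# Made-up sample input, only to show the call.
sample_log = (
    "Jan 01 12:00:00 host kernel: NVRM: Xid (PCI:0000:01:00): 61, pid=1234, ...\n"
    "Jan 01 12:00:01 host kernel: some unrelated message\n"
)
for lineno, dev, xid in find_xid_events(sample_log):
    print(f"line {lineno}: Xid {xid} on {dev}")
```

Feeding it the saved output of `journalctl -k` or `dmesg` instead of the sample string should list every Xid with its line number, which makes it easier to compare timing across machines.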
Another repro of this just occurred for me. This time I barely achieved an uptime of 1 day. So what else can I provide to help diagnose this? I can grab a stack trace of Xorg/lightdm if it would help and try to see where things are spinning. Or should we try pulling in some other vendor (though that is probably a bit premature)?
Also, not sure if it helps: I am running a 3-monitor setup (all DP), and one of the monitors is not recognized on system reboot (Linux only; it works fine with Windows). I have to restart lightdm to get it to work. Locking the system (with lightdm) also causes the monitor to be lost. Is anyone experiencing this with only one monitor? It seems like this could be an issue that is more reproducible with 3 monitors (all DP).
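Since Xorg and Chrome were reported spinning at 100% when the freeze hits, one rough way to confirm which process is burning CPU (assuming the machine is still reachable, for example over SSH) is to sample `/proc/<pid>/stat` twice and compare the CPU tick counters. The sketch below is Linux-only, and the helper names `cpu_ticks`, `cpu_percent` and `sample_cpu_percent` are hypothetical names chosen for this example.

```python
import os
import time


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name is in parentheses and may contain spaces,
    # so cut at the last ')' before splitting on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before, ticks_after, interval_s, ticks_per_s):
    """Approximate CPU usage (percent of one core) over the interval."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_s * interval_s)


def sample_cpu_percent(pid, interval_s=1.0):
    """Sample /proc/<pid>/stat twice and return the CPU usage in between."""
    path = f"/proc/{pid}/stat"
    with open(path) as f:
        before = cpu_ticks(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = cpu_ticks(f.read())
    return cpu_percent(before, after, interval_s, os.sysconf("SC_CLK_TCK"))
```

Calling `sample_cpu_percent(pid)` with the PID of Xorg during a freeze should return a value close to 100 if a single thread is stuck in a busy loop. It only says that something is spinning, not where; a stack trace would still be needed for that.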
@jm4games: On our end with the Asus X570s, the issue was reproducible with single and dual monitor setups (we didn't try triple). The number of displays or port type (HDMI, DVI, DP) didn't seem to matter.
@sysadminfm9rx, that's good to know, but also unfortunate, since it just makes this harder to track down :(
Maybe we should sticky a list of known hardware configurations where this reproduces.
@jm4games I had a similar issue with one of my DP monitors (I have two) not being recognized after Xid 61. Try powering everything down (pull the plug if some devices don't have a hard power switch) and then back on. It might be that doing it for the monitor only is sufficient.
Looks like everyone affected has:
a Ryzen 3000 CPU
an NVIDIA RTX/GTX 16XX GPU
an X570 motherboard from either ASUS (most commonly) or Gigabyte (less common)
This should give NVIDIA a hint of what might be amiss.
I've also experienced this issue, but only after enabling a non-standard NVIDIA kernel driver option, so I guess my case is different.
birdie: the problem exists on B450 motherboards as well
To summarize affected configurations so far:
CPU Arch: Ryzen 3000 series (Zen 2)
OS: Arch Linux, Debian 10, Ubuntu 18
Linux Kernels: 4.+, 5.+
NVIDIA Drivers: 43*.+, 440.+
GPU: RTX Series, GTX 16**
Repro Time: 1hr ~ 28 days
Displays: 1+
Display Technology: DP, HDMI (seems more prominent on DP)
Mobo Chipset: X570, B450M
Mobo Vendors: Asus, Gigabyte, MSI
Xorg Ver: 1.20.+
PCI-E: x16, x8
NOTE: I have seen at least 3 confirmations that the GTX 1060 does not repro this issue (myself being one of them). We suspect this issue is specific to the Turing architecture.
Please correct/suggest more for this summary.
You can exclude WM and DM as they are unlikely to have any effect.
There's no "GTX 2** Super" - you can just say RTX cards ;-) Super are just the same cards with faster VRAM.
I haven't seen reports from MSI, and some people claim MSI motherboards are definitely not affected ;-)
I have the same problem.
My System:
Linux Debian 5.3.0-3-amd64
GTX 1650
i7-3770
Mainboard: MSI B75MA-P45 (Intel® B75)
Driver version 440.44
KDE
It mainly appears while gaming.
I had a GTX 1050 without any problems until I got the 1650.
I have some additional information.
It mainly happens with specific content. I play Dota, and in some games I have no problems at all, but others trigger this issue quickly and I have to reboot often. I think it depends on certain skins or heroes or something. When I watch the replay of a game where the issue occurred, I get the same problem again (so it is reproducible). It feels like the problem can be triggered by particular reflections, but lowering the video settings and disabling things has not helped much so far. In addition, it builds up: the game starts to get laggy, and if I have bad luck it continues until it crashes. In other cases it does not crash. It is more or less like some sort of memory leak. Killing the game does not make the system usable again, but the context switch makes the system completely useless. If I can issue a reboot command, the system hangs for a while but reboots in most cases; after the reboot the system acts normally again.