1.) Are you sure the system didn’t reboot during those 4 days? It would help to put the PowerMizer setting in a startup script.
2.) When the issue appeared again after 4 days, are you sure it was your card #0 that triggered the issue? It might be worth trying to apply the PowerMizer setting to all three cards.
3.) Instead of PowerMizer you could try locking the frequencies as explained in this thread, for example to 1000MHz-2000MHz (a rough sketch is below this list). In my case, with a minimum frequency of 1000MHz the card always stays in P0.
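For anyone who wants to try that, here is a minimal sketch of locking the clocks on all three cards from a startup script. The GPU indices 0-2 and the 1000,2000 range are only assumptions; check what your cards actually support with nvidia-smi -q -d SUPPORTED_CLOCKS, and note that the commands need root.

#!/bin/sh
# Sketch: enable persistence mode and lock the core clocks to 1000-2000 MHz
# on GPUs 0, 1 and 2 so none of them can drop back to an idle power state.
for gpu in 0 1 2; do
    nvidia-smi -i "$gpu" -pm ENABLED
    nvidia-smi -i "$gpu" -lgc 1000,2000
done

Running that from rc.local, a cron @reboot entry, or a systemd unit would cover the “startup script” part.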
@Uli1234 thanks for all of the work you’re doing to track down this issue
1.) I’m sure the system didn’t reboot during the 4-day period. I ran a simple script to monitor the situation (a rough sketch of that kind of monitor is below), and that script would not have been running if the system had been rebooted.
2.) I’m 90% sure it was still card #0, but I’ll check again next time to be sure. Just yesterday, after I posted, I saw that card #0 went to P8 briefly a couple of hours after the PowerMizer setting was applied.
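(The exact script isn’t important, but for reference, a minimal sketch of that kind of monitor, not the exact one I ran, could look like the following: one timestamped pstate line per minute, so a reboot is obvious because the process dies and the log stops growing.

#!/bin/sh
# Sketch of a simple monitor: once a minute, append a timestamp, the kernel
# uptime and the power state of each GPU to a log file. If the box reboots,
# this process dies and the log stops growing.
while true; do
    echo "$(date -Is) uptime=$(cut -d' ' -f1 /proc/uptime)s $(nvidia-smi --query-gpu=index,pstate --format=csv,noheader | tr '\n' ' ')" >> "$HOME/pstate.log"
    sleep 60
done

The filename and the 60-second interval are arbitrary.)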
Just to follow up, the issue reappeared just now and it was still with card #0, with no intervening reboots. Going to try the frequency locking approach now.
Also, fwiw, I think something is still glitchy with the GTX card, based on the fact that Chrome now freezes up every now and then and I have to restart it. I suspect that where the RTX card would cause the entire system to grind to a halt, the GTX is more stable and only the application in use is affected. Maybe something to do with the Turing architecture code?
I have this issue on both Windows and Linux. As others have pointed out, it only seems to happen when the GPU goes into a low power state. It doesn’t seem to happen if the GPU is forced to stay in low power, and it also doesn’t happen if it is kept in a high power state. The most reliable way I can reproduce it is by doing stuff that causes the GPU to flip-flop in and out of low power, but even then, it still seems very random.
I locked the frequencies and set the PowerMizer mode to maximum performance. Will let you guys know if it occurs again.
Let’s hope nvidia is making progress towards resolving this issue.
IMO, I don’t think clocks are related here. It only happens for me when the GPU is actively doing something, which is a little different from some reports here. It almost always happens when I’m in a WebEx session (Chrome using the GPU), or at least that’s when it’s the biggest problem. These days most of my apps are basically Chrome wrappers: Slack (Electron), Outlook (Chrome app), WebEx (Chrome website).
So after about 2-3 weeks, I do not think I have had an Xid 61 lock-up. However, I have had a different issue which may or may not be related to changing the minimum clock speed. Will investigate.
For me (see system specs above), I have an uptime of 32 days now, without any crashes. This is the first time this year that I have had an uptime of more than a week.
This is after sudo nvidia-smi -pm ENABLED; sudo nvidia-smi -lgc 1000,2115
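(If it helps anyone, one way to make those two commands survive a reboot is to wrap them in a small systemd oneshot unit. The unit name and paths below are just an example sketch, and the 1000,2115 range is copied from above; adjust for your cards.

# /etc/systemd/system/nvidia-lock-clocks.service (example name)
[Unit]
Description=Enable persistence mode and lock NVIDIA GPU core clocks

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/nvidia-smi -pm ENABLED
ExecStart=/usr/bin/nvidia-smi -lgc 1000,2115

[Install]
WantedBy=multi-user.target

Then enable it with systemctl enable --now nvidia-lock-clocks.service.)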
Update: Had a system lockup the same day I set the power settings, although I was able to get a shell open and shut down gracefully (which I have not previously been able to do). Got the NVRM: Xid 61 log message right before it happened.
Thanks for the feedback. It seems setting the power settings is not as effective as locking the GPU frequencies. I guess it comes from the fact that MaxPerformanceMode sets the preference to maximum performance but doesn’t prohibit the GPU from lowering its clock frequencies, so maybe after some idle time MaxPerformanceMode also decreases them. Or it might be a temperature thing: when temperatures in the GPU are too high, it will throttle down.
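To make the difference concrete, as I understand it (GPU index 0 and the clock range are only examples):

# Only sets a preference; the driver can still drop to lower clocks when idle or hot:
nvidia-settings -a "[gpu:0]/GPUPowerMizerMode=1"
# Pins the core clock range, so the GPU cannot fall back to its idle clocks at all:
sudo nvidia-smi -i 0 -lgc 1000,2000

So locking the clocks is the stronger of the two.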
Sorry, I meant the frequency locking when I said “power setting” (conflated the two in my head since frequency locking impacts power). I’m hesitant to try the “Prefer performance” setting since it dramatically increases my computer’s power usage. I don’t really consider that to be an acceptable long-term fix.
Additionally, disabling SMT did not work for me, and I still incurred an NVRM: Xid 61 message. Worth noting, though, is that the “lock up” this time was not particularly bad; it only stuttered when launching a new GPU process, and the system was otherwise quite usable. It still required a reboot to fix, though.
Right now I’m testing with SMT disabled alongside disabled XMP RAM settings.
If the XMP settings change doesn’t help, I may try the performance setting; however, I may just buy a replacement motherboard and hope that fixes it. The cost of running the GPU at performance settings would probably pay for a new motherboard by the end of the year!
It’s a shame that neither NVIDIA nor AMD has been able to diagnose this issue. It makes the computer nearly unusable for serious work.
Edit: I also wanted to mention that my GPU is plugged in via a riser cable (small form factor case). It seems unlikely that that would cause any issues, but more information can’t hurt (and maybe other people have theirs plugged in via a riser and are noticing this issue too).