Been redirected here by support due to using Linux as a daily driver.
After replacing my ancient GTX 770 with a new RTX 4060 I’ve been having numerous hard lockups.
The system occasionally locks up when:
Hovering over a YouTube video thumbnail.
Playing a game while having YouTube running on another monitor.
Playing certain games that include video playback.
I have a save file for Dyson Sphere Program that causes a hard freeze within 5 seconds of loading it, without any user input. Great for testing, terrible for fun. It also means it’s not a thermal issue because nothing has time to get hot.
I’ve monitored temperatures and CPU, PCIe and GPU loads. Temperatures are low, and loads are low to non-existent when the whole thing locks up.
I performed a full memtest cycle which found no errors.
I performed a full CPU test which found no errors.
Minimum 550 watt power supply - My 900 watt power supply far exceeds that.
Latest chipset driver - Using latest kernel provided by distro.
Latest GPU driver - Using version 535.104.05, but experienced similar issues on 535.98.
I can only conclude that the GPU hardware and/or GPU driver is faulty.
I don’t have any further diagnostic information due to the nature of the problem.
Since I can reproduce this issue very reliably by loading a save file, this system makes for a great test dummy. I also have a second system I can use as an rsyslog target. The problem is, I don’t know what information would be useful in solving the problem.
does it freeze app only or full screen with cursor, like this?
I don’t particularly see the relation with the topic you linked. They are speaking of two laptops, one with an RTX 4070 over HDMI and another with a GTX 1660 Ti over HDMI/USB-C (to HDMI I suppose). I do not use a laptop and my system has no IGP. Their freezes are also different (only the display locks up, or it stutters until the display locks up).
In my case the whole system is frozen. Pressing Num lock at this point does not toggle the keyboard LED meaning interrupts are no longer being processed by the kernel.
There is no mention about fullscreen vs. windowed mode in that topic either. Perhaps you wanted to link a different topic?
It’s been about a month of multiple crashes daily. Half the games I own are no longer playable. Can’t even watch a video while browsing the web. No response, no suggestions, no fix, nothing. Very disappointing.
Upgraded kernel to 5.15.0-87. Upgraded driver to 535.113.01.
System still locks up frequently. Freezes have been “different” however since upgrade:
Able to load the Dyson Sphere Program savegame now that previously froze within 5 seconds and play for ~15-20 minutes before inevitably locking up.
I can play Factorio and Per Aspera for hours without issues.
Having YouTube (Chrome) playing on the secondary display however causes freezes.
Similarly, just reading Steam (Firefox) on the primary and having YouTube (Chrome) on the secondary causes freezes.
Starting to get the feeling that the issue is streaming data to VRAM while it’s busy rendering. This ancient chipset still has a front-side bus, no physical memory access. Just maybe, maybe the driver isn’t waiting long enough for transfers to complete and is triggering page faults in the memory controller?
Considering swapping my 4060 Ti with my son’s 3060 Ti. Performance-wise I’m CPU bound anyway.
Haven’t tried Windows to see if it’s a GPU hardware problem. While loading Windows on a USB drive and booting from it is possible, it’s annoying and very time-consuming to do.
Anyway, I see 535.129.03 is available. I suppose I can try that first. But considering two related threads below, it’s not very encouraging to try that particular driver version…
No. The system goes into a hard freeze. Pressing Num Lock no longer toggles the LED meaning regular interrupts are no longer being serviced. Hard freeze = no logs, nothing. Dead as a brick. All I can do is power cycle.
I’ve been trying to enable the NMI watchdog specifically designed for these kinds of situations in order to get a stack trace. Haven’t had much time yet to play with it tho.
I think I set the right kernel flags but I don’t think even half a stack trace will fit on 640x480 or whatever VESA resolution kernel will decide to use for printing. Considering this is a GPU/driver problem I’m not sure I’ll get to see anything at all to be honest.
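For anyone wanting to double-check the kernel flags, here is a minimal sketch in Python that looks for an `nmi_watchdog=` entry on the kernel command line. The helper names `parse_cmdline` and `nmi_watchdog_setting` are made up for illustration, and the parser assumes no quoted values containing spaces. It only reports what was passed at boot; it does not prove the watchdog is actually armed.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into a {name: value} mapping.

    Bare flags without '=' map to an empty string. Quoted values that
    contain spaces are NOT handled (simplifying assumption).
    """
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def nmi_watchdog_setting(cmdline: str):
    """Return the value given to nmi_watchdog=, or None if it is absent."""
    return parse_cmdline(cmdline).get("nmi_watchdog")


# Hypothetical usage on a running system:
#   with open("/proc/cmdline") as f:
#       print(nmi_watchdog_setting(f.read()))
```

Reading `/proc/cmdline` after a reboot shows whether the bootloader change actually took effect, which is worth confirming before waiting for the next freeze.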
It would be nice to get a kernel dump but setting up kdump is a bit challenging when you’ve never used it before. I happen to know a kernel dev and I want to ask him about it but he wasn’t in today.
Meanwhile, I’ve upgraded to kernel 5.15.0-88 and driver 535.129.03. So far it has made little difference.
With a 900 watt power supply, less than 30% GPU load on a motherboard with loads of overclocking features? Shouldn’t be. I could test that hypothesis with some CUDA workload but it’s extremely unlikely.
Before this I had a GTX 770 (230 watts), now an RTX 4060 Ti (160 watts). There should be no issues as power consumption went down with this upgrade.
Haven’t had much luck with the NMI watchdog, I’ll keep trying.
It’s not about total power or PSU but the mainboard’s pcie power. Specified for up to 75W but newer gpus are known to draw more. You could try to limit clocking to prevent the gpu from going into boost state
nvidia-smi -lgc 300,1500
and see if that helps.
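If you want to try several clock ranges, a small wrapper may help. This is only a sketch in Python: the function names are invented for illustration, `-lgc` is the option from the command above, and `-rgc` is the nvidia-smi option for resetting the GPU clocks. It defaults to a dry run and just returns the command line, since actually changing clocks normally needs root.

```python
import subprocess


def lock_clocks_cmd(min_mhz: int, max_mhz: int) -> list:
    """Build the nvidia-smi command that locks GPU clocks to a range (MHz)."""
    if not 0 < min_mhz <= max_mhz:
        raise ValueError("expected 0 < min_mhz <= max_mhz")
    return ["nvidia-smi", "-lgc", f"{min_mhz},{max_mhz}"]


def reset_clocks_cmd() -> list:
    """Build the nvidia-smi command that undoes a previous clock lock."""
    return ["nvidia-smi", "-rgc"]


def run(cmd: list, dry_run: bool = True) -> str:
    """Return the command as a string; execute it only if dry_run is False."""
    if not dry_run:
        # Raises CalledProcessError if nvidia-smi exits with an error.
        subprocess.run(cmd, check=True)
    return " ".join(cmd)
```

For example, `run(lock_clocks_cmd(300, 1500))` returns the same command as above without executing it, and `run(reset_clocks_cmd(), dry_run=False)` would restore the default behaviour.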
I didn’t read your last comment until now and it took me a while to figure out how to do it because NVIDIA settings doesn’t let you lock the clock speeds or set the power limits.
And it actually works. Chrome still manages to freeze the whole system occasionally but I’m able to play games now for extended periods that before would freeze everything within minutes.
Displayed GPU load appears to be relative, before it was 30% now it’s near 90%. Powermizer also appears to be worthless, it shows “performance levels” it wants to use, not what it’s actually running. nvidia-smi shows correct values, memory clock varies but maxes out at 9000 while GPU clock stays at 405. Card temperature hovers around 42C, fans remain idle. Reported power usage remains below 40 watts.
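For logging these values over time (for example to the second machine), something along these lines could work. It is a rough Python sketch, not a tested monitoring tool: it assumes the `nvidia-smi --query-gpu` fields `clocks.gr`, `clocks.mem`, `power.draw` and `temperature.gpu` with `--format=csv,noheader,nounits`, reads only the first GPU, and does not handle fields reported as `[N/A]`. The function names are my own.

```python
import subprocess

# Query fields, in the order nvidia-smi is asked to print them.
FIELDS = ["clocks.gr", "clocks.mem", "power.draw", "temperature.gpu"]


def parse_sample(line: str) -> dict:
    """Turn one CSV line of nvidia-smi output into {field: float}."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}")
    return {name: float(value) for name, value in zip(FIELDS, values)}


def query_gpu() -> dict:
    """Run nvidia-smi once and parse the line for the first GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_sample(result.stdout.splitlines()[0])
```

Calling `query_gpu()` in a loop with a short sleep and appending each result to a file gives a record of the last clocks, power draw and temperature before a freeze. The sample line in a quick check such as `parse_sample("405, 9000, 38.5, 42")` is illustrative only, not captured output.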
I haven’t tried other values yet, for now I’m just happy I can use my system again.
This does raise the question… clearly these cards are out-of-spec in regards to main board power draw. Various sources seem to suggest this is software-controlled. Will upgrading the VBIOS make a difference? It currently has 95.06.1A.40.66.
Finally went and replaced my 4060 Ti with my son’s 3060 Ti and all my problems went away. The occasional lock-up once every 2-3 weeks is manageable, multiple lock-ups daily are not.
Did not reinstall my driver, nor upgrade the kernel, only swapped video cards. Don’t have to throttle the card using nvidia-smi either. It is consuming double the power with no problems.
The 4060 Ti is not physically faulty, my son has been using it without issues the entire time.
I am still convinced this is a driver problem. Throttling the 4060 Ti just made the problems appear less frequently, but they still occurred. Over time, even throttled, the 4060 Ti would have the same problem playing Dyson Sphere Program and lock up just minutes after loading a save. Without a hardware riser and a scope I can never disprove power draw issues but as a developer this reeks of a race condition.
As far as the 3060 Ti locking up much less frequently… I guess the driver bits for this specific card are simply more mature.