I’m trying to troubleshoot an issue I’ve been having for a while. I built this PC about 6 months ago. For the first 4 months or so, I had no issues at all. For the past couple of months, though, I’ve been having issues in different games where my system freezes for 2-3 seconds at a time every 15-20 seconds or so, which renders the game unplayable, and in between freezes my FPS dips to at best 50% of normal. Sometimes rebooting fixes it, sometimes not. There doesn’t seem to be any pattern. Sometimes I’ll have a “good boot” and I can play games fine for hours or days before the issue resurfaces.
The only error indications I’m seeing are in my syslog. When the issue starts happening, I get the following Xid errors:
I’m running Arch Linux (kernel 4.17.9) with NVIDIA driver version 396.24. The issue seems to affect all games I play. At first I thought it was a Wine issue since all the games were being played with Wine, but I get the same issue running a Unigine benchmark. I tried running cuda-memtest but it reports no errors even when an application is misbehaving. nvidia-smi doesn’t report any errors, and the card is running at about 60 °C both when it’s fine and when it’s acting up.
It had crossed my mind before that it could be a hardware issue. Prior to making this post, I had already tried reseating the video card and memory. I have had this issue with the card in several different PCI slots. Unfortunately I do not have another PC to test the card in. Are there any PSU diagnostics that could be done without having a spare one?
I hadn’t heard of the Ryzen bug before, but after some Googling, I found the kill-ryzen script. I’ve let it run for upwards of 10 minutes, though, and it does not produce a segfault.
I can’t think of a software test to check the PSU. Also, a faulty PSU is rather unlikely: that would often lead to an Xid 79, which you didn’t encounter. Looking at the collection of Xids you’re getting and their meaning, I would suspect the GPU. So you’d better check the card in another system, and hope it’s still under warranty if it fails.
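To see at a glance which Xid codes show up and how often, something like the following rough sketch can tally them from kernel log text. It assumes the driver’s messages look like "NVRM: Xid (PCI:<bus id>): <code>, <message>"; the function name count_xids is just an illustration, not part of any NVIDIA tool, and the pattern may need adjusting if your log lines differ.

```python
import re
from collections import Counter

# Assumed message shape: "NVRM: Xid (PCI:<bus id>): <code>, <message>"
# Adjust the pattern if your kernel log formats these lines differently.
XID_RE = re.compile(r"NVRM: Xid \((?P<dev>[^)]*)\): (?P<code>\d+)")


def count_xids(lines):
    """Return a Counter mapping each Xid code (int) to its number of occurrences."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[int(match.group("code"))] += 1
    return counts
```

One way to use it is to pipe the output of journalctl -k into a small script that calls count_xids(sys.stdin) and prints the result. The tally only summarizes what was logged; it does not say which component is at fault.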
Given the number of things I’ve tried OS-wise, I agree that it seems like it could be a hardware issue. Are there any more utilities or tests I can use to determine for sure whether the GPU is faulty? I don’t want to spend hundreds of dollars on a new card just to find out that it wasn’t the source of the problem…
UPDATE in case anyone finds this thread in Google in the future and is banging their head against a wall like I was.
I think I might have finally narrowed down the problem to bad memory. I had tried different DIMMs and reseating my memory modules multiple times, but I just now tried running with one stick of RAM at a time (I have 2x 8GB sticks). The first stick I tried was terrible. As soon as I opened any game things went to hell immediately with XID errors, terrible FPS, and system lockups. With the other stick, though, I’ve had zero XID errors so far and my games are running solid at 60 FPS.
I’ll update the thread if the issue crops up again but I’m pretty confident this was the problem, even though my system only had issues with gaming and the only errors I was getting were from the GPU. I’m not very familiar with memory addressing and where things are physically stored, but I’m assuming my “good boots” were just me getting lucky with the OS not trying to utilize the portions of memory that were corrupted.
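For anyone curious what a basic memory pattern check looks like, here is a minimal sketch: write a few byte patterns across a buffer, read them back, and report any mismatches. The function name pattern_test and the choice of patterns are assumptions for illustration. A userspace program like this cannot choose which physical RAM it gets, which fits the “good boot” guess above, so a clean result proves very little; a bootable tester such as MemTest86, or swapping sticks one at a time as I did, covers far more.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each byte pattern and return a list of mismatches.

    Each mismatch is a tuple (offset, expected_byte, actual_byte).
    An empty list means every byte read back as written.
    """
    buf = bytearray(size_bytes)
    mismatches = []
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buf[:] = expected  # write the pattern over the whole buffer
        if buf != expected:  # fast whole-buffer comparison first
            mismatches.extend(
                (offset, pattern, buf[offset])
                for offset in range(size_bytes)
                if buf[offset] != pattern
            )
    return mismatches
```

Calling pattern_test(256 * 1024 * 1024) would exercise roughly 256 MB of whatever memory the OS hands out at that moment (plus about the same again for the comparison copy).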