1050ti - FPS drops and freezing

I’m trying to troubleshoot an issue I’ve been having for a while. I built this PC about 6 months ago. For the first 4 months or so, I had no issues at all. For the past couple of months, though, I’ve been having issues in different games where my system freezes for 2-3 seconds at a time every 15-20 seconds or so, which makes the game unplayable. In between freezes my FPS dips to at best 50% of normal. Sometimes rebooting fixes it, sometimes not. There doesn’t seem to be any pattern. Sometimes I’ll have a “good boot” and I can play games fine for hours or days before the issue resurfaces.

The only error indications I’m seeing are in my syslog. When the issue starts happening, I get the following Xid errors:

Jul 25 16:58:42 hyperion kernel: NVRM: Xid (PCI:0000:08:00): 8, Channel 00000018
Jul 25 16:58:42 hyperion kernel: NVRM: Xid (PCI:0000:08:00): 32, Channel ID 00000018 intr 00040000
Jul 25 16:58:49 hyperion kernel: NVRM: Xid (PCI:0000:08:00): 13, Graphics Exception: EXTRA_MACRO_DATA
Jul 25 16:58:49 hyperion kernel: NVRM: Xid (PCI:0000:08:00): 13, Graphics Exception: ESR 0x404490=0x80000002
Jul 25 16:58:49 hyperion kernel: NVRM: Xid (PCI:0000:08:00): 13, Graphics Exception: ChID 0018, Class 0000c197, Offset 00002390, Data 00000000
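
For anyone comparing their own logs against these, here is a minimal Python sketch that tallies Xid codes per PCI device. The regular expression is an assumption based only on the line format shown above, and `tally_xids` and `SAMPLE` are illustrative names, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log excerpt above:
#   "... NVRM: Xid (PCI:0000:08:00): 13, Graphics Exception: ..."
XID_LINE = re.compile(
    r"NVRM: Xid \((?P<device>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+),"
)


def tally_xids(lines):
    """Count how often each (device, Xid code) pair appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = XID_LINE.search(line)
        if match:
            counts[(match.group("device"), int(match.group("code")))] += 1
    return counts


# The five lines quoted in this post, used as sample input.
SAMPLE = """\
Jul 25 16:58:42 hyperion kernel: NVRM: Xid (PCI:0000:08:00): 8, Channel 00000018
Jul 25 16:58:42 hyperion kernel: NVRM: Xid (PCI:0000:08:00): 32, Channel ID 00000018 intr 00040000
Jul 25 16:58:49 hyperion kernel: NVRM: Xid (PCI:0000:08:00): 13, Graphics Exception: EXTRA_MACRO_DATA
Jul 25 16:58:49 hyperion kernel: NVRM: Xid (PCI:0000:08:00): 13, Graphics Exception: ESR 0x404490=0x80000002
Jul 25 16:58:49 hyperion kernel: NVRM: Xid (PCI:0000:08:00): 13, Graphics Exception: ChID 0018, Class 0000c197, Offset 00002390, Data 00000000
""".splitlines()

if __name__ == "__main__":
    for (device, code), count in sorted(tally_xids(SAMPLE).items()):
        print(f"{device}  Xid {code:>3}  x{count}")
```

The same function should work on a full kernel log (for example the output of `journalctl -k` saved to a file and read line by line), as long as the lines keep this format.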

I’m running Arch Linux (kernel 4.17.9) with NVIDIA driver version 396.24. The issue seems to affect all games I play. At first I thought it was a Wine issue since all the games were being played with Wine, but I get the same issue running a Unigine benchmark. I tried running cuda-memtest, but it reports no errors even while an application is misbehaving. nvidia-smi doesn’t report any errors either, and the card runs at about 60C both when it’s fine and when it’s acting up.

nvidia-bug-report.log.gz (105 KB)

Is the issue related to suspend/resume? I have an issue where suspending/resuming breaks the performance until a clean reboot (see https://devtalk.nvidia.com/default/topic/1027201/linux/linux-suspend-problem/2). When you get a “good boot”, is it OK until you suspend?

Unfortunately no. This is a desktop PC and I don’t typically utilize the suspend option. Sometimes the issue happens immediately after a fresh boot.

You’re getting XIDs 8, 12, 13, 31, 32, 69 and maybe others. This points to a hardware issue.

  • check for the Ryzen bug
  • check the PSU
  • check/remove memory
  • reseat the graphics card, check power connectors, change slot
  • check the graphics card in another system

It had crossed my mind that it could be a hardware issue before. Prior to making this post, I had already tried reseating the video card and memory. I have had this issue with the card in several different PCI slots. Unfortunately I do not have another PC to test the card in. Are there any PSU diagnostics that could be done without having another one spare?

I hadn’t heard of the Ryzen bug before, but after some Googling, I found the kill-ryzen script. I’ve let it run for upwards of 10 minutes, though, and it does not produce a segfault.

I can’t think of a software test to check the PSU. Also, a faulty PSU is rather unlikely: that would often lead to an XID 79, which you didn’t encounter. Looking at the collection of XIDs you’re getting and their meaning, I would suspect the GPU. So you’d better check the card in another system, and hope it’s still under warranty if it fails.

Given the number of things I’ve tried OS-wise, I agree that it seems like it could be a hardware issue. Are there any more utilities or tests I can use to determine for sure whether the GPU is faulty? I don’t want to spend hundreds of dollars on a new card just to find out that it wasn’t the source of the problem…

UPDATE in case anyone finds this thread in Google in the future and is banging their head against a wall like I was.

I think I might have finally narrowed down the problem to bad memory. I had tried different DIMMs and reseating my memory modules multiple times, but I just now tried running with one stick of RAM at a time (I have 2x 8GB sticks). The first stick I tried was terrible. As soon as I opened any game things went to hell immediately with XID errors, terrible FPS, and system lockups. With the other stick, though, I’ve had zero XID errors so far and my games are running solid at 60 FPS.

I’ll update the thread if the issue crops up again but I’m pretty confident this was the problem, even though my system only had issues with gaming and the only errors I was getting were from the GPU. I’m not very familiar with memory addressing and where things are physically stored, but I’m assuming my “good boots” were just me getting lucky with the OS not trying to utilize the portions of memory that were corrupted.
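
For anyone who wants a rough first check from inside the running OS before pulling sticks, here is a minimal Python sketch of a fill-and-verify pattern test. It is only an illustration: `pattern_test` and its parameters are made-up names, a userspace program cannot choose which physical addresses it gets, and CPU caches sit between the program and the DIMMs. A clean result therefore proves very little. Running one stick at a time, as described above, or a boot-time tester such as Memtest86+ is far more telling.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each pattern byte, read it back, and list mismatches.

    Returns a list of (offset, expected_byte, actual_byte) tuples.
    An empty list means every byte read back as written.
    """
    buf = bytearray(size_bytes)
    mismatches = []
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buf[:] = expected  # write the pattern over the whole buffer
        if buf != expected:  # fast whole-buffer comparison first
            mismatches.extend(
                (offset, pattern, actual)
                for offset, actual in enumerate(buf)
                if actual != pattern
            )
    return mismatches


if __name__ == "__main__":
    # 64 MiB is an arbitrary size for this sketch; note that the function
    # holds roughly twice the buffer size in memory while it runs.
    bad = pattern_test(64 * 1024 * 1024)
    print("mismatches:", len(bad))
```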