Hard lockups after exiting Team Fortress 2 (Xid: 62)

Hi,

My machine often (~15% of the time) locks up with a blank black screen a few seconds (1-3 s) after exiting Team Fortress 2 on Linux. The lockup is hard (only a hard reset recovers from it), and the machine does not respond to pings while it is locked up.

Rarely, it also locks up in the game menus, but never during gameplay itself. It does not lock up during general PC usage. I haven’t tried other games much, but the lockup only seems to be tied to TF2 (I have never had a lockup when running a different OpenGL application).

The machine runs current Arch Linux (kernel 4.0.7-2-ARCH, nvidia driver 352.21-2), but I have had this problem for about half a year already. The GPU is a GTX 770 and the CPU is a Core i5-2550K.

I’ve tried monitoring the GPU temperature, and it does not go above 70 °C.
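For anyone who wants to repeat the temperature check, here is a minimal sketch of one way to do it. It assumes `nvidia-smi` is on the PATH and polls it once per second, syncing every sample to disk so the last reading survives a hard lockup. The function names and the log file name are made up for illustration.

```python
import os
import subprocess
import time


def parse_temperatures(output):
    """Parse one integer temperature (deg C) per non-empty line of output."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def read_gpu_temperatures():
    """Ask nvidia-smi for the current temperature of every GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_temperatures(out)


def log_temperatures(path, interval_s=1.0):
    """Append timestamped readings to `path` until interrupted."""
    with open(path, "a") as f:
        while True:
            temps = read_gpu_temperatures()
            f.write("%.1f %s\n" % (time.time(), " ".join(map(str, temps))))
            # Flush and fsync so the sample is on disk before a possible lockup.
            f.flush()
            os.fsync(f.fileno())
            time.sleep(interval_s)


if __name__ == "__main__":
    log_temperatures("gpu-temp.log")  # hypothetical log file name
```

After a lockup and reboot, the last lines of the log show how hot the GPU was just before the machine froze.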

I set up remote dmesg logging via netconsole to try to catch the problem, and this is what it captured:

First time:

[19369.835521] NVRM: GPU at PCI:0000:01:00: GPU-a74182f0-9cfb-d057-7254-b7a624fe7d97
[19369.835545] NVRM: Xid (PCI:0000:01:00): 62, b2dc(5204) 00000000 00000000
[19378.048030] NVRM: GPU at 0000:01:00.0 has fallen off the bus.

Second time:

[ 4107.394888] NVRM: Xid (PCI:0000:01:00): 62, b2dc(5204) 00000000 00000000

Edit:
Caught another case. This time it was in the TF2 main menu, after selecting a game (right when the loading screen should appear, I got the blank screen instead). I left it running for as long as needed, and eventually the machine rebooted on its own.

[20000.832886] NVRM: GPU at PCI:0000:01:00: GPU-a74182f0-9cfb-d057-7254-b7a624fe7d97
[20000.832899] NVRM: Xid (PCI:0000:01:00): 62, b2dc(5204) 00000000 00000000
[20013.503448] hrtimer: interrupt took 94838960 ns
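To pull these events out of a longer netconsole capture automatically, a small Python sketch like the following may help. The regular expression is inferred only from the log lines quoted in this post, so treat it as an assumption rather than a documented NVRM log format. The function name is made up for illustration.

```python
import re

# Pattern inferred from the lines in this thread (an assumption):
# [timestamp] NVRM: Xid (PCI:...): <code>, <detail>
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid "
    r"\((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<detail>.*)"
)


def find_xid_events(log_text):
    """Return (timestamp_s, pci_address, xid_code, detail) for each Xid line."""
    events = []
    for line in log_text.splitlines():
        m = XID_RE.search(line)
        if m:
            events.append((
                float(m.group("ts")),
                m.group("pci"),
                int(m.group("xid")),
                m.group("detail").strip(),
            ))
    return events


if __name__ == "__main__":
    sample = (
        "[19369.835545] NVRM: Xid (PCI:0000:01:00): 62, b2dc(5204) 00000000 00000000\n"
        "[19378.048030] NVRM: GPU at 0000:01:00.0 has fallen off the bus.\n"
    )
    for event in find_xid_events(sample):
        print(event)
```

Only the `Xid` lines are matched. Other NVRM messages, such as the "fallen off the bus" line, are skipped.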

I’ve generated the nvidia-bug-report.log.gz as instructed:

nvidia-bug-report.log.gz (203 KB)

I remember a thread where a guy apparently fixed the “GPU has fallen off the bus.” issue by doing the old SNES cartridge trick: clean the PCIe slot, put the card back in, and check that it’s really seated. Might work.

Just did that yesterday, and it still happens. I’ll try it once again, though. What is suspicious is that it always happens after exiting TF2, which is too predictable to look like a hardware-only issue. I never get the problem while playing the game; the PC is stable even through hours of gameplay, and the same goes for browsing and so on, with no issues at all. It only happens when exiting the game and sometimes when the loading screen should start.

Well, what do you know. I moved the GPU to the other PCIe slot and the problem seems to have vanished. So it was a case of a dusty slot after all. Thanks for the tip.

Edit: I spoke too soon… It still happens, so the slot is not the problem. It is either hardware (GPU or CPU) or software. I’ll try to investigate further.

https://www.reddit.com/r/tf2/comments/3ex11g/tf2_disabling_my_980_on_close/ Apparently I am not the only person suffering from a similar issue; this time it is on a Windows 10 machine.

We will try to repro this issue internally. We are tracking this under bug 200127609.

Volca, we are not able to repro this issue internally. Team Fortress 2 does appear to be crashing on my system with the script I built, but the error does not appear to be related to the NVIDIA driver. Team Fortress 2 appears to have exited and is not stuck; X is still usable and no Xid errors are reported, so I’m definitely not seeing the same issue as you.

Hey, thanks for taking the time to look at the issue. Also, sorry for not getting back here sooner.

I upgraded to 352.41 a few days ago (my log says it was 2015-08-30).

So far, I have had no lockups since then. TF2 also got some patches in the meantime, so it is hard to determine what fixed the issue.

Either way, I now have a rock-solid PC, and Team Fortress 2 hasn’t frozen so far.

Thanks volca for this information.