Crash with error code XID 62 for GTX 1070ti

Card: GeForce GTX 1070 Ti/PCIe/SSE2
Driver: 4.6.0 NVIDIA 440.82

Linux 5.7.4-arch1-1 #1 SMP PREEMPT Thu, 18 Jun 2020 16:01:07 +0000 x86_64 GNU/Linux

Running simple graphical benchmarks like glmark or games causes the system to freeze and artifact with an XID 62 code.
Have tried reseating the card and swapping the PCI-E slot but no luck so far. It seems to run OK with an older 750 Ti in the same slot, so I'm out of ideas. Any assistance would be appreciated.

Thanks.

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post. You will have to rename the file ending to something else since the forum software doesn’t accept .gz files (nifty!).

Attached bug report with extra .txt extension
nvidia-bug-report.log.gz.txt (226.6 KB)

Got the error message from dmesg when it crashes:
[ 663.371104] NVRM: GPU at PCI:0000:01:00: GPU-bed765b9-eb6c-b537-15da-656f152030a3
[ 663.371105] NVRM: GPU Board Serial Number:
[ 663.371108] NVRM: Xid (PCI:0000:01:00): 62, pid=670, 0a97(2b84) 00000000 00000000
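Since the crashes are intermittent, it can help to pull every Xid event out of the kernel log and count them per code. Below is a minimal Python sketch, assuming the log lines follow the same layout as the dmesg excerpt above. The function names `parse_xid_lines` and `count_by_xid` are made up for illustration and are not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Matches kernel log lines shaped like the one quoted above, e.g.
# [ 663.371108] NVRM: Xid (PCI:0000:01:00): 62, pid=670, 0a97(2b84) 00000000 00000000
# Assumption: other driver versions may format this line differently.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<detail>.*)"
)


def parse_xid_lines(log_text):
    """Return one dict per Xid line found in the given log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                {
                    "pci": match.group("pci"),
                    "xid": int(match.group("xid")),
                    "detail": match.group("detail").strip(),
                }
            )
    return events


def count_by_xid(events):
    """Count how often each Xid code appears in the parsed events."""
    return Counter(event["xid"] for event in events)
```

To use it, save the output of `dmesg` to a file, read the file into a string, and pass that string to `parse_xid_lines`. Lines that are not Xid reports are skipped.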

I guess you should check the video memory of the 1070 using cudamemtest:
https://github.com/ComputationalRadiationPhysics/cuda_memtest

Thanks generix. Had a few crashes between installing and running, but managed a full successful pass of the 10 cudamemtest tests. The problem seems to be intermittent, but it causes lockups and freezes, sometimes blanking the screen. Not sure what else I can try here, as a different card in the same machine seems to function OK.

Please check for a driver regression by downgrading to the 390 driver. Furthermore, you could run gpuburn headless to check the gpu.

Downgraded to the 390.132 driver. Still crashes and artifacts pretty badly sometimes. Running out of ideas now. Seems like it could be bad hardware at this point?

Yes, bad HW, but which part? Either the gpu or a flaky psu. The XIDs you were getting would rather point to bad video memory, but cuda_memtest didn’t bring up anything.
If possible, you should check if the gtx1070 works in another system/mainboard.

I did try it in another PC for a few hours without any crashing. The crashes on my machine are sometimes infrequent and sometimes very frequent. Still unsure which part it is at the moment. I do have an older 750 Ti which works in my PC on the same driver etc. without crashing. It could still be the power, as the 750 Ti only uses 1x 6-pin vs 2x 8-pin for the 1070 Ti.

I think you can rule out the gpu, otherwise it would be crashing in the other system, too. The mainboard should be fine as well: it’s from the same time as Pascal gpus and has the latest bios flashed. Flaky RAM would crash the 750 Ti as well. So my next guess would be the psu. Pascal gen gpus raised the requirements regarding power quality, i.e. ripple. Kepler/Maxwell gpus would run fine with some PSUs while Pascal gpus would crash with the same ones. Normally, this would result in a very clear XID 79 but as always, the world is not only black and white.

That’s a relatively inexpensive replacement I can try - hopefully will improve matters with an alternative PSU. Will update after I can test this.

Appreciate the continued help. Thanks again.

Tried with another PSU with no success. Finally managed to get it to reproduce in another computer with crashing/artifacting running some gpu benchmarks from the manufacturer of the card. Think it’s pretty conclusive the card is faulty which is a shame as it was a replacement for another faulty card. Just don’t seem to be having much luck! Thanks for the help @generix