GeForce RTX 3060 Ti repeatedly falls off bus

System is an Intel Quartz Canyon NUC (NUC9VXQNX), with ECC memory and an Asus RTX 3060 Ti. Running Debian (sid/bullseye).

I’ve owned it for a couple of months and have had a number of crashes, which on investigation have all been pretty much identical. An example of the logged errors:

May  5 02:54:21 mordor kernel: [527811.962976] NVRM: GPU at PCI:0000:01:00: GPU-8ece01e1-ec62-3b99-2717-75ddf21aaee1
May  5 02:54:21 mordor kernel: [527811.962979] NVRM: GPU Board Serial Number:
May  5 02:54:21 mordor kernel: [527811.963004] NVRM: Xid (PCI:0000:01:00): 79, pid=0, GPU has fallen off the bus.
May  5 02:54:21 mordor kernel: [527811.963007] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
May  5 02:54:21 mordor kernel: [527811.963008] NVRM: GPU 0000:01:00.0: GPU is on Board .
May  5 02:54:21 mordor kernel: [527811.963026] NVRM: A GPU crash dump has been created. If possible, please run
May  5 02:54:21 mordor kernel: [527811.963026] NVRM: nvidia-bug-report.sh as root to collect this data before
May  5 02:54:21 mordor kernel: [527811.963026] NVRM: the NVIDIA kernel module is unloaded.
May  5 02:54:48 mordor kernel: [527839.014516] show_signal_msg: 9 callbacks suppressed
May  5 02:54:48 mordor kernel: [527839.014518] GpuWatchdog[140092]: segfault at 0 ip 00007f32ec34abdd sp 00007f32e0e3a4d0 error 6 in libcef.so[7f32e85c4000+69a4000]
May  5 02:54:48 mordor kernel: [527839.014551] Code: 00 79 09 48 8b 7d a0 e8 61 80 c1 02 41 8b 85 00 01 00 00 85 c0 0f 84 ab 00 00 00 49 8b 45 00 4c 89 ef be 01 00 00 00 ff 50 58 <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 01 a6 37 03 01 80 bd 7f ff
May  5 02:57:22 mordor assert_20210505025722_35.dmp[145677]: Uploading dump (out-of-process)#012/tmp/dumps/assert_20210505025722_35.dmp
May  5 02:57:24 mordor assert_20210505025722_35.dmp[145677]: Finished uploading minidump (out-of-process): success = yes
May  5 02:57:24 mordor assert_20210505025722_35.dmp[145677]: response: CrashID=bp-3e1e7828-aa4b-4c71-9925-6655c2210504
May  5 02:57:24 mordor assert_20210505025722_35.dmp[145677]: file ''/tmp/dumps/assert_20210505025722_35.dmp'', upload yes: ''CrashID=bp-3e1e7828-aa4b-4c71-9925-6655c2210504''

When this happens, the system remains accessible over the network, but I have to reboot it to get the video back. Running nvidia-smi at that point shows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 306...  On   | 00000000:01:00.0  On |                  N/A |
| 54%   49C    P5    24W / 200W |   2953MiB /  7979MiB |     39%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1285      G   /usr/lib/xorg/Xorg                332MiB |
|    0   N/A  N/A      2129      G   /usr/bin/enlightenment            164MiB |
|    0   N/A  N/A      2617      G   ./factorio/bin/x64/factorio      1958MiB |
|    0   N/A  N/A      2985      G   ...gAAAAAAAAA --shared-files      493MiB |
+-----------------------------------------------------------------------------+

This has been happening with multiple driver versions (it’s taken me a little while to narrow things down, and the drivers were updated before I got round to debugging).

The system can be stable for up to about a week, and then (in the worst case) has crashed 11 times in one evening. It seems to particularly dislike Wednesdays … (I really hope that’s coincidental, otherwise I fear for my sanity).

After seeing a particular forum post I tried booting with pcie_aspm=off, but gave up on that quickly as it rendered the Thunderbolt ports (which I’m using) useless.
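
For reference, the change was roughly the following (a sketch assuming the stock Debian GRUB setup; the only part taken from that forum post is the pcie_aspm=off parameter itself):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off"

# regenerate the GRUB config and reboot for it to take effect
sudo update-grub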

I’ve spoken to the system vendor, as a hardware fault seems possible - they’ve a) not got any 3060 cards in stock (surprise) and b) asked me to try stress testing it (memtest86, Unigine Heaven, gpuburn), which I will do shortly but haven’t yet. The crashes do not seem to correlate with high load, though - I’ve played 3D games for hours with no problems, and then had it fall over under almost no load.
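
For the GPU side of the stress testing, the rough plan is something like this (a sketch, assuming the CUDA toolkit is available to build against; gpu_burn here means the wilicc/gpu-burn tool, and the one-hour duration is just my choice):

# build and run gpu-burn against the card for an hour
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn && make
./gpu_burn 3600

# in another terminal, watch power draw and temperature while it runs
nvidia-smi dmon -s pu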

Any help/pointers much appreciated.

nvidia-bug-report.log.gz (160.3 KB)

The file is a bug report generated by logging into the machine remotely and running nvidia-bug-report.sh before rebooting.

To give a sense of how frequently this can happen (there have also been multiple days with no crashes at all), here’s one day from last week:

syslog:Apr 28 06:22:04 mordor kernel: [713481.737986] NVRM: Xid (PCI:0000:01:00): 79, pid=1260, GPU has fallen off the bus.
syslog:Apr 28 06:22:04 mordor kernel: [713481.737988] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
syslog:Apr 28 20:19:24 mordor kernel: [35683.102219] NVRM: Xid (PCI:0000:01:00): 79, pid=0, GPU has fallen off the bus.
syslog:Apr 28 20:19:24 mordor kernel: [35683.102220] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
syslog:Apr 28 20:59:47 mordor kernel: [ 2384.798466] NVRM: Xid (PCI:0000:01:00): 79, pid=1439, GPU has fallen off the bus.
syslog:Apr 28 20:59:47 mordor kernel: [ 2384.798473] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
syslog:Apr 28 21:17:10 mordor kernel: [  241.761835] NVRM: Xid (PCI:0000:01:00): 79, pid=1530, GPU has fallen off the bus.
syslog:Apr 28 21:17:10 mordor kernel: [  241.761837] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
syslog:Apr 28 21:49:45 mordor kernel: [ 1340.696519] NVRM: Xid (PCI:0000:01:00): 79, pid=2471, GPU has fallen off the bus.
syslog:Apr 28 21:49:45 mordor kernel: [ 1340.696520] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
syslog:Apr 28 22:04:07 mordor kernel: [  827.258485] NVRM: Xid (PCI:0000:01:00): 79, pid=0, GPU has fallen off the bus.
syslog:Apr 28 22:04:07 mordor kernel: [  827.258487] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
syslog:Apr 28 22:34:03 mordor kernel: [ 1745.475778] NVRM: Xid (PCI:0000:01:00): 79, pid=0, GPU has fallen off the bus.
syslog:Apr 28 22:34:03 mordor kernel: [ 1745.475781] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
syslog:Apr 28 22:46:24 mordor kernel: [  547.193425] NVRM: Xid (PCI:0000:01:00): 79, pid=0, GPU has fallen off the bus.
syslog:Apr 28 22:46:24 mordor kernel: [  547.193451] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
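
(Those lines were pulled out with something along the lines of the command below; the exact file names depend on how the logs are rotated here.)

grep "fallen off the bus" /var/log/syslog*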

Most often, Xid 79 is caused by overheating or insufficient power. Your description, though, rather sounds like a failing GPU. Have you already tried reseating the card in its slot, and also reseating (possibly swapping) the power connectors?

I have not - I’ll try reseating the card and running stress tests this weekend. A reproducible crash case would help, especially if I can make it die repeatedly under Windows (or even Ubuntu): the system was bought with no OS, and the vendor is twitchy about Debian support - which I sort of understand, though it seems pretty clear that this is not an OS-level problem.

I haven’t got detailed logging of power/thermals, but the PSU in this system is the Intel-supplied one and the GPU is on Intel’s qualified components list, so if it were a power problem it would have to be a faulty PSU rather than an undersized one - and were that the case I’d expect the issues to correlate more strongly with GPU load, which they don’t.
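
For completeness, capturing power/thermals would probably just be something like the following (a sketch using nvidia-smi’s query mode; the field list, five-second interval and output file are arbitrary choices on my part):

nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,utilization.gpu,clocks.sm \
           --format=csv -l 5 >> ~/gpu-telemetry.csv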

It died on me a lot last night, and as I remarked ironically to a friend that it disliked Wednesdays, I was reminded that I tend to play a multiplayer Factorio game over the network on Wednesday evenings, which allocates a couple of GB of GPU memory and uses a large map. I wonder if that could explain the pattern. That game does seem loosely correlated with a higher rate of crashes, and it has a gfx-safe mode, so more experimentation is in order :-(.