System is an Intel Quartz Canyon NUC (NUC9VXQNX), with ECC memory and an Asus RTX 3060 Ti. Running Debian (sid/bullseye).
I’ve owned it for a couple of months and have had a number of crashes, which on investigation have all been pretty much identical. An example of the logged errors:
May 5 02:54:21 mordor kernel: [527811.962976] NVRM: GPU at PCI:0000:01:00: GPU-8ece01e1-ec62-3b99-2717-75ddf21aaee1
May 5 02:54:21 mordor kernel: [527811.962979] NVRM: GPU Board Serial Number:
May 5 02:54:21 mordor kernel: [527811.963004] NVRM: Xid (PCI:0000:01:00): 79, pid=0, GPU has fallen off the bus.
May 5 02:54:21 mordor kernel: [527811.963007] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
May 5 02:54:21 mordor kernel: [527811.963008] NVRM: GPU 0000:01:00.0: GPU is on Board .
May 5 02:54:21 mordor kernel: [527811.963026] NVRM: A GPU crash dump has been created. If possible, please run
May 5 02:54:21 mordor kernel: [527811.963026] NVRM: nvidia-bug-report.sh as root to collect this data before
May 5 02:54:21 mordor kernel: [527811.963026] NVRM: the NVIDIA kernel module is unloaded.
May 5 02:54:48 mordor kernel: [527839.014516] show_signal_msg: 9 callbacks suppressed
May 5 02:54:48 mordor kernel: [527839.014518] GpuWatchdog[140092]: segfault at 0 ip 00007f32ec34abdd sp 00007f32e0e3a4d0 error 6 in libcef.so[7f32e85c4000+69a4000]
May 5 02:54:48 mordor kernel: [527839.014551] Code: 00 79 09 48 8b 7d a0 e8 61 80 c1 02 41 8b 85 00 01 00 00 85 c0 0f 84 ab 00 00 00 49 8b 45 00 4c 89 ef be 01 00 00 00 ff 50 58 <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 01 a6 37 03 01 80 bd 7f ff
May 5 02:57:22 mordor assert_20210505025722_35.dmp[145677]: Uploading dump (out-of-process)#012/tmp/dumps/assert_20210505025722_35.dmp
May 5 02:57:24 mordor assert_20210505025722_35.dmp[145677]: Finished uploading minidump (out-of-process): success = yes
May 5 02:57:24 mordor assert_20210505025722_35.dmp[145677]: response: CrashID=bp-3e1e7828-aa4b-4c71-9925-6655c2210504
May 5 02:57:24 mordor assert_20210505025722_35.dmp[145677]: file ''/tmp/dumps/assert_20210505025722_35.dmp'', upload yes: ''CrashID=bp-3e1e7828-aa4b-4c71-9925-6655c2210504''
When this happens, the system remains accessible over the network, but I have to reboot it to get the video back. Running nvidia-smi shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 306...  On   | 00000000:01:00.0  On |                  N/A |
| 54%   49C    P5    24W / 200W |   2953MiB /  7979MiB |     39%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1285      G   /usr/lib/xorg/Xorg                332MiB |
|    0   N/A  N/A      2129      G   /usr/bin/enlightenment            164MiB |
|    0   N/A  N/A      2617      G   ./factorio/bin/x64/factorio      1958MiB |
|    0   N/A  N/A      2985      G   ...gAAAAAAAAA --shared-files      493MiB |
+-----------------------------------------------------------------------------+
This has been happening with multiple driver versions (it’s taken me a little time to narrow it down, and the drivers got updated before I got round to debugging).
The system can be stable for up to about a week, and then, in the worst case, it has crashed 11 times in one evening. It seems to particularly dislike Wednesdays … (I really hope that’s coincidental, otherwise I fear for my sanity).
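To check whether the weekday pattern is real rather than an impression, the Xid events could be tallied straight from the syslog. The following is only a rough sketch: the function name `count_xid_by_weekday` is made up for illustration, the year has to be supplied by hand because these syslog timestamps don't carry one, and it assumes the log lines look like the excerpt above and that month names are in English.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
#   May 5 02:54:21 mordor kernel: [...] NVRM: Xid (PCI:0000:01:00): 79, pid=0, ...
XID_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r".*NVRM: Xid \((?P<dev>[^)]+)\): (?P<xid>\d+),"
)


def count_xid_by_weekday(lines, year=2021, xid=79):
    """Count log lines reporting the given Xid code, grouped by weekday name.

    `year` is an assumption supplied by the caller, since the syslog
    timestamp format used here omits it.
    """
    counts = Counter()
    for line in lines:
        m = XID_RE.match(line)
        if m is None or int(m.group("xid")) != xid:
            continue
        # Normalise padding ("May  5" vs "May 5") before parsing.
        stamp = " ".join(m.group("stamp").split())
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        counts[when.strftime("%A")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "May 5 02:54:21 mordor kernel: [527811.963004] NVRM: Xid "
        "(PCI:0000:01:00): 79, pid=0, GPU has fallen off the bus.",
    ]
    print(count_xid_by_weekday(sample))
```

In practice the `lines` argument would be an open log file (for example whichever file the kernel messages end up in on the system), and rotated logs would need to be fed in as well to cover a couple of months.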
After seeing a particular forum post I tried booting with pcie_aspm=off, but gave up on that quickly as it rendered the Thunderbolt ports (which I’m using) useless.
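For anyone wanting to see which ASPM policy the kernel is actually running with before or after changing boot parameters, something like the sketch below might help. It assumes the policy is exposed at `/sys/module/pcie_aspm/parameters/policy` with the active entry in square brackets (e.g. `[default] performance powersave`); the helper names are made up for illustration, and the file may be absent depending on kernel configuration.

```python
from pathlib import Path

# Assumed sysfs location of the kernel's PCIe ASPM policy.
ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def parse_aspm_policy(text):
    """Return (active, available) from text like '[default] performance powersave'.

    The active policy is the token wrapped in square brackets; `active`
    is None if no token is bracketed.
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def read_aspm_policy(path=ASPM_POLICY_PATH):
    """Read and parse the policy file; returns (None, []) if it cannot be read."""
    try:
        return parse_aspm_policy(path.read_text())
    except OSError:
        return None, []


if __name__ == "__main__":
    active, available = read_aspm_policy()
    print("active ASPM policy:", active)
    print("available policies:", available)
```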
I’ve spoken to the system vendor, as a hardware fault seems possible - they’ve a) not got any 3060 cards in stock (surprise) and b) asked me to try stress testing it (memtest86, Unigine Heaven, gpuburn), which I will do shortly, but haven’t yet. The crashes do not seem to correlate with high stress though - I’ve played 3D games for hours with no problems, and then had it fall over with almost no load.
Any help/pointers much appreciated.