440.82, 1660 Ti. Intermittent hangs and freezes during normal desktop operations, [Xid 32, 13, 69, 12]

Hi there,

My computer intermittently hangs while using Chromium and various desktop applications (e.g. GIMP).

Attached: nvidia-bug-report.log.gz.removeme.log (785.8 KB)

During the hangs, dmesg fills up with NVRM errors, mostly Xid 32:
NVRM: Xid (PCI:0000:01:00): 32, pid=205057, Channel ID 00000020 intr0 00040000

but sometimes more exotic ones, such as:
Xid (PCI:0000:01:00): 13, pid=205178, Graphics Exception: Class 0x40 Subchannel 0x2 Mismatch

… and similar.
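To see which Xid codes dominate during a hang, the dmesg output can be tallied with a few lines of Python. This is only a rough sketch based on the two line formats quoted above; the function names `parse_xid_lines` and `count_xid_codes` are made up for illustration, and other driver versions may print Xid lines in a different shape (for example without a numeric pid), which the regular expression tries to tolerate.

```python
import re
from collections import Counter

# Assumed format, taken from the two lines quoted in this post:
#   NVRM: Xid (PCI:0000:01:00): 32, pid=205057, Channel ID ...
#   Xid (PCI:0000:01:00): 13, pid=205178, Graphics Exception: ...
# The pid part is optional because not every Xid line may carry one.
XID_RE = re.compile(
    r"Xid \((?P<bus>PCI:[0-9A-Fa-f:.]+)\): (?P<code>\d+)(?:, pid=(?P<pid>\d+))?"
)


def parse_xid_lines(lines):
    """Return (bus, xid_code, pid_or_None) for every line containing an Xid event."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match is None:
            continue  # not an Xid line
        pid = int(match.group("pid")) if match.group("pid") else None
        events.append((match.group("bus"), int(match.group("code")), pid))
    return events


def count_xid_codes(lines):
    """Count how often each Xid code appears in the given log lines."""
    return Counter(code for _bus, code, _pid in parse_xid_lines(lines))
```

For example, after saving `dmesg` output to a file (the name `dmesg.txt` is hypothetical), `count_xid_codes(open("dmesg.txt"))` returns a counter keyed by Xid code, which makes it easy to see whether Xid 32 really is the main offender or just the most visible one.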

Various info:
CPU model name : Intel® Core™ i7-8700 CPU @ 3.20GHz

lspci output:
01:00.0 VGA compatible controller: NVIDIA Corporation TU116 [GeForce GTX 1660 Ti] (rev a1)
Subsystem: PC Partner Limited / Sapphire Technology TU116 [GeForce GTX 1660 Ti]
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia


nvidia-smi output:
Wed May 27 16:40:09 2020
| NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2 |

Xid 32 points to defective or incompatible system memory.
Please check, swap, or replace the memory modules.

I just ran memtest86 for a few days on this computer, and it reported no errors. I don’t have any insight beyond that into modern DRAM interfaces, but I find the errors puzzling from a hardware engineer’s point of view.

When you say “incompatible”, what does this actually mean? Are there DRAM modules whose timing characteristics somehow don’t agree with the DMA controller, or something along those lines? Any details that would help with debugging are much appreciated.

As an aside, I have tried writing a CUDA program that copies data from CPU memory to GPU memory and back again in a loop, but I can’t seem to trigger this error when I run it. I assume the driver uses the DMA controller to do the transfer, but I can’t get any further insight into the workings of the driver without the source code.

‘Incompatible’ with regard to the mainboard, the BIOS, or other memory modules (e.g. mixing modules from different vendors with different timings/clocks).
I find memtest86 rather unreliable; it is better to remove the memory modules, run with only one at a time, and see whether you can reproduce the error.