Whenever I try to start an application that uses 3D graphics of moderate intensity, the GPU crashes and falls off the bus. It typically won’t happen just using the Unity desktop or the Firefox browser, but if I try to start a benchmark like Unigine Valley, it crashes while loading. The logs are from Ubuntu 21.10 and driver 470.74.
If I turn off ASPM with “pcie_aspm=off” as a kernel command line parameter, the crashes become intermittent instead of certain. I tried the latest driver, 495.44, but it was completely unable to initialize the graphics card and said RMInitAdapter failed.
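One way to confirm that the parameter actually took effect after a reboot is to inspect the running kernel's command line and the active ASPM policy. This is a minimal Python sketch, assuming a Linux system that exposes /proc/cmdline and /sys/module/pcie_aspm/parameters/policy; the helper names are made up for illustration.

```python
from pathlib import Path
from typing import Optional


def has_kernel_param(cmdline: str, param: str) -> bool:
    """Return True if `param` appears as a whole token in a kernel command line."""
    return param in cmdline.split()


def active_aspm_policy(policy_text: str) -> Optional[str]:
    """Return the bracketed (currently selected) entry of the ASPM policy list."""
    for token in policy_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


# Paths are assumptions about a typical Linux install; skip quietly if absent.
cmdline_file = Path("/proc/cmdline")
policy_file = Path("/sys/module/pcie_aspm/parameters/policy")

if cmdline_file.exists():
    cmdline = cmdline_file.read_text()
    print("pcie_aspm=off present:", has_kernel_param(cmdline, "pcie_aspm=off"))
if policy_file.exists():
    print("active ASPM policy:", active_aspm_policy(policy_file.read_text()))
```

If the first line prints False, the boot loader configuration probably was not regenerated after editing it.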
Curiously, the GPU runs fine for hours at a time under gpu_burn, which stresses it with CUDA. This suggests to me that it is not a power or memory issue. Any ideas?
You’re getting an Xid 79 (GPU has fallen off the bus), as you already noticed. The most common causes are overheating or insufficient power. Since gpu-burn doesn’t provoke it, I guess temperatures and sustained power draw are fine, so I’m leaning towards power peaks being the reason. Please reseat the power connectors and the card in its slot, try different power connectors, and check or replace the PSU. What PSU model is built into the system?
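To see how often the Xid is reported and against which device, the kernel log can be scanned for NVRM Xid lines. This is a rough Python sketch, assuming the usual "NVRM: Xid (PCI:...): <code>, ..." message shape; the sample lines are made up to show the format and are not taken from the logs in this thread.

```python
import re
from collections import Counter

# Assumed message shape: "NVRM: Xid (PCI:0000:01:00): 79, ..."
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9A-Fa-f:.]+)\): (\d+)")


def count_xids(log_text):
    """Count (pci_address, xid_code) pairs found in kernel log text."""
    counts = Counter()
    for match in XID_RE.finditer(log_text):
        counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Made-up sample lines, only illustrating the format.
SAMPLE = """\
[  812.101] NVRM: Xid (PCI:0000:01:00): 79, pid=2211, GPU has fallen off the bus.
[  812.102] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[  950.440] NVRM: Xid (PCI:0000:01:00): 79, pid=2300, GPU has fallen off the bus.
"""

for (device, code), n in sorted(count_xids(SAMPLE).items()):
    print(f"{device} Xid {code}: {n} event(s)")
```

In practice the text would come from `dmesg` or `journalctl -k` instead of the sample string.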
The PC is an industrial computer with a DC-DC converter for the graphics card. I reseated everything and the behavior did not change. I hooked my oscilloscope up to the graphics card with current and voltage probes and observed that the voltage on the PCIe power connector on the GPU board was rock solid even with high peaks in current consumption. I also observed that the crashes are not correlated with peaks.
I think the PC has a motherboard issue which is causing PCIe communication problems. There are lots of correctable PCIe errors in the kernel logs. I tried the card in another computer and it seemed to work totally fine.
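To put a number on "lots", the corrected AER reports can be tallied per reporting PCIe port. This is a small Python sketch, assuming the message shape printed by recent kernels ("<driver> <address>: AER: Corrected error received: ..."); the sample lines and addresses are invented for illustration.

```python
import re
from collections import Counter

# Assumed message shape on recent kernels:
#   "pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0"
AER_RE = re.compile(
    r"([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): AER: Corrected error received"
)


def corrected_errors_by_port(log_text):
    """Count corrected-error reports per reporting PCIe device address."""
    counts = Counter()
    for match in AER_RE.finditer(log_text):
        counts[match.group(1)] += 1
    return counts


# Invented sample lines, only illustrating the format.
SAMPLE = """\
[   41.200] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
[   41.201] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[   43.900] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
[   57.010] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:00:1c.4
"""

for address, n in corrected_errors_by_port(SAMPLE).most_common():
    print(f"{address}: {n} corrected error report(s)")
```

Comparing the busiest address against the tree from `lspci -tv` shows whether the errors come from the port the GPU sits behind or from an unrelated device.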
Does NVIDIA have any other tools which might assist in diagnosing this? The documentation page on XID errors says “NVVS can check for basic GPU health, including the presence of ECC errors, PCIe problems, bandwidth issues, and general problems with running CUDA programs.”, but I can’t locate this program.
I guess you can only get this through a personal support request.
The AER messages are from the PCIe bridge the NVIDIA GPU is connected to, so this could really be the issue. I don’t know if the kernel parameter pci=noaer is a valid workaround, but I’d rather check for a BIOS upgrade and/or contact the mainboard vendor’s support about a replacement.