3060Ti falls off bus when using 3D graphics

tpwatson · November 4, 2021, 7:34pm

nvidia-bug-report.log.gz (125.9 KB)
nvidia-bug-report-aspmoff.log.gz (126.2 KB)

Whenever I try to start an application that uses 3D graphics of moderate intensity, the GPU crashes and falls off the bus. It typically won’t happen just using the Unity desktop or the Firefox browser, but if I try to start a benchmark like Unigine Valley, it crashes while loading. The logs are from Ubuntu 21.10 and driver 470.74.

If I turn off ASPM with “pcie_aspm=off” as a kernel command line parameter, the crashes become intermittent instead of certain. I tried the latest driver, 495.44, but it was completely unable to initialize the graphics card and said RMInitAdapter failed.
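To confirm that a parameter such as pcie_aspm=off actually reached the running kernel, the boot command line can be inspected. The following is only a minimal sketch: the helper names `kernel_param` and `read_cmdline` are hypothetical, and it assumes a Linux system that exposes the command line at /proc/cmdline.

```python
def kernel_param(cmdline: str, name: str):
    """Return the value of `name` in a kernel command line string.

    Returns None if the parameter is absent and "" if it is present
    without a value. If it appears more than once, the last one is returned.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()
```

On the affected machine, `kernel_param(read_cmdline(), "pcie_aspm")` should give `"off"` if the parameter was applied at boot.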

Curiously, the GPU runs fine for hours at a time under gpu_burn, which stresses it with CUDA. This suggests to me that it is not a power or memory issue. Any ideas?

generix · November 5, 2021, 8:56am

You’re getting an Xid 79, “GPU has fallen off the bus”, as you already noticed. The most common causes are overheating or lack of power. Since gpu-burn doesn’t provoke it, I guess temperatures are fine, and so is sustained power draw. So I’m leaning towards power peaks being the reason. Please reseat the power connectors and the card in its slot, try different power connectors, and check or replace the PSU. What PSU model is built into the system?

tpwatson · November 5, 2021, 7:03pm

The PC is an industrial computer with a DC-DC converter for the graphics card. I reseated everything and the behavior did not change. I hooked my oscilloscope up to the graphics card with current and voltage probes and observed that the voltage on the PCIe power connector on the GPU board was rock solid even with high peaks in current consumption. I also observed that the crashes are not correlated with peaks.

I think the PC has a motherboard issue which is causing PCIe communication problems. There are lots of correctable PCIe errors in the kernel logs. I tried the card in another computer and it seemed to work totally fine.

Does NVIDIA have any other tools which might assist in diagnosing this? The documentation page on XID errors says “NVVS can check for basic GPU health, including the presence of ECC errors, PCIe problems, bandwidth issues, and general problems with running CUDA programs.”, but I can’t locate this program.

generix · November 6, 2021, 12:17pm

I guess you can only get that through a personal support request.
The AER messages are from the PCIe bridge the NVIDIA GPU is connected to, so this could really be the issue. I don’t know whether the kernel parameter pci=noaer is a valid workaround, but I’d rather check for a BIOS upgrade and/or contact the mainboard vendor’s support about a replacement.
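To see which bridge the AER messages come from and how they line up with Xid events, the kernel log can be tallied per device. This is a rough sketch under stated assumptions. The two regular expressions reflect the message shapes as I understand them, so check them against your own dmesg output. The sample lines are made up for illustration and are not taken from the attached logs. The name `summarize` is hypothetical.

```python
import re
from collections import Counter

# Assumed message shapes; verify against your own kernel log.
AER_RE = re.compile(r"AER: Corrected error received: (\S+)")
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")

# Illustrative lines only, not taken from the poster's logs.
SAMPLE = [
    "[  100.1] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0",
    "[  100.2] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0",
    "[  101.0] NVRM: Xid (PCI:0000:01:00): 79, pid=1234, GPU has fallen off the bus.",
]


def summarize(lines):
    """Count corrected AER errors per PCI device and collect Xid events."""
    aer = Counter()
    xids = []
    for line in lines:
        m = AER_RE.search(line)
        if m:
            aer[m.group(1)] += 1
            continue
        m = XID_RE.search(line)
        if m:
            xids.append((m.group(1), int(m.group(2))))
    return aer, xids
```

Feeding it the lines of a saved kernel log, for example `summarize(open("dmesg.txt"))` with a hypothetical file name, shows whether the corrected errors cluster on the bridge upstream of the GPU before an Xid 79 appears.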
