GPU has fallen off the bus... Requires your serious attention

On recent Linux distros/kernels, when the machine goes idle or is used only lightly with very little CPU load, the “GPU falls off the bus” and the display freezes. You can still SSH into the machine, but it’s now running without a display card!

I want to stress that this is NEITHER a BIOS nor a power supply (PS) issue: the same cards on the same workstations, with the same BIOS version and the same power supply, work as expected on older distros/kernels…
It doesn’t seem to be a driver issue either! It happens even when the NVIDIA driver isn’t installed!

However, the issue needs some attention from your engineers to figure out why the RTX cards have this problem with recent Linux distros! It’s your product in the end, and we expect you to troubleshoot the issue and tell us what to do!

FYI, none of this is new! The Intel idle C-states are behind it: booting the kernel with idle=nomwait (which disables the Intel idle driver and uses the ACPI driver instead) fixes the issue. However, this consumes too much energy and makes the workstation really noisy and probably hotter!
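To confirm that the idle=nomwait workaround is actually in effect on a given boot, one option is to read the kernel command line and the active cpuidle driver. The following is only a minimal sketch: it assumes a Linux system that exposes /proc/cmdline and /sys/devices/system/cpu/cpuidle/current_driver, and the helper names (kernel_params, read_text, idle_report) are made up for this illustration.

```python
from pathlib import Path


def kernel_params(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to ''.

    Simplification: quoted values containing spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def read_text(path):
    """Return the stripped file contents, or None if the file is unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def idle_report(cmdline_path="/proc/cmdline",
                driver_path="/sys/devices/system/cpu/cpuidle/current_driver"):
    """Collect the idle= boot parameter and the cpuidle driver in use."""
    cmdline = read_text(cmdline_path) or ""
    return {
        "idle": kernel_params(cmdline).get("idle"),  # None if not set
        "cpuidle_driver": read_text(driver_path),    # None if unavailable
    }


if __name__ == "__main__":
    print(idle_report())
```

If the workaround is active, the "idle" entry should read nomwait, and the reported driver should no longer be the Intel idle driver.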

So, this is the situation: if you keep assuming that something is wrong with our hardware, the issue will never get resolved! If there is a hardware problem, then it’s in your PCIe card design, which either lacks some sort of power regulator or needs a firmware update that lets the card cope with the power reduction that happens when the kernel puts the processor into idle via the C-state levels.


I can share what I did, which might give you a clue about where the bug lives… my tests led me to suspect glibc! Even if this doesn’t make sense, here is why:

  • Old distro (RHEL 7 & 8), with older kernel and older glibc… Succeeds.
  • Recent distro (RHEL 9), with recent kernel and recent glibc… Fails on idle.
  • Old kernel compiled on the recent distro, with recent glibc… Fails on idle.
  • Recent kernel compiled on the old distro, with old glibc… Succeeds.

From the tests I’ve made, it works fine on the old distro (with old glibc) regardless of whether the kernel is recent or old.
And it fails on the recent distro, with recent glibc, regardless of whether the kernel is recent or old.

Obviously, the only thing the failure scenarios have in common is glibc. As I understand it, nothing else in the OS is involved in this kind of thing; it’s the kernel and the hardware, and since both were proven to be OK, my suspicion falls on some sort of bug in glibc, even though I’m not sure glibc could be involved in this at all!
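The reasoning over the four test results can be written down as a small table check. This is only an illustrative sketch: the "userspace" column stands for the distro's userspace as a whole (glibc included), because the tests as reported cannot separate glibc from everything else that changed between the old and recent distro. The function name predicts_outcome is made up for this example.

```python
# The four test results reported above.
# "userspace" = old/new distro userspace (including glibc), "ok" = survives idle.
results = [
    {"userspace": "old", "kernel": "old", "ok": True},
    {"userspace": "new", "kernel": "new", "ok": False},
    {"userspace": "new", "kernel": "old", "ok": False},
    {"userspace": "old", "kernel": "new", "ok": True},
]


def predicts_outcome(factor, rows):
    """True if each value of `factor` always goes with the same outcome."""
    seen = {}
    for row in rows:
        # Remember the first outcome seen for this factor value...
        outcome = seen.setdefault(row[factor], row["ok"])
        # ...and report a mismatch if a later row contradicts it.
        if outcome != row["ok"]:
            return False
    return True


for factor in ("userspace", "kernel"):
    print(factor, predicts_outcome(factor, results))
```

With these four rows, the userspace column fully predicts the outcome and the kernel version does not, which matches the conclusion drawn above. It narrows the search to "something outside the kernel that differs between the distros", not to glibc specifically.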

Hi,

Could you try disabling PCIe ASPM in the BIOS and setting ‘pcie_aspm=off’ on the kernel command line?

Regards,
Levei

Thanks…
Interesting feature! I didn’t know about it… I will give it a try and let you know.

Unfortunately this didn’t help: the card still doesn’t wake up after idle (although, from what I read, this kernel option was supposed to disable PCIe idle!)…
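Before ruling ASPM out, it may be worth checking that the setting really took effect on the running system. The sketch below assumes a Linux kernel that exposes /proc/cmdline and /sys/module/pcie_aspm/parameters/policy, where the active policy is shown in square brackets. The helper names are made up for this illustration. As far as I can tell, recent kernel documentation describes pcie_aspm=off as leaving the firmware's ASPM configuration untouched, so the BIOS setting may still matter even with the kernel option set.

```python
import re
from pathlib import Path


def active_aspm_policy(policy_text):
    """Return the bracketed (active) entry of the ASPM policy list.

    For example, 'default performance [powersave]' yields 'powersave'.
    Returns None if no entry is bracketed.
    """
    match = re.search(r"\[(\w+)\]", policy_text)
    return match.group(1) if match else None


def aspm_off_requested(cmdline):
    """True if pcie_aspm=off appears as a token on the kernel command line."""
    return "pcie_aspm=off" in cmdline.split()


def read_or_none(path):
    """Return the stripped file contents, or None if the file is unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_or_none("/proc/cmdline") or ""
    policy = read_or_none("/sys/module/pcie_aspm/parameters/policy")
    print("pcie_aspm=off on command line:", aspm_off_requested(cmdline))
    print("active ASPM policy:",
          active_aspm_policy(policy) if policy else "unavailable")
```

This only reports what the kernel was asked to do. The per-device link state would still need to be checked separately, for example in the LnkCtl lines of lspci -vv run as root.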

I also suspect that when the display goes off on idle, the card doesn’t wake up until it detects the display coming back on, which for some reason never happens! I don’t know if this makes any sense… could this be OS related?!

This week I found a solution that works on my computer. I explained it in this reply. Maybe it can help you.

Hi,
I’ve had a similar issue to the one described in this and many other topics.
The display froze on idle, usually when clicking in the browser or doing nothing, once every few days.

After reading most of this forum for the “fallen off the bus” phrase, I tried many approaches including:

It was also often suggested that the issue is caused by hardware failure, PSU problems, and similar hardware-related issues (which apparently was the case for a minority of people).

Finally, I think the issue in my case was addressed by disabling PCI-e ASPM in BIOS.

For future readers looking for this solution, my setup is:

  • Intel NUC9 - NUC9VXQNB with the newest BIOS at the moment: QNCFLX70.0077.2024.0801.1649 date: 08/01/2024
  • ASUS DUAL GeForce RTX™ 3060 Ti MINI certified by Intel for the above NUC9 device.
  • the current Nvidia driver is: 570.124.04, but I used different versions including the stock 535/550 drivers supplied with Ubuntu and changing the driver did not solve the issue
  • OS: Linux Mint 22.1 Xia base: Ubuntu 24.04 noble

So, if you have this issue on the above hardware, the first thing you should do is upgrade the BIOS and disable PCI-e ASPM, which is found under: Advanced / Power / Secondary Power Settings / PCIe ASPM Support