GPU has fallen off the bus... Requires your serious attention

On recent Linux distros/kernels, when the machine goes idle or is used only lightly with very low CPU load, the GPU suddenly “falls off the bus” and the display freezes. You can still access the machine over ssh, but it’s now running without a display card!

I want to point out that this is NEITHER a BIOS nor a power supply issue: the same cards in the same workstations, with the same BIOS version and the same power supply, work as expected on older distros/kernels…
It doesn’t seem to be a driver issue either! It happens even when the NVIDIA driver isn’t installed!

However, the issue needs some attention from your engineers to figure out why the RTX cards have this problem with recent Linux distros! It’s your product in the end, and we expect you to troubleshoot the issue and tell us what to do!

FYI, this issue is nothing new! The intel_idle C-states are behind it. Booting the kernel with idle=nomwait (which disables the intel_idle driver and uses the ACPI idle driver instead) fixes the issue. However, this consumes too much energy and makes the workstation really noisy and probably hotter!
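For anyone who wants to confirm which idle driver is actually in use after changing the boot options, here is a minimal Python sketch. It assumes the standard Linux locations /proc/cmdline and /sys/devices/system/cpu/cpuidle/current_driver; the function names (kernel_params, idle_report) are just my own helpers, not part of any tool.

```python
from pathlib import Path
from typing import Dict, Optional

# Standard Linux locations (assumed to be present on the machine being checked).
CMDLINE = Path("/proc/cmdline")
CPUIDLE_DRIVER = Path("/sys/devices/system/cpu/cpuidle/current_driver")


def kernel_params(cmdline: str) -> Dict[str, str]:
    """Split a kernel command line into {name: value}; bare flags map to ""."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def read_text(path: Path) -> Optional[str]:
    """Return the stripped file content, or None if the file cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def idle_report(cmdline: str, driver: Optional[str]) -> str:
    """Summarise the idle= boot option and the active cpuidle driver."""
    idle = kernel_params(cmdline).get("idle", "(not set)")
    return f"idle={idle}, cpuidle driver={driver or 'unknown'}"


if __name__ == "__main__":
    print(idle_report(read_text(CMDLINE) or "", read_text(CPUIDLE_DRIVER)))
```

With idle=nomwait in effect I would expect the reported driver to be acpi_idle rather than intel_idle, but that is an expectation to verify on the affected machine, not a guaranteed output.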

So this is the situation: if you continue to assume that something is wrong with our hardware, the issue will never get resolved! If something is wrong with the hardware, then it’s your PCIe card design, which either lacks some sort of power regulator or requires a firmware update that lets the card cope with the power reduction that happens when the kernel puts the processor into idle via the C-state levels.


Let me share what I did, which might give you a clue about where the bug lives… My tests led me to suspect glibc! That may not make sense, but here is why:

  • Old distro (RHEL 7 & 8), with older kernel and older glibc… Succeeded.
  • Recent distro (RHEL 9), with recent kernel and recent glibc… Fails on idle.
  • Old kernel compiled on recent distro, with recent glibc… Fails on idle.
  • Recent kernel compiled on old distro, with old glibc… Succeeded.

From the tests I’ve made, it works fine on the old distro (with old glibc) regardless of whether the kernel is recent or old.
And it fails on the recent distro (with recent glibc) regardless of whether the kernel is recent or old.

Obviously, the only thing the failure scenarios have in common is glibc. As I understand it, nothing else in the OS is involved in this kind of thing: it’s the kernel and the hardware. Since both have been shown to be OK, my suspicion falls on some sort of bug in glibc, even though I’m not sure glibc could be involved in this at all!
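The four results above can be tabulated to see which factor lines up with the failures. This is only a sketch of the reasoning in Python. The column name distro_glibc is my own label: in these tests the glibc version and the distro change together, so the table cannot separate glibc from the rest of the distro.

```python
# The four test results from the list above; True means the GPU survived idle.
tests = [
    {"kernel": "old", "distro_glibc": "old", "ok": True},   # RHEL 7/8 as shipped
    {"kernel": "new", "distro_glibc": "new", "ok": False},  # RHEL 9 as shipped
    {"kernel": "old", "distro_glibc": "new", "ok": False},  # old kernel on RHEL 9
    {"kernel": "new", "distro_glibc": "old", "ok": True},   # new kernel on RHEL 7/8
]


def predicts_outcome(factor, rows):
    """True if each value of `factor` always goes with the same outcome."""
    seen = {}
    for row in rows:
        # Remember the first outcome seen for this value; any mismatch disqualifies it.
        if seen.setdefault(row[factor], row["ok"]) != row["ok"]:
            return False
    return True


result = {f: predicts_outcome(f, tests) for f in ("kernel", "distro_glibc")}
print(result)  # {'kernel': False, 'distro_glibc': True}
```

The kernel version alone does not predict the outcome, while the distro/glibc column does, which is what points the suspicion away from the kernel.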

Hi,

Could you try disabling PCIe ASPM in the BIOS and setting ‘pcie_aspm=off’ on the kernel command line?

Regards,
Levei

Thanks…
Interesting feature! I didn’t know about it… I will give it a try and let you know.

Unfortunately this didn’t help: the card still doesn’t wake up after idle (although, from what I read, this kernel option was supposed to disable PCIe idle (ASPM)!)…
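To rule out the option simply not being applied, a quick check like the following might help. It is a minimal Python sketch that assumes the usual /proc/cmdline and /sys/module/pcie_aspm/parameters/policy files, where the policy file lists the available policies with the active one in square brackets. The helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Standard Linux locations (assumed; the policy file may be absent on some kernels).
CMDLINE = Path("/proc/cmdline")
ASPM_POLICY = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(policy_line: str) -> Optional[str]:
    """Return the bracketed entry, e.g. 'powersave' from 'default [powersave]'."""
    for word in policy_line.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def aspm_disabled_on_cmdline(cmdline: str) -> bool:
    """True if pcie_aspm=off appears as its own token on the command line."""
    return "pcie_aspm=off" in cmdline.split()


def read_text(path: Path) -> str:
    """Return the file content, or an empty string if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return ""


if __name__ == "__main__":
    print("pcie_aspm=off on cmdline:", aspm_disabled_on_cmdline(read_text(CMDLINE)))
    print("active ASPM policy:", active_aspm_policy(read_text(ASPM_POLICY)))
```

This only confirms what the running kernel was booted with and which policy it reports; it does not show what the BIOS setting is doing.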

I also suspect that when the display goes off on idle, the card won’t wake up until it detects the display coming back on, which for some reason doesn’t happen! I don’t know if this makes any sense… could this be an OS-related thing?!