I think I can finally close this thread. For over a month, I have not had any "fell off the bus" issues. I ended up not replacing my PCI-E 4.0 risers, so the new, better PCI-E 5.0 risers I spent about $200 on just sit on a shelf doing nothing - I may use them in a new rig that I will probably build in the next few years.
I will briefly describe my software configuration:
Kernel: 6.14.0-15-generic #15-Ubuntu
Nvidia driver: 570.133.07
I have found this to be a stable combination so far.
As for what helped, I am not sure exactly, so I will just describe what I have tried. In /etc/default/grub I have this line:
GRUB_CMDLINE_LINUX_DEFAULT="ignore_rlimit_data pcie_aspm=off rcutree.rcu_idle_gp_delay=1 acpi_osi=! acpi_osi="Linux""
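On Ubuntu, a change to /etc/default/grub only takes effect after running `sudo update-grub` and rebooting. In case it is useful to anyone, here is a minimal sketch for checking that the options actually reached the running kernel. The function name `missing_params` and the `EXPECTED` list are my own for illustration, and I left the two `acpi_osi` entries out of the check because their quoting may look different once the boot loader has processed them.

```python
# Parameters copied from the GRUB line above (acpi_osi entries left out
# on purpose, since their quoting may differ on the kernel command line).
EXPECTED = [
    "ignore_rlimit_data",
    "pcie_aspm=off",
    "rcutree.rcu_idle_gp_delay=1",
]


def missing_params(cmdline, expected=EXPECTED):
    """Return the expected parameters that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [p for p in expected if p not in present]
```

To check the running system, pass it the contents of /proc/cmdline, for example `missing_params(open("/proc/cmdline").read())`. An empty list means all three options are active.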
Also, when moving the CPU back into my current Gigabyte motherboard from the Gooxie one I tried (where the GPUs fell off the bus pretty quickly), I reseated the CPU twice (fully putting it in and locking it, then unlocking it, taking it out and putting it back in).
Before putting it in the Gooxie board, I also cleaned its pads with isopropyl alcohol (IPA), although that did not stop the GPUs from falling off the bus on the Gooxie board.
Obviously, I also replugged everything else on the motherboard. It is worth mentioning, though, that just replugging the PCI-E risers on the motherboard and the cards did nothing on its own: I did that so many times that I lost count, to no avail. Replugging the power connectors, and even replacing the PSU entirely, had no effect on the issue either.
But right now, the system is stable. The reason I did not replace the risers is that I would need to fully rebuild my rig to do that, since the new ones are shorter. So I decided to wait for the next time the GPUs fell off the bus, but it never happened... and hopefully never will.
I did notice two things, however: the falling off the bus issue is never preceded by a logged PCI-E error, so whatever causes it leaves no trace in the logs, and it always knocks all GPUs off the bus at once, probably due to the driver crashing.
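For anyone who wants to check the same thing on their own system, here is a rough sketch of how the kernel log could be scanned for PCI-E errors shortly before each drop. The two patterns are assumptions about what the log lines look like (the NVIDIA driver message containing "fallen off the bus", and PCI-E error reports containing "AER" or "PCIe Bus Error"), and the function name and the `window` size are made up for illustration, so adjust them to your own logs.

```python
import re

# Assumed log patterns; adjust them to what your kernel log actually prints.
FELL_OFF = re.compile(r"fallen off the bus", re.IGNORECASE)
PCIE_ERR = re.compile(r"\bAER\b|PCIe Bus Error")


def pcie_errors_before_drop(lines, window=50):
    """For each 'fallen off the bus' line, collect the PCI-E error lines
    found among the `window` lines that come right before it."""
    results = []
    for i, line in enumerate(lines):
        if FELL_OFF.search(line):
            start = max(0, i - window)
            preceding = [l for l in lines[start:i] if PCIE_ERR.search(l)]
            results.append((line, preceding))
    return results
```

The input can be the lines of a saved kernel log, for example the output of `journalctl -k` written to a file. In my case, the list of preceding errors would be empty for every drop.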
At this point, it is hard for me to tell what exactly helped. Was it purely a software issue, and I found a stable kernel and driver combination? Was it maybe electromagnetic interference that is no longer there after I put everything back together? Or maybe reseating the CPU helped... or even a combination of things, including some of the options in the GRUB config. This still narrows it down, however: I tried many kernel options that had no effect at all, and these ones had no effect on their own either. So even if they helped, it was in combination with something else that I mentioned.
For now, the system finally seems to be stable, with no sign of the issue for over a month. I would like to thank everyone who shared advice or a suggestion, or even just took the time to read the messages.