This did not take long: the GPUs fell off the bus again. Setting Performance mode on all cards did not help; in less than 9 hours of uptime, the GPUs fell off the bus once more. After a reboot, with Adaptive mode, I do not see higher than usual idle power consumption, so the suspend/resume workaround is not applicable. It was not applicable in Performance mode either, so I conclude that the issue at hand is not related to the other Nvidia driver bug that causes higher than usual idle power consumption.
Here, I attach the latest debug log:
nvidia-bug-report.log.gz (245.4 KB)
But I do not know what else to try this time.
I think I have exhausted all possibilities, and the bug seems to manifest completely at random: it may happen within just a few hours, or not happen for more than two weeks, long enough that I started to think it was solved. But apparently not.
However, I have ruled out hardware and power related issues, so an Nvidia driver bug on the MZ32-AR1-rev-30 motherboard is the only explanation as far as I can tell, at least in combination with 3090 cards. I know the video cards, cables and PSU are good because they worked well on the gaming motherboard where I had them previously. On the new motherboard, reducing the number of connected GPUs or trying a different cable or PSU does not help. As I mentioned, I also tried replacing the new MZ32-AR1 motherboard with another new one, and it made no difference. The motherboard itself is completely stable without Nvidia cards, so I am sure it is good. The BIOS is the latest available version. Hence, an Nvidia driver bug is the only remaining explanation.
If anyone can suggest a possible workaround, or tell me what additional information I should provide to pinpoint the issue further, I would be very grateful.
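In case it helps anyone compare notes, here is a minimal sketch of how the Xid events could be pulled out of a kernel log to see which GPU drops and how often. It assumes the usual `NVRM: Xid (PCI:...): <code>, ...` line format (a GPU falling off the bus is typically reported as Xid 79). The function names and the sample lines in the usage note are made up for illustration and are not taken from my attached log.

```python
import re

# Assumed format of an NVRM Xid line in the kernel log, for example:
#   NVRM: Xid (PCI:0000:41:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),?\s*(.*)")


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code, message) for every Xid line."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            pci, code, message = match.groups()
            events.append((pci, int(code), message.strip()))
    return events


def count_by_gpu(events):
    """Count events per (pci_address, xid_code) pair."""
    counts = {}
    for pci, code, _message in events:
        key = (pci, code)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

The text passed to `find_xid_events` could be the saved output of `dmesg` or `journalctl -k`. If the counts show the same PCI address every time, that would point at one slot or card; if they are spread across all cards, it looks more like a platform or driver problem.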