System freeze - BlackMagic Resolve - CentOS - driver 381.22

Hi all
I’m consulting for a top worldwide VFX studio.
Personally, I think the cause is that they have both GTX cards and a Quadro in the same system.
That was a big no-no in another case I had, though that one was on Windows with VRED.

Problem description from the customer below. I have the bug report as a .gz if needed.

"Basically the problem we are having results in the error “NVRM: GPU at 0000:88:00.0 has fallen off the bus.” which results in a kernel crash and an automatic reboot.

The cards in question are 4x MSI Founders Edition GTX 1080Ti. We have four of the 1080 Ti cards plus a Quadro M4000 in a Supermicro SYS-4028GR-TR chassis.

We run CentOS 7.3 with the latest updates. Right now we are on driver 381.22. Earlier we ran 381.09 (beta)."
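
To see whether it is always the same card that drops out, the kernel log can be scanned for the NVRM message quoted above. A minimal sketch, assuming the message keeps exactly the wording in the customer's report (other driver versions may phrase it differently); the function name find_fallen_gpus is only illustrative:

```python
import re

# Matches the NVRM message in the form quoted in the report and captures
# the PCI address (domain:bus:device.function). The exact wording is an
# assumption taken from that one quoted line.
FALLEN_OFF_BUS = re.compile(
    r"NVRM: GPU at "
    r"([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F])"
    r" has fallen off the bus"
)


def find_fallen_gpus(log_text):
    """Return the PCI addresses reported as fallen off the bus.

    Addresses are returned in order of first appearance, without duplicates.
    """
    seen = []
    for match in FALLEN_OFF_BUS.finditer(log_text):
        addr = match.group(1)
        if addr not in seen:
            seen.append(addr)
    return seen


sample = "[ 1234.5] NVRM: GPU at 0000:88:00.0 has fallen off the bus."
print(find_fallen_gpus(sample))
```

Feeding it saved kernel log text from several crashes would show whether 0000:88:00.0 is the only address that ever appears, which would point at one card or one slot rather than the whole system.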

Any insights appreciated.

Many thanks in advance
WBR
Mats

Supermicro plus compute load, maybe this thread applies:
https://devtalk.nvidia.com/default/topic/525144/tesla-k10-34-has-fallen-off-the-bus-34-/

The thread generix linked to has some good suggestions. Usually, when a GPU falls off the bus under heavy load, it’s either because it’s overheating, or because the system’s power supply can’t handle such a high sustained load. It’s also possible that your particular motherboard or set of GPUs is flaky. There’s a small chance that a driver bug is causing the problem, but if that’s the case you’ll need to get a reliable set of reproduction steps for us to investigate.
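
As a rough way to check the overheating and power theories, temperature and power draw can be logged with nvidia-smi while the workload runs. The sketch below is only an illustration: it parses the CSV output of nvidia-smi's --query-gpu option and flags cards that look hot or close to their power limit. The thresholds (85 C and 95% of the power limit) and the function names are assumptions picked for the example, not values from NVIDIA.

```python
import subprocess

# Fields requested from nvidia-smi; all four are standard --query-gpu fields.
QUERY = "index,temperature.gpu,power.draw,power.limit"


def read_nvidia_smi():
    """Run nvidia-smi once and return its CSV output (needs the NVIDIA driver)."""
    return subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )


def _to_float(field):
    """Convert a CSV field to float; unsupported readings become None."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_readings(csv_text):
    """Turn nvidia-smi CSV text into a list of per-GPU dictionaries."""
    readings = []
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 4:
            continue  # skip lines that do not match the requested fields
        readings.append({
            "index": int(fields[0]),
            "temp_c": _to_float(fields[1]),
            "power_w": _to_float(fields[2]),
            "limit_w": _to_float(fields[3]),
        })
    return readings


def suspicious(readings, temp_limit_c=85.0, power_fraction=0.95):
    """Return indices of GPUs that are hot or near their power limit.

    Both thresholds are illustrative assumptions, not vendor limits.
    """
    flagged = []
    for r in readings:
        hot = r["temp_c"] is not None and r["temp_c"] >= temp_limit_c
        near_cap = (
            r["power_w"] is not None
            and r["limit_w"] is not None
            and r["power_w"] >= power_fraction * r["limit_w"]
        )
        if hot or near_cap:
            flagged.append(r["index"])
    return flagged
```

Calling suspicious(parse_readings(read_nvidia_smi())) every few seconds and writing the result to a file on disk would leave a record of the last readings before a reboot. Note that this only shows per-GPU draw; it does not measure whether the power supply itself can sustain the combined load.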

Hi again
Many thanks for the great responses!

A summary from my POV:

  • Not likely caused by mixing card families (GTX + Quadro).
  • Most likely either a driver issue or system instability/insufficiency in the motherboard or PSU.

I summarized this for the customer and attached a copy of all related threads.

Have a great weekend!

WBR
Mats