CUDA error on multiple machines: GPU has fallen off the bus

Hi everyone!

I have this strange issue: during normal operation of my production machines, the system starts to fail with NVRM: GPU 0000:06:00.0: GPU has fallen off the bus. I have tried to collect all possible information about the system. I run containerized software that uses nvidia-docker2.

I have tried updating the driver to:

version:        460.84
srcversion:     EA32CEBBA576FA0CDF3786B
vermagic:       4.15.0-144-generic SMP mod_unload modversions

But to no avail; it keeps crashing randomly.

I have multiple machines and all the power supplies are working as intended, so I do not believe it is a hardware issue. Any help?

Thanks!

nvidia-bug-report.log.gz (904.2 KB)

The usual problems here are either a power issue (e.g. an overloaded power supply) or a thermal issue (GPUs getting too hot). Neither of these is a software issue, nor can they be fixed with software. Neither can be adequately diagnosed from an nvidia-bug-report.log file. The fact that you have already tested two different drivers makes a driver bug less likely.

You can check for a thermal issue, if one exists, by running nvidia-smi in a loop while your workload is running (perhaps logging to a file) and watching the temperatures of the GPU that falls off the bus.
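As one way to act on this, here is a minimal sketch of how you might scan such a temperature log for suspiciously hot readings. It assumes the log was produced with something like `nvidia-smi --query-gpu=timestamp,index,temperature.gpu --format=csv,noheader -l 5 >> gpu_temps.csv`; the 85 °C threshold and the sample values are illustrative assumptions, not NVIDIA guidance — check the rated limits for your specific GPU.

```python
# Sketch: flag overheating GPUs in a CSV temperature log produced by, e.g.:
#   nvidia-smi --query-gpu=timestamp,index,temperature.gpu --format=csv,noheader -l 5 >> gpu_temps.csv
# The 85 C threshold is an assumption; check your GPU's rated limits.

def overheating_events(lines, threshold_c=85):
    """Return (timestamp, gpu_index, temp_c) tuples at or above threshold_c."""
    events = []
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip malformed or truncated log lines
        timestamp, index, temp = parts
        try:
            temp_c = int(temp)
        except ValueError:
            continue  # skip non-numeric readings, e.g. "[Unknown Error]"
        if temp_c >= threshold_c:
            events.append((timestamp, int(index), temp_c))
    return events

# Demonstration on hypothetical sample log lines:
sample = [
    "2021/06/15 10:00:01.000, 0, 72",
    "2021/06/15 10:00:06.000, 0, 91",  # suspiciously hot just before a crash
]
print(overheating_events(sample))  # [('2021/06/15 10:00:06.000', 0, 91)]
```

If the last readings before each crash cluster near the GPU's thermal limit, that points strongly at a cooling problem rather than a driver one.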

A power supply issue is usually discovered by test replacement or by careful analysis of the load. For example, remove one or more GPUs from the server and see if the failures disappear.

As Robert Crovella says, an insufficiently sized power supply is by far the most common reason for GPUs falling off the bus. This could mean the PSU itself is undersized, or there is some issue with the auxiliary power cabling (avoid daisy-chaining, Y-splitters, and converters).

Beyond the two reasons already mentioned, rarer reasons include:

(3) Use of (poor-quality) PCIe riser cards. This scenario is seen more often in systems used for cryptocurrency mining.
(4) Issues with suspend/resume cycles. This scenario is seen more often in mobile systems; check the ASPM settings in the system BIOS and try turning ASPM off to see whether that helps.
(5) Very rarely, dirty contacts in a PCIe slot or a GPU badly seated in a PCIe slot; make sure GPUs are properly secured at the bracket.
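If you suspect reason (4), you can check what the OS is currently doing with ASPM before digging into the BIOS. A minimal sketch, assuming Linux and the standard `pcie_aspm` sysfs interface (the file path and its bracketed-value format are common on mainstream distributions, but not guaranteed everywhere):

```python
# Sketch: report the active PCIe ASPM policy on Linux. The sysfs file
# /sys/module/pcie_aspm/parameters/policy lists all policies, with the
# active one in brackets; BIOS/UEFI settings may still override it.
import re

def active_aspm_policy(text):
    """Extract the bracketed (active) policy from the sysfs policy string."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None

# On a live system you would read the sysfs file:
#   with open("/sys/module/pcie_aspm/parameters/policy") as f:
#       print(active_aspm_policy(f.read()))
# Demonstration on a sample value:
print(active_aspm_policy("default performance [powersave] powersupersave"))  # powersave
```

If the active policy is a power-saving one and the crashes correlate with idle periods or suspend/resume, trying `performance` (or disabling ASPM in the BIOS) is a cheap experiment.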