I have a strange issue: during normal operation, my production machines start to fail with the following kernel message:
NVRM: GPU 0000:06:00.0: GPU has fallen off the bus.
I have tried to collect all possible information about the system. I run containerized software that uses nvidia-docker2.
I have tried updating the driver to:
version: 460.84
srcversion: EA32CEBBA576FA0CDF3786B
vermagic: 4.15.0-144-generic SMP mod_unload modversions
But to no avail; it keeps crashing randomly.
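In case it helps to check whether the failures really are random, here is a minimal sketch of how the relevant lines could be pulled out of a saved kernel log and compared across machines. The function name find_bus_drops and the two regex names are hypothetical, and I am assuming dmesg-style lines where an optional "[seconds]" timestamp precedes the message; only the NVRM message text itself comes from my logs.

```python
import re

# The NVRM message reported above, with the PCI address captured.
BUS_DROP = re.compile(
    r"NVRM: GPU (?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]): "
    r"GPU has fallen off the bus"
)

# Assumption: dmesg-style "[  12345.678901]" seconds-since-boot timestamp.
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def find_bus_drops(lines):
    """Return (seconds_since_boot or None, pci_address) for each matching line."""
    events = []
    for line in lines:
        match = BUS_DROP.search(line)
        if match is None:
            continue
        stamp = TIMESTAMP.search(line)
        seconds = float(stamp.group(1)) if stamp else None
        events.append((seconds, match.group("pci")))
    return events
```

It takes any iterable of log lines, so an open file object for a saved dmesg or kern.log dump can be passed in directly. Lines without a recognizable timestamp are still reported, with None in place of the time.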
I have multiple machines, and all the power supplies are working as intended, so I do not believe it is a hardware issue. Any help?
nvidia-bug-report.log.gz (904.2 KB)