We have a liquid-cooled Titan X Pascal machine (4 cards) that has been working fine for months. Suddenly, one of the cards overheats for no apparent reason until it becomes unresponsive. There is no load on the GPUs, and the situation persists after a reboot.
The OS is Ubuntu 14.04.
nvidia-smi shows this:
$ nvidia-smi
Unable to determine the device handle for GPU 0000:03:00.0: Unknown Error
/var/log/syslog is full of:
kernel: [ 4215.450977] NVRM: request_irq() failed (-22)
kernel: [ 4217.554090] NVRM: RmInitAdapter failed! (0x23:0x56:451)
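In case it helps anyone checking for the same symptom, here is a rough Python sketch for counting how often each NVRM message appears in the log. The function name summarize_nvrm and the regular expression are my own illustration, not part of any NVIDIA tool, and the pattern assumes kernel timestamps are enabled as in the lines above.

```python
import re
from collections import Counter

# Assumed line shape (taken from the syslog excerpt above):
#   kernel: [ 4215.450977] NVRM: request_irq() failed (-22)
NVRM_LINE = re.compile(r"kernel: \[\s*\d+\.\d+\] NVRM: (.*)")

def summarize_nvrm(lines):
    """Count each distinct NVRM message in an iterable of syslog lines."""
    counts = Counter()
    for line in lines:
        match = NVRM_LINE.search(line)
        if match:
            counts[match.group(1).strip()] += 1
    return counts

# Example usage (reading the log usually needs root or adm group access):
#   with open("/var/log/syslog", errors="replace") as f:
#       for message, n in summarize_nvrm(f).most_common():
#           print(n, message)
```

This only tells you which messages repeat and how often; it does not say which card they come from.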
We suspect the card may be damaged, but would appreciate any insights before opening the machine (because of the liquid cooling).
nvidia-bug-report.log.gz (253 KB)