We’re having trouble getting our TYAN S7015-based 8 x GTX 580 GPU box to run reliably.
After a fresh reboot, our CUDA regression tests usually run fine for quite a while. However, after a seemingly random period (typically a day or so), some of the cards start reporting NVRM Xid errors. From dmesg:
NVRM: Xid (0084:00): 13, 0001 00000000 000090c0 00001b0c 00000000 00000000
After this, the GPU at 0084:00 goes into a rather dodgy state: kernels randomly fail with “launch failed” errors and generally produce incorrect results. It seems the Xid error causes some kind of GPU memory corruption that the card is unable to recover from. Reloading the nvidia kernel module doesn’t help, and even after a soft reboot the failing card often still behaves badly. Only power cycling the whole server reliably restores the GPU to a good state.
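For reference, this is roughly how our regression tests detect the bad state (a minimal sketch; test_kernel is just a placeholder for one of our real kernels):

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* test_kernel stands in for one of our real regression kernels */
    __global__ void test_kernel(float *out)
    {
        out[threadIdx.x] = 2.0f * threadIdx.x;
    }

    int main(void)
    {
        float *d_out;
        cudaMalloc((void **)&d_out, 64 * sizeof(float));

        test_kernel<<<1, 64>>>(d_out);

        /* check both the launch itself and the subsequent execution */
        cudaError_t launch_err = cudaGetLastError();
        cudaError_t sync_err   = cudaDeviceSynchronize();
        if (launch_err != cudaSuccess || sync_err != cudaSuccess) {
            fprintf(stderr, "GPU in bad state: launch=%s sync=%s\n",
                    cudaGetErrorString(launch_err),
                    cudaGetErrorString(sync_err));
            cudaFree(d_out);
            return 1;
        }

        cudaFree(d_out);
        return 0;
    }

On a failing card, the synchronize call typically comes back with an “unspecified launch failure” style error, even for kernels that are known-good on the other GPUs.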
My first question is: is there any way to trigger a GPU “hard reset” from software, without power cycling the server, to try to work around this kind of error? Something like nvidia-smi --hard-reset -g 0 would be really useful. Is this even possible with current hardware?
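For concreteness, here’s the kind of thing we have in mind, as an untested sketch: detaching the device from the PCI bus via sysfs and rescanning, assuming the failing card shows up as 0000:84:00.0 in lspci (adjust for your topology) and the nvidia module has been unloaded first:

    #include <stdio.h>
    #include <stdlib.h>

    /* write a value to a sysfs attribute; must be run as root */
    static void write_sysfs(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (f == NULL) {
            perror(path);
            exit(EXIT_FAILURE);
        }
        fputs(val, f);
        fclose(f);
    }

    int main(void)
    {
        /* detach the failing GPU from the PCI bus, then rescan,
           hoping the bridge re-initializes the card */
        write_sysfs("/sys/bus/pci/devices/0000:84:00.0/remove", "1");
        write_sysfs("/sys/bus/pci/rescan", "1");
        return 0;
    }

We don’t know whether a remove/rescan like this is enough to fully re-initialize a wedged Fermi card, hence the question.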
My second question is: does anyone else out there have experience running CUDA under Linux on an 8 x GTX 5x0 GPU server similar to ours? Does it work for you, or do you run into similar problems? Any suggestions would be greatly appreciated.
nvidia-bug-report.log.gz attached… Btw, we’re loading the kernel module with NVreg_EnableMSI=1, which seems to improve system stability.
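For completeness, we set that option from a modprobe config file (the file name is our choice; any modprobe.d file works):

    # /etc/modprobe.d/nvidia.conf
    # enable Message Signaled Interrupts for the nvidia driver
    options nvidia NVreg_EnableMSI=1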