We have a server with 4 GTX 580 cards.
When we run a simple CUDA matrix-multiplication benchmark on all 4 boards, it crashes after half an hour.
- cudaSetDevice() fails,
- calls to nvidia-smi hang.
less /var/log/messages | grep 'NVRM'
gives:
Jul 5 22:16:56 guppy4 kernel: [2259437.389739] NVRM: GPU at 0000:83:00.0 has fallen off the bus.
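To see which of the 4 boards is affected over several runs, the PCI address can be pulled out of such messages. This is only a minimal sketch: the function name fallen_gpus and the regular expression are my own assumptions, based solely on the wording of the log message above, and other NVRM messages may be phrased differently.

```python
import re

# Pattern derived from the single "fallen off the bus" message shown above.
# Assumption: the PCI address always has the form dddd:bb:dd.f in hex.
FALLEN_RE = re.compile(
    r"NVRM: GPU at\s+"
    r"([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F])"
    r"\s+has fallen off the bus"
)


def fallen_gpus(log_text):
    """Return the PCI addresses of all GPUs reported as fallen off the bus,
    in the order they appear in log_text."""
    return FALLEN_RE.findall(log_text)


# The log line from this post, used as sample input.
sample = (
    "Jul 5 22:16:56 guppy4 kernel: [2259437.389739] "
    "NVRM: GPU at 0000:83:00.0 has fallen off the bus."
)
print(fallen_gpus(sample))
```

Feeding it the contents of /var/log/messages after each crash would show whether it is always the same board (pointing at that card or its slot) or a different one each time.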
Thanks in advance for any help.