nVidia card has fallen off the bus

Dear all,
I am starting to program with CUDA, but I am facing a problem that is very hard to fix. After some time the system gives the error:
NVRM: GPU at 0000:03:00.0 has fallen off the bus
And the computer needs to be powered off before the nVidia card is detected again.
At first I thought it was a fault in my code: if I ran the same executable 1000 times, the first 200 iterations were OK and gave the same output, but then the system gave the aforementioned error and all the remaining iterations failed. I then took the matrixMul example from CUDA, compiled it, and ran it 1000 times. The same error happened around iteration 200! That pointed me to a driver problem. Therefore, and unfortunately without any success, I tested the same procedure with:

  • Several drivers: some old ones (which Google results suggested could fix the problem), the latest long-lived, the latest experimental, beta, etc.
  • CUDA 5 and CUDA 4.2 with the aforementioned drivers
  • I booted on text only without
  • I removed the Xorg server completely
  • I enabled persistence mode.
None of the above worked.
Please remember how simple the test is: I compile the matrixMul example and run the executable 1000 times. I also tested this on my MacBook Pro and everything went fine (although of course with a different OS, card, etc.). I am clueless right now. What I haven’t tested yet:
  • Another kernel version.
  • Another linux distribution.
This is my system info:
Ubuntu 12.04.2
CUDA 5
Current driver version : 313.30
Ubuntu kernel : 3.2.
g++ version : 4.6
nVidia card : Quadro 4000 (GF100)
I am compiling with a plain make command, following the examples exactly without modifying them.
Please, if you have any suggestions, let me know.
Thanks in advance.

This sounds like it could be bad hardware… either a bad motherboard, a bad card, or a bad power supply (or not enough power supplied to the card). You’d have to find out which one it is by going through these possibilities one at a time until the problem goes away…

i.e.:

  1. try card with different power supply - works? done, problem is power supply, otherwise
  2. try moving card from one slot of mobo to another - works? done, problem is mobo slot, otherwise
  3. try moving card to another system - works? done, otherwise bad card or bad mobo

my guess is that (1) is the issue… but please report back if you solve it.
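
When reporting back after each of the steps above, it may help to check the kernel log for the error rather than relying on the program's output alone. The following is a rough sketch that only matches the message format quoted in the original post; other driver versions may word it differently, and the names FALLEN_OFF_BUS and find_fallen_gpus are hypothetical.

```python
import re

# Matches lines such as:
#   NVRM: GPU at 0000:03:00.0 has fallen off the bus
# Assumption: the driver uses exactly this wording.
FALLEN_OFF_BUS = re.compile(r"NVRM: GPU at (\S+) has fallen off the bus")


def find_fallen_gpus(log_text):
    """Return the PCI addresses of GPUs reported as fallen off the bus."""
    return FALLEN_OFF_BUS.findall(log_text)
```

The log_text argument could be the captured output of dmesg, for example. An empty list after a full 1000-run test would suggest that the last hardware change removed the problem.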