I am just starting to program with CUDA, but I am facing a problem that is very hard to fix. After some time, the system gives this error:
NVRM: GPU at 0000:03:00.0 has fallen off the bus
After that, the computer needs to be powered off before the nVidia card is detected again.
At first I thought it was a fault in my code: when I ran the same executable 1000 times, the first 200 or so runs were OK and gave the same output, but then the system gave the aforementioned error and all the remaining runs failed. I then took the matrixMul example from CUDA, compiled it, and ran it 1000 times. The same error happened around iteration 200! That pointed me to a driver problem. I therefore repeated the same procedure with the following changes, unfortunately without any success:
- Several drivers: some old ones (which Google results suggested could fix the problem), the latest long-lived, the latest experimental, beta, etc.
- CUDA 5 and CUDA 4.2 with the aforementioned drivers
- I booted on text only without
- I removed the Xorg server completely
- I enabled persistence mode.
None of the above worked.
To recap the very simple test: I compile the matrixMul example and run the executable 1000 times. I also tested this on my MacBook Pro and everything went fine (although of course with a different OS, card, etc.). I am clueless right now. What I haven’t tested yet:
- Another kernel version.
- Another Linux distribution.
This is my system info:
Current driver version : 313.30
Ubuntu kernel : 3.2.
g++ version : 4.6
nVidia Card : Quadro 4000 (GF 100)
I am compiling with the simple make command, following exactly the examples without modifying them.
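For anyone who wants to reproduce this, the repeated-run test can be sketched roughly as below. This is only a minimal sketch, not the exact script used here: the helper name run_repeatedly and the path ./matrixMul are hypothetical, and it assumes the sample exits with a nonzero status when a CUDA call fails.

```python
import subprocess
import sys


def run_repeatedly(cmd, runs=1000):
    """Run cmd up to `runs` times.

    Returns the 1-based index of the first run that exits with a
    nonzero status, or None if every run exits with status 0.
    """
    for i in range(1, runs + 1):
        # Capture stdout and stderr together so the failing output can be shown.
        result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        if result.returncode != 0:
            print("run %d failed with exit code %d" % (i, result.returncode))
            print(result.stdout.decode(errors="replace"))
            return i
    return None


# Only runs when a command is given, e.g.: python repeat.py ./matrixMul
# (./matrixMul is an assumed path to the compiled sample)
if __name__ == "__main__" and len(sys.argv) > 1:
    first_failure = run_repeatedly(sys.argv[1:])
    sys.exit(0 if first_failure is None else 1)
```

Stopping at the first failing run makes it easy to see at which iteration the card drops out, and to check the kernel log right after it happens.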
Please, if you have any suggestion, let me know.
Thanks in advance.