GPU at 0000:02:00.0 has fallen off the bus.

This morning I found this in my logs:

Nov 25 05:33:46 essa-s-n1 kernel: [60232.517354] NVRM: GPU at 0000:02:00.0 has fallen off the bus.

Does anyone know what this means?

I’m using CUDA 4.0 and driver 280.13.

Regards
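For anyone who wants to check their own logs for the same message, here is a minimal sketch that pulls the PCI address out of lines like the one quoted above. The regular expression is built from that single sample line only, and the function name `find_fallen_gpus` is made up for this example, so other driver versions may word the message differently.

```python
import re

# Built from the single NVRM line quoted in this thread; the exact wording
# in other driver versions is an assumption.
FALLEN_OFF_BUS = re.compile(
    r"NVRM: GPU at "
    r"(?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]) "
    r"has fallen off the bus"
)


def find_fallen_gpus(lines):
    """Return the PCI addresses named in 'fallen off the bus' messages."""
    hits = []
    for line in lines:
        match = FALLEN_OFF_BUS.search(line)
        if match:
            hits.append(match.group("pci"))
    return hits


if __name__ == "__main__":
    sample = [
        "Nov 25 05:33:46 essa-s-n1 kernel: [60232.517354] "
        "NVRM: GPU at 0000:02:00.0 has fallen off the bus.",
    ]
    print(find_fallen_gpus(sample))  # prints ['0000:02:00.0']
```

To scan a real log, pass an open file object (for example the kernel log, wherever your distribution keeps it) instead of the sample list.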

I have exactly the same problem.
The video card I use for display (9500GT, Ubuntu 10.04, CUDA 4.1, driver 285.05.15) randomly crashes with this error. It has nothing to do with CUDA (I have other cards dedicated to computation) and occurs even when I am not running any CUDA programs.
In my case this error shows up with driver 285.05.15.

The behaviour is strange: we ran driver 280.10 all weekend (a CUDA program running for 72 hours) with no problem, and then last night we had the problem.

I switched to 290.10; let’s hope that helps.

I made a mistake in my previous post: the video card that has this problem is a GT220 attached to the display. By the way, I have the same machine at work with a 9500GT instead of the GT220, and it has never had any problem. Both machines have 2x GTX480 for computation.
In your case, which card is 0000:02:00.0?
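One way to answer that question on Linux is to look the address up in the output of `lspci -D`, which prints each device with its full PCI domain address. The sketch below assumes the usual "address, space, description" layout of that output; the helper names `describe_pci_device` and `lookup` are made up for illustration.

```python
import subprocess


def describe_pci_device(address, lspci_output):
    """Return the lspci description for `address`, or None if not listed."""
    for line in lspci_output.splitlines():
        # Each line is assumed to start with the address, then a space.
        slot, _, description = line.partition(" ")
        if slot.lower() == address.lower():
            return description.strip()
    return None


def lookup(address):
    """Run `lspci -D` and describe the device at `address`."""
    listing = subprocess.run(
        ["lspci", "-D"], capture_output=True, text=True, check=True
    ).stdout
    return describe_pci_device(address, listing)
```

Calling `lookup("0000:02:00.0")` on the affected machine should return the description of whatever sits in that slot, or `None` if nothing is listed there.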

My server hosts 4 x Tesla C2050, so it’s one of those. This morning I had the same problem again with the new 290.10 driver. I have changed back to 270.41.19 (the one we used before the 280 series) and I hope the driver was the problem.

Yes, unless your server has a video card …

No luck: with driver version 270.41.19 my application got stuck in a cudaMemcpy,
and htop shows a CPU core at 100%.

(gdb) bt
#0 0x00007fdb60e39197 in ioctl () from /lib/libc.so.6
#1 0x00007fdb684a50a5 in ?? () from /usr/lib/libcuda.so.1
#2 0x00007fdb68453348 in ?? () from /usr/lib/libcuda.so.1
#3 0x00007fdb6844f69f in ?? () from /usr/lib/libcuda.so.1
#4 0x00007fdb68408b14 in ?? () from /usr/lib/libcuda.so.1
#5 0x00007fdb683f41f9 in ?? () from /usr/lib/libcuda.so.1
#6 0x00007fdb683e9ed0 in ?? () from /usr/lib/libcuda.so.1
#7 0x00007fdb683c9e41 in ?? () from /usr/lib/libcuda.so.1
#8 0x00007fdb683ce508 in ?? () from /usr/lib/libcuda.so.1
#9 0x00007fdb683c04b4 in ?? () from /usr/lib/libcuda.so.1
#10 0x00007fdb61ab6e64 in ?? () from /usr/local/cuda/lib64/libcudart.so.4
#11 0x00007fdb61ada824 in cudaMemcpy () from /usr/local/cuda/lib64/libcudart.so.4

No “fallen off the bus” message this time. I’m starting to think that one of my GPU
cards is broken or not properly seated in its slot (is that possible?), and that it then
stays in a fault state until the next reboot?

Nvidia-smi gives:

$ nvidia-smi -L
Failed to initialize NVML: Unknown Error
$ nvidia-smi -q -i 0
Failed to initialize NVML: Unknown Error
$ nvidia-smi -q -i 1
Failed to initialize NVML: Unknown Error
$ nvidia-smi -q -i 2
Failed to initialize NVML: Unknown Error
$ nvidia-smi -q -i 3
Failed to initialize NVML: Unknown Error
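The same per-GPU check can be scripted rather than typed by hand. This is only a sketch: it runs the `nvidia-smi -q -i <index>` command shown above for each index and records the ones that fail. The function name `probe_gpus` and the `run` parameter (a hook for substituting a fake command runner) are made up for this example, and treating a non-zero exit status as failure is an assumption about how nvidia-smi reports errors.

```python
import subprocess


def _run_command(args):
    """Run a command and return (exit_code, combined_output)."""
    try:
        proc = subprocess.run(args, capture_output=True, text=True)
    except FileNotFoundError:
        return 127, "command not found: " + args[0]
    return proc.returncode, proc.stdout + proc.stderr


def probe_gpus(indices, run=_run_command):
    """Query each GPU index with nvidia-smi and return {index: error_text}."""
    failures = {}
    for index in indices:
        code, output = run(["nvidia-smi", "-q", "-i", str(index)])
        # Assumption: a failure shows up as a non-zero exit status and/or
        # the NVML message seen in this thread.
        if code != 0 or "Failed to initialize NVML" in output:
            failures[index] = output.strip()
    return failures
```

On a machine like the one in this thread, `probe_gpus(range(4))` would be expected to list all four indices, since NVML fails to initialize before any individual GPU is queried. That suggests the whole driver is in a bad state rather than pointing at one specific card.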