GPU at 0000:02:00.0 has fallen off the bus.

kalman · November 25, 2011, 11:56am

This morning I had this on my logs:

Nov 25 05:33:46 essa-s-n1 kernel: [60232.517354] NVRM: GPU at 0000:02:00.0 has fallen off the bus.

Does anyone know what this means?

I’m using CUDA 4.0 and driver 280.13.

Regards
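For anyone who wants to check their own logs for the same message, here is a minimal Python sketch. It assumes the NVRM line always has the format quoted above; the regular expression and the name `find_fallen_gpus` are illustrative, not part of any NVIDIA tool.

```python
import re

# Pattern derived from the single log line quoted in this post (an assumption:
# other driver versions may word the message differently).
FALLEN_RE = re.compile(
    r"NVRM: GPU at "
    r"(?P<pci>[0-9a-fA-F]{4,8}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]) "
    r"has fallen off the bus"
)


def find_fallen_gpus(lines):
    """Return a list of (pci_address, log_line) for every matching line."""
    hits = []
    for line in lines:
        match = FALLEN_RE.search(line)
        if match:
            hits.append((match.group("pci"), line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    sample = [
        "Nov 25 05:33:46 essa-s-n1 kernel: [60232.517354] "
        "NVRM: GPU at 0000:02:00.0 has fallen off the bus.",
    ]
    for pci, line in find_fallen_gpus(sample):
        print(pci)
```

To scan a real log, pass an open file object instead of `sample`; the log path depends on the distribution.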

alexish · November 25, 2011, 4:02pm

I have exactly the same problem.
Randomly, the video card I use for display (9500GT, Ubuntu 10.04, CUDA 4.1 and driver 285.05.15) crashes with this error. It has nothing to do with CUDA (I have other cards dedicated to computation) and it occurs even when I am not running any CUDA programs.
In my case the error shows up with driver 285.05.15.

kalman · November 25, 2011, 6:16pm

The behaviour is strange: we ran driver 280.10 over the whole weekend (a CUDA program running for 72 hours) with no problem, and then last night the problem appeared.

I have switched to 290.10; let’s hope that helps.

alexish · November 26, 2011, 6:23pm

I made a mistake in my previous post: the video card that has this problem is a GT220 attached to the display. By the way, I have the same machine at work with a GT9500 instead of the GT220, and it has never had any problem. Both machines have 2 x GTX480 for computation.
In your case, which card is 0000:02:00.0?

kalman · November 28, 2011, 10:53am

My server hosts 4 x Tesla C2050, so it’s one of those. This morning the same problem occurred again with the new 290.10 drivers. I have changed back to 270.41.19 (the one we used before the 280 series) and I hope the driver was the problem.

alexish · November 28, 2011, 1:42pm

Yes, unless your server has a video card …

kalman · November 28, 2011, 5:10pm

No luck: with driver version 270.41.19 my application got stuck in a cudaMemcpy,
and htop shows one CPU core at 100%.

(gdb) bt
#0 0x00007fdb60e39197 in ioctl () from /lib/libc.so.6
#1 0x00007fdb684a50a5 in ?? () from /usr/lib/libcuda.so.1
#2 0x00007fdb68453348 in ?? () from /usr/lib/libcuda.so.1
#3 0x00007fdb6844f69f in ?? () from /usr/lib/libcuda.so.1
#4 0x00007fdb68408b14 in ?? () from /usr/lib/libcuda.so.1
#5 0x00007fdb683f41f9 in ?? () from /usr/lib/libcuda.so.1
#6 0x00007fdb683e9ed0 in ?? () from /usr/lib/libcuda.so.1
#7 0x00007fdb683c9e41 in ?? () from /usr/lib/libcuda.so.1
#8 0x00007fdb683ce508 in ?? () from /usr/lib/libcuda.so.1
#9 0x00007fdb683c04b4 in ?? () from /usr/lib/libcuda.so.1
#10 0x00007fdb61ab6e64 in ?? () from /usr/local/cuda/lib64/libcudart.so.4
#11 0x00007fdb61ada824 in cudaMemcpy () from /usr/local/cuda/lib64/libcudart.so.4

No “fallen off the bus” message this time. I’m starting to think that one of my GPU
cards is broken or not properly seated in its slot (can that be?) and then stays in a
fault state until the next reboot?

nvidia-smi gives:

$ nvidia-smi -L
Failed to initialize NVML: Unknown Error
$ nvidia-smi -q -i 0
Failed to initialize NVML: Unknown Error
$ nvidia-smi -q -i 1
Failed to initialize NVML: Unknown Error
$ nvidia-smi -q -i 2
Failed to initialize NVML: Unknown Error
$ nvidia-smi -q -i 3
Failed to initialize NVML: Unknown Error
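
Since the driver can also hang outright (as in the cudaMemcpy backtrace above), a health check like the one above may be safer with a timeout. The following is only a sketch: `probe` is a hypothetical helper, and it assumes that a failed `nvidia-smi -L` either exits non-zero or prints the "Failed to initialize NVML" text shown above.

```python
import subprocess


def probe(cmd=("nvidia-smi", "-L"), timeout=10.0):
    """Run cmd and return (ok, text).

    ok is False if the command is missing, times out, exits non-zero,
    or prints the NVML initialization failure quoted in this thread.
    """
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError:
        return False, "command not found: " + cmd[0]
    except subprocess.TimeoutExpired:
        return False, "timed out after %.1f s" % timeout
    text = (result.stdout + result.stderr).strip()
    ok = result.returncode == 0 and "Failed to initialize NVML" not in text
    return ok, text


if __name__ == "__main__":
    ok, text = probe()
    print("OK" if ok else "FAILED")
    print(text)
```

One caveat: if the child process is stuck in an uninterruptible driver call, killing it on timeout may itself block, so this is a convenience and not a guarantee.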
