Error: Tesla K80: NVRM: GPU at 0000:05:00.0 has fallen off the bus.

Hi,
I’m using an NVIDIA K80 to run a machine learning tool (Caffe) with the CIFAR-10 example.

NVIDIA driver version 352.99
OS: Ubuntu 14.04 LTS 64-bit
CUDA 7.5

I’m getting the error below during CIFAR-10 quick training.

I0915 13:43:14.459393 5849 solver.cpp:244] Train net output #0: loss = 0.856953 (* 1 = 0.856953 loss)
I0915 13:43:14.459401 5849 sgd_solver.cpp:106] Iteration 2100, lr = 0.001
I0915 13:43:26.522061 5849 solver.cpp:228] Iteration 2200, loss = 0.744258
I0915 13:43:26.522265 5849 solver.cpp:244] Train net output #0: loss = 0.744258 (* 1 = 0.744258 loss)
I0915 13:43:26.522284 5849 sgd_solver.cpp:106] Iteration 2200, lr = 0.001
I0915 13:44:01.383584 5849 solver.cpp:228] Iteration 2300, loss = 0.808959
I0915 13:44:01.383838 5849 solver.cpp:244] Train net output #0: loss = 0.808959 (* 1 = 0.808959 loss)
I0915 13:44:01.383875 5849 sgd_solver.cpp:106] Iteration 2300, lr = 0.001
I0915 13:44:43.551292 5849 solver.cpp:228] Iteration 2400, loss = 0.776389
I0915 13:44:43.551502 5849 solver.cpp:244] Train net output #0: loss = 0.776389 (* 1 = 0.776389 loss)
I0915 13:44:43.551528 5849 sgd_solver.cpp:106] Iteration 2400, lr = 0.001
F0915 13:45:00.165660 5849 math_functions.cu:79] Check failed: error == cudaSuccess (30 vs. 0) unknown error
*** Check failure stack trace: ***
@ 0x7f2f5a316daa (unknown)
@ 0x7f2f5a316ce4 (unknown)
@ 0x7f2f5a3166e6 (unknown)
@ 0x7f2f5a319687 (unknown)
@ 0x7f2f5aac9458 caffe::caffe_gpu_memcpy()
@ 0x7f2f5aa81efe caffe::SyncedMemory::to_gpu()
@ 0x7f2f5aa813a9 caffe::SyncedMemory::gpu_data()
@ 0x7f2f5a91ab52 caffe::Blob<>::gpu_data()
@ 0x7f2f5aa9e90c caffe::BasePrefetchingDataLayer<>::Forward_gpu()
@ 0x7f2f5a92d8f5 caffe::Net<>::ForwardFromTo()
@ 0x7f2f5a92dc67 caffe::Net<>::Forward()
@ 0x7f2f5aa88b77 caffe::Solver<>::Step()
@ 0x7f2f5aa89439 caffe::Solver<>::Solve()
@ 0x40873b train()
@ 0x405b3c main
@ 0x7f2f59322f45 (unknown)
@ 0x4063ab (unknown)
@ (nil) (unknown)
Aborted (core dumped)

dmesg output:

[ 1461.865204] NVRM: GPU at 0000:05:00.0 has fallen off the bus.
[ 1461.865228] NVRM: GPU is on Board 0321215002951.
[ 1461.865241] NVRM: A GPU crash dump has been created. If possible, please run
[ 1461.865241] NVRM: nvidia-bug-report.sh as root to collect this data before
[ 1461.865241] NVRM: the NVIDIA kernel module is unloaded.
[ 1461.865256] NVRM: GPU at 0000:06:00.0 has fallen off the bus.
[ 1461.865276] NVRM: GPU is on Board 0321215002951.

Please let me know how to fix it.

Thanks,
Arup

This usually happens when the K80 is not installed in a properly qualified OEM system that is designed to support the K80.

What sort of system is this K80 installed in?

The three areas most likely to cause this are power delivery, thermal management, and signal integrity on the PCIe bus.
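
To see which devices are affected, the dmesg lines from the report can be grouped by board serial. What follows is only a minimal Python sketch: the function name `group_fallen_gpus_by_board` is made up for illustration, and it assumes each "GPU is on Board" line directly follows the "fallen off the bus" line it belongs to, as it does in the log quoted in this thread.

```python
import re
from collections import defaultdict

# dmesg lines copied verbatim from the report in this thread
DMESG = """\
[ 1461.865204] NVRM: GPU at 0000:05:00.0 has fallen off the bus.
[ 1461.865228] NVRM: GPU is on Board 0321215002951.
[ 1461.865241] NVRM: A GPU crash dump has been created. If possible, please run
[ 1461.865241] NVRM: nvidia-bug-report.sh as root to collect this data before
[ 1461.865241] NVRM: the NVIDIA kernel module is unloaded.
[ 1461.865256] NVRM: GPU at 0000:06:00.0 has fallen off the bus.
[ 1461.865276] NVRM: GPU is on Board 0321215002951.
"""

FELL_OFF = re.compile(r"NVRM: GPU at (\S+) has fallen off the bus")
ON_BOARD = re.compile(r"NVRM: GPU is on Board (\w+)\.")


def group_fallen_gpus_by_board(log_text):
    """Map board serial -> PCI addresses reported as fallen off the bus.

    Assumption: the board line follows its GPU line. A GPU line with no
    board line after it is silently dropped by this sketch.
    """
    boards = defaultdict(list)
    pending = None  # PCI address still waiting for its board line
    for line in log_text.splitlines():
        match = FELL_OFF.search(line)
        if match:
            pending = match.group(1)
            continue
        match = ON_BOARD.search(line)
        if match and pending is not None:
            boards[match.group(1)].append(pending)
            pending = None
    return dict(boards)


if __name__ == "__main__":
    for board, gpus in group_fallen_gpus_by_board(DMESG).items():
        print(board, gpus)
```

For the log above, both 0000:05:00.0 and 0000:06:00.0 end up under the same board serial. The K80 is a dual-GPU board, so this is consistent with one physical card dropping off the bus rather than two independent failures, which fits a board-level cause such as power, cooling, or the PCIe connection.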

What txbob said. A few simple things to check:

[1] GPUs without active cooling (that is, without an integrated fan) require a server-type enclosure that provides adequate airflow across the passive heat sink. Sticking a passively-cooled GPU into a regular workstation case will cause overheating in as little as 30 seconds. Luckily the GPUs have sensors and shut themselves down when that happens, so permanent damage is avoided. Placing GPUs with active cooling inside a server enclosure can also be problematic as the airflow paths (case vs GPU) likely don’t harmonize. As far as I know, all K80s are passively-cooled devices.

[2] Make sure the GPU is firmly seated in the PCIe connector to ensure both signal integrity and power delivery (a PCIe card can draw up to 75W through the slot, according to spec); if possible avoid riser cards. If there is a retention mechanism (e.g. a screw that secures a bracket on the GPU to the case) make sure that mechanism is engaged. There are various sources of vibrations in PCs and servers that can cause plug-in devices to wiggle out of their connectors over time.

[3] Check all auxiliary power connectors. Many GPUs have simple 6-pin or 8-pin PCIe power connectors (or both); I believe the K80 has a more advanced power connector, but I don’t recall what it is called. Usually the connectors have tabs designed to snap into place once the plug is inserted completely; make sure that is the case.

[4] Make sure the power supply unit (PSU) provides sufficient wattage. As a rule of thumb you would want the total rated wattage of all system components combined to be 50%-60% of the rated wattage of the PSU for optimal robustness and efficiency. Use of an 80 PLUS Platinum rated PSU is highly recommended (some recent servers may come with an 80 PLUS Titanium PSU, which is the most advanced kind of power supply currently available and a tad superior to the Platinum class). A useful overview of 80 PLUS rated PSUs can be found here: 80 Plus Overview | CLEAResult
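
The 50%-60% rule of thumb in [4] is easy to turn into arithmetic. Here is a minimal Python sketch; the function name `recommended_psu_range` and the component list are illustrative assumptions, not measurements from the system in this thread. The 300 W figure is the K80's rated board power; the other numbers are placeholders to replace with the rated wattage of your own parts.

```python
def recommended_psu_range(component_watts, low_pct=50, high_pct=60):
    """Return (total, smallest_psu, largest_psu) in watts.

    The PSU rating is chosen so that the combined rated wattage of the
    components lands between low_pct and high_pct of it.
    Integer arithmetic is used to avoid floating-point rounding.
    """
    total = sum(component_watts.values())
    if total <= 0:
        raise ValueError("total component wattage must be positive")
    # Load at most high_pct of the PSU rating -> round the rating up.
    smallest_psu = -(-total * 100 // high_pct)
    # Load at least low_pct of the PSU rating -> round the rating down.
    largest_psu = total * 100 // low_pct
    return total, smallest_psu, largest_psu


if __name__ == "__main__":
    # Hypothetical system; only the K80 figure is its rated board power.
    system = {
        "Tesla K80": 300,
        "CPUs (placeholder)": 200,
        "drives, fans, memory, board (placeholder)": 100,
    }
    total, lo, hi = recommended_psu_range(system)
    print(f"components: {total} W, PSU rating: {lo} W to {hi} W")
```

With these placeholder numbers the components add up to 600 W, so the rule of thumb points at a PSU rated between 1000 W and 1200 W.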