Hi,
I’m using an NVIDIA K80 to run a machine learning tool (Caffe), with the CIFAR-10 example.
NVIDIA driver version 352.99
OS: Ubuntu 14.04 LTS 64-bit
CUDA 7.5
I’m getting the error below during the CIFAR-10 quick training run.
I0915 13:43:14.459393 5849 solver.cpp:244] Train net output #0: loss = 0.856953 (* 1 = 0.856953 loss)
I0915 13:43:14.459401 5849 sgd_solver.cpp:106] Iteration 2100, lr = 0.001
I0915 13:43:26.522061 5849 solver.cpp:228] Iteration 2200, loss = 0.744258
I0915 13:43:26.522265 5849 solver.cpp:244] Train net output #0: loss = 0.744258 (* 1 = 0.744258 loss)
I0915 13:43:26.522284 5849 sgd_solver.cpp:106] Iteration 2200, lr = 0.001
I0915 13:44:01.383584 5849 solver.cpp:228] Iteration 2300, loss = 0.808959
I0915 13:44:01.383838 5849 solver.cpp:244] Train net output #0: loss = 0.808959 (* 1 = 0.808959 loss)
I0915 13:44:01.383875 5849 sgd_solver.cpp:106] Iteration 2300, lr = 0.001
I0915 13:44:43.551292 5849 solver.cpp:228] Iteration 2400, loss = 0.776389
I0915 13:44:43.551502 5849 solver.cpp:244] Train net output #0: loss = 0.776389 (* 1 = 0.776389 loss)
I0915 13:44:43.551528 5849 sgd_solver.cpp:106] Iteration 2400, lr = 0.001
F0915 13:45:00.165660 5849 math_functions.cu:79] Check failed: error == cudaSuccess (30 vs. 0) unknown error
*** Check failure stack trace: ***
@ 0x7f2f5a316daa (unknown)
@ 0x7f2f5a316ce4 (unknown)
@ 0x7f2f5a3166e6 (unknown)
@ 0x7f2f5a319687 (unknown)
@ 0x7f2f5aac9458 caffe::caffe_gpu_memcpy()
@ 0x7f2f5aa81efe caffe::SyncedMemory::to_gpu()
@ 0x7f2f5aa813a9 caffe::SyncedMemory::gpu_data()
@ 0x7f2f5a91ab52 caffe::Blob<>::gpu_data()
@ 0x7f2f5aa9e90c caffe::BasePrefetchingDataLayer<>::Forward_gpu()
@ 0x7f2f5a92d8f5 caffe::Net<>::ForwardFromTo()
@ 0x7f2f5a92dc67 caffe::Net<>::Forward()
@ 0x7f2f5aa88b77 caffe::Solver<>::Step()
@ 0x7f2f5aa89439 caffe::Solver<>::Solve()
@ 0x40873b train()
@ 0x405b3c main
@ 0x7f2f59322f45 (unknown)
@ 0x4063ab (unknown)
@ (nil) (unknown)
Aborted (core dumped)
dmesg output:
[ 1461.865204] NVRM: GPU at 0000:05:00.0 has fallen off the bus.
[ 1461.865228] NVRM: GPU is on Board 0321215002951.
[ 1461.865241] NVRM: A GPU crash dump has been created. If possible, please run
[ 1461.865241] NVRM: nvidia-bug-report.sh as root to collect this data before
[ 1461.865241] NVRM: the NVIDIA kernel module is unloaded.
[ 1461.865256] NVRM: GPU at 0000:06:00.0 has fallen off the bus.
[ 1461.865276] NVRM: GPU is on Board 0321215002951.
Please let me know how to fix it.
Thanks,
Arup