DIGITS 5 networks that used to train now fail with Error -11

I have a DIGITS 5 install that I’ve been using to train several networks. After not using it for months, networks that used to work no longer do. No hardware has changed, but the normal Ubuntu and NVIDIA software updates have been applied in that time. I do not know how to fix this. I’ve uninstalled and reinstalled DIGITS and CUDA, but I still have the same problem. Most networks that show a green Done status from previous training runs no longer train; only the smallest ones do. Every network that fails does so with Error -11. I don’t see how I could be out of memory now. I know those networks ran on this hardware before: DIGITS shows they completed in January, but now in June they fail. Below is the tail of the output from my caffe.log for one of the failed networks. Any help is greatly appreciated.

I0608 12:26:23.401456 5603 solver.cpp:304] Solving
I0608 12:26:23.401460 5603 solver.cpp:305] Learning Rate Policy: exp
I0608 12:26:23.405985 5603 solver.cpp:362] Iteration 0, Testing net (#0)
I0608 12:26:23.406023 5603 net.cpp:723] Ignoring source layer train_data
I0608 12:26:23.406028 5603 net.cpp:723] Ignoring source layer train_label
I0608 12:26:23.406031 5603 net.cpp:723] Ignoring source layer train_transform
*** Aborted at 1528478783 (unix time) try "date -d @1528478783" if you are using GNU date ***
PC: @ 0x7f000bcad865 cv::Mat::copyTo()
*** SIGSEGV (@0x0) received by PID 5603 (TID 0x7f0021cf6b00) from PID 0; stack trace: ***
@ 0x7f001f5d04b0 (unknown)
@ 0x7f000bcad865 cv::Mat::copyTo()
@ 0x7efe2ad234c5 pyopencv_from<>()
@ 0x7efe2ae72d2a pyopencv_cv_groupRectangles()
@ 0x7f002020d971 PyEval_EvalFrameEx
@ 0x7f002020c044 PyEval_EvalFrameEx
@ 0x7f002020c044 PyEval_EvalFrameEx
@ 0x7f002034305c PyEval_EvalCodeEx
@ 0x7f0020299370 (unknown)
@ 0x7f002026c273 PyObject_Call
@ 0x7f00202e03ac (unknown)
@ 0x7f002026c273 PyObject_Call
@ 0x7f0020342487 PyEval_CallObjectWithKeywords
@ 0x7f00202a0fa7 PyEval_CallFunction
@ 0x7eff5e0dd9cc caffe::Layer<>::Forward_gpu()
@ 0x7f002124fe42 caffe::Net<>::ForwardFromTo()
@ 0x7f002124ff67 caffe::Net<>::Forward()
@ 0x7f0021243c7a caffe::Solver<>::Test()
@ 0x7f00212447ce caffe::Solver<>::TestAll()
@ 0x7f00212475e9 caffe::Solver<>::Step()
@ 0x7f0021248339 caffe::Solver<>::Solve()
@ 0x40c6d7 train()
@ 0x408668 main
@ 0x7f001f5bb830 __libc_start_main
@ 0x408dd9 _start
@ 0x0 (unknown)

Somehow python-opencv was missing. But I cannot fathom why some networks worked and others did not. The only differences between them were the number of images going into the network, the pretrained network to start from, and image vs. pixel mean subtraction. Old small networks would still train, but large ones would not while python-opencv was missing…
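
For anyone hitting the same crash: the stack trace above dies inside pyopencv_cv_groupRectangles, called from Python during a caffe Forward pass, so it seems that only networks which reach that Python/OpenCV call would be affected. A minimal sketch of a sanity check, run with the same Python interpreter that DIGITS/caffe uses, is below. The module list is my assumption (cv2 is the module that python-opencv provides; numpy and caffe are just likely companions), and a successful import only shows the module can be loaded, not that every call in it works.

```python
import importlib


def check_modules(names):
    """Return {module_name: True/False} depending on whether each module imports."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            # The module is not installed (or not visible to this interpreter).
            status[name] = False
    return status


if __name__ == "__main__":
    # Assumed list of modules worth checking; adjust it for your own setup.
    for name, ok in sorted(check_modules(["cv2", "numpy", "caffe"]).items()):
        print("%-6s %s" % (name, "OK" if ok else "MISSING"))
```

If cv2 shows up as MISSING, reinstalling the python-opencv package (as I ended up doing) is the first thing to try.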

A video walkthrough of natively installing NVIDIA DIGITS on Ubuntu 18.04 LTS is available here:


-Cuda Education