Digits training error

Hi,

I have been following the jetson-inference guide, but training fails right at the start in the DetectNet-COCO-Dog model example. The error is:

ERROR: error code -11
Setting up coverage_loss
Top shape: (1)
with loss weight 1
Memory required for data: 540387272
Creating layer cluster

I believe DIGITS is installed correctly; I just ran some tests, and NVCaffe and cuDNN are both working. I installed OpenCV and CUDA from the newest JetPack installer and they seem to be working fine. My PC has an i5 8600K, 32 GB RAM and an SSD. Here’s my nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   44C    P8     9W / 151W |    700MiB /  8110MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1004      G   /usr/lib/xorg/Xorg                           498MiB |
|    0      1784      G   compiz                                       108MiB |
|    0      2940      C   python2                                       89MiB |
+-----------------------------------------------------------------------------+

Any help is greatly appreciated!

David Huang

Hi,

We have seen a report of error code 11 before.
Please check this link for information: https://github.com/NVIDIA/DIGITS/issues/1239

If that fix doesn’t help, could you share your Caffe log with us?

Thanks.

I tried the method in the link and installed everything, but still no luck :(

The DIGITS output in my terminal looks like this (I think the two IOErrors are related to old jobs):

Tensorflow support disabled.
2018-03-26 08:23:31 [INFO ] Loaded 3 jobs.
2018-03-26 08:23:31 [WARNING] Failed to load 2 jobs.
2018-03-26 08:23:31 [DEBUG] 20180325-191919-ec85 - IOError: [Errno 2] No such file or directory: '/home/shuo/digits/digits/jobs/20180325-191919-ec85/status.pickle'
2018-03-26 08:23:31 [DEBUG] 20180326-073040-332a - IOError: [Errno 2] No such file or directory: '/home/shuo/digits/digits/jobs/20180326-073040-332a/status.pickle'
2018-03-26 08:24:03 [20180326-082402-b889] [WARNING] Ignoring data_param.source ...
2018-03-26 08:24:03 [20180326-082402-b889] [WARNING] Ignoring data_param.backend ...
2018-03-26 08:24:03 [20180326-082402-b889] [WARNING] Ignoring data_param.source ...
2018-03-26 08:24:03 [20180326-082402-b889] [WARNING] Ignoring data_param.backend ...
2018-03-26 08:24:03 [20180326-082402-b889] [WARNING] Ignoring data_param.source ...
2018-03-26 08:24:03 [20180326-082402-b889] [WARNING] Ignoring data_param.backend ...
2018-03-26 08:24:03 [20180326-082402-b889] [WARNING] Ignoring data_param.source ...
2018-03-26 08:24:03 [20180326-082402-b889] [WARNING] Ignoring data_param.backend ...
2018-03-26 08:24:03 [20180326-082402-b889] [DEBUG] Network sanity check - train
2018-03-26 08:24:03 [20180326-082402-b889] [DEBUG] Network sanity check - val
2018-03-26 08:24:03 [20180326-082402-b889] [DEBUG] Network sanity check - deploy
2018-03-26 08:24:03 [20180326-082402-b889] [INFO ] Train Caffe Model task started.
2018-03-26 08:24:03 [20180326-082402-b889] [INFO ] Task subprocess args: "/home/shuo/caffe/build/tools/caffe train --solver=/home/shuo/digits/digits/jobs/20180326-082402-b889/solver.prototxt --gpu=0 --weights=/home/shuo/bvlc_googlenet.caffemodel"
2018-03-26 08:24:05 [20180326-082402-b889] [ERROR] Train Caffe Model task failed with error code -11
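
A note on the two IOError lines above: they only say that two job folders have no status.pickle, so DIGITS could not load them at startup. Here is a minimal sketch for listing such folders. The helper name jobs_missing_status is made up for illustration, and the jobs path is just the one that appears in the log.

```python
import os

def jobs_missing_status(jobs_dir):
    """Return the names of job folders that contain no status.pickle file."""
    missing = []
    for name in sorted(os.listdir(jobs_dir)):
        job_path = os.path.join(jobs_dir, name)
        # Only directories count as jobs; stray files in jobs_dir are ignored.
        if os.path.isdir(job_path) and not os.path.isfile(
            os.path.join(job_path, "status.pickle")
        ):
            missing.append(name)
    return missing
```

Calling jobs_missing_status("/home/shuo/digits/digits/jobs") on this machine would presumably list the two job IDs from the warnings. The sketch only reports them; whether to delete those folders is up to you.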

And the end of my caffe log:

I0326 07:47:01.789369 14042 net.cpp:144] Setting up coverage_loss
I0326 07:47:01.789372 14042 net.cpp:151] Top shape: (1)
I0326 07:47:01.789374 14042 net.cpp:154]     with loss weight 1
I0326 07:47:01.789376 14042 net.cpp:159] Memory required for data: 1089644936
I0326 07:47:01.789378 14042 layer_factory.hpp:77] Creating layer cluster
*** Aborted at 1522064822 (unix time) try "date -d @1522064822" if you are using GNU date ***
PC: @     0x7ff41ce24873 std::_Hashtable<>::clear()
*** SIGSEGV (@0x9) received by PID 14042 (TID 0x7ff444e4f740) from PID 9; stack trace: ***
@     0x7ff442a524b0 (unknown)
@     0x7ff41ce24873 std::_Hashtable<>::clear()
@     0x7ff41ce16346 google::protobuf::DescriptorPool::FindFileByName()
@     0x7ff41cdf4ac8 google::protobuf::python::cdescriptor_pool::AddSerializedFile()
@     0x7ff44367d9f0 PyEval_EvalFrameEx
@     0x7ff4437b305c PyEval_EvalCodeEx
@     0x7ff44370946d (unknown)
@     0x7ff4436dc273 PyObject_Call
@     0x7ff4436fcb75 (unknown)
@     0x7ff443693173 (unknown)
@     0x7ff4436dc273 PyObject_Call
@     0x7ff44367a35c PyEval_EvalFrameEx
@     0x7ff4437b305c PyEval_EvalCodeEx
@     0x7ff443674da9 PyEval_EvalCode
@     0x7ff443716244 PyImport_ExecCodeModuleEx
@     0x7ff443716c1f (unknown)
@     0x7ff443718390 (unknown)
@     0x7ff443718658 (unknown)
@     0x7ff44371976b PyImport_ImportModuleLevel
@     0x7ff4436838b8 (unknown)
@     0x7ff4436dc273 PyObject_Call
@     0x7ff4437b2487 PyEval_CallObjectWithKeywords
@     0x7ff4436787e6 PyEval_EvalFrameEx
@     0x7ff4437b305c PyEval_EvalCodeEx
@     0x7ff443674da9 PyEval_EvalCode
@     0x7ff443716244 PyImport_ExecCodeModuleEx
@     0x7ff443716c1f (unknown)
@     0x7ff443718390 (unknown)
@     0x7ff443718658 (unknown)
@     0x7ff44371976b PyImport_ImportModuleLevel
@     0x7ff4436838b8 (unknown)
@     0x7ff4436dc273 PyObject_Call

Thanks!
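
Side note on the number itself: a negative code like -11 is how Python's subprocess module reports that a child process was killed by signal 11, which matches the SIGSEGV in the Caffe log. This assumes DIGITS simply passes through the return code of the caffe subprocess. A small sketch (describe_exit_code is a hypothetical helper, not part of DIGITS):

```python
import signal

def describe_exit_code(code):
    """Translate a subprocess return code into a readable description."""
    if code < 0:
        # On POSIX, subprocess reports "terminated by signal N" as -N.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return "killed by signal %d (%s)" % (-code, name)
    if code == 0:
        return "exited normally"
    return "exited with status %d" % code

print(describe_exit_code(-11))  # killed by signal 11 (SIGSEGV)
print(describe_exit_code(1))    # exited with status 1
```

So -11 means the process crashed, while a positive code means it exited on its own with an error status.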

Hi,

Looks like this issue is from Caffe:
https://github.com/BVLC/caffe/issues/5357

Try this command:

pip install --user --upgrade protobuf==3.1.0.post1

Thanks! That solution worked, but now I’m getting a different error during training, always at around 13%…

[ERROR] Train Caffe Model task failed with error code 1

Log file:

OpenCV Error: The function/feature is not implemented (Unknown/unsupported array type) in type, file /home/om/test/opencv/opencv/modules/core/src/matrix.cpp, line 2034
OpenCV Error: The function/feature is not implemented (Unknown/unsupported array type) in type, file /home/om/test/opencv/opencv/modules/core/src/matrix.cpp, line 2034
Traceback (most recent call last):
File "/home/shuo/caffe/python/caffe/layers/detectnet/clustering.py", line 133, in forward
bbox = cluster(self, data0, bottom[1].data)
File "/home/shuo/caffe/python/caffe/layers/detectnet/clustering.py", line 227, in cluster
boxes_cur_image = vote_boxes(propose_boxes, propose_cvgs, mask, self)
File "/home/shuo/caffe/python/caffe/layers/detectnet/clustering.py", line 193, in vote_boxes
self.gridbox_rect_eps)
cv2.error: /home/om/test/opencv/opencv/modules/core/src/matrix.cpp:2034: error: (-213) Unknown/unsupported array type in function type

It looks like an OpenCV problem. I searched on Google but didn’t find any solutions. I installed OpenCV 3 with the JetPack installer, and Python can import cv2 successfully, so I’m not sure where the problem is.

Thanks again for all your help!

Tried a lot of things but still stuck on this error :(

Hi,

I assume you are running DIGITS on an x86 Linux host.
If so, you need to install OpenCV yourself, since the package included in JetPack is built for aarch64 systems.

If you have properly installed OpenCV for the host, please try this command:

sudo pip install opencv-python

After that, please rebuild the NvCaffe library.
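
To see which OpenCV build Python is actually picking up, something like the following may help. It is only a sketch: opencv_report is a made-up helper name, and it just prints the host architecture next to the version and location of the cv2 module that gets imported, if any.

```python
import platform

def opencv_report():
    """Collect the host architecture and details of the importable cv2 module."""
    info = {"machine": platform.machine(), "cv2_version": None, "cv2_path": None}
    try:
        import cv2
    except ImportError:
        # cv2 is not installed for this interpreter; leave the fields as None.
        return info
    info["cv2_version"] = cv2.__version__
    info["cv2_path"] = getattr(cv2, "__file__", None)
    return info

print(opencv_report())
```

On an x86 host the machine field should read x86_64, and the cv2 path shows whether the module comes from the pip package or from some other install.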

Thanks.

Thanks. I have the same problem with DetectNet-COCO-Dog. I am trying your solution as we speak.

Still bombs out with Error 11

I am using protobuf 3.5

My caffe output file
/home/frank/Pictures/Screenshot from 2018-03-29 19-49-21.png

frank@UBUNTU-DT:~/DIGITS$ protoc --version
libprotoc 3.5.1

What worked for me was deleting the protobuf folder completely and then getting version 3.1 with this command:

git clone https://github.com/google/protobuf.git $PROTOBUF_ROOT -b '3.1.x'

After that I repeated the protobuf installation steps, and error 11 was gone.

This finally fixed all my DIGITS problems :D
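
A quick way to compare the two protobuf versions mentioned in this thread is sketched below. It rests on an assumption that is consistent with the fixes here but not stated outright: that the crash comes from the system protobuf (what protoc reports) and the Python protobuf package being different releases. The helper names major_minor and same_release are made up for illustration.

```python
import re

def major_minor(version_text):
    """Extract (major, minor) from text like 'libprotoc 3.5.1' or '3.1.0.post1'."""
    match = re.search(r"(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("no version number in %r" % version_text)
    return int(match.group(1)), int(match.group(2))

def same_release(protoc_output, python_pkg_version):
    """True when both version strings share the same major.minor release."""
    return major_minor(protoc_output) == major_minor(python_pkg_version)

# Version strings taken from the posts above.
print(same_release("libprotoc 3.5.1", "3.1.0.post1"))  # False
print(same_release("libprotoc 3.1.0", "3.1.0.post1"))  # True
```

The first argument is the output of protoc --version; the second can be read in Python from google.protobuf.__version__.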

Thanks so much for all your help! ^^

Good to know this : )

Hi AastaLLL, I encountered the same error. I don’t know what’s wrong with my DIGITS system. Could you please help me solve the problem? Thanks.

Tensorflow support disabled.
2019-09-26 19:08:03 [INFO ] Loaded 0 jobs.
2019-09-26 19:08:03 [WARNING] Failed to load 1 jobs.
2019-09-26 19:08:03 [DEBUG] 20190926-164454-2cfb - IOError: [Errno 2] No such file or directory: '/home/zhaoyu/DIGITS/digits/jobs/20190926-164454-2cfb/status.pickle'
2019-09-26 19:09:44 [20190926-190943-36cc] [INFO ] Parse Folder (train/val) task started.
2019-09-26 19:09:44 [20190926-190943-36cc] [INFO ] Task subprocess args: "/usr/bin/python2 /home/zhaoyu/DIGITS/digits/tools/parse_folder.py /home/zhaoyu/mnist/train /home/zhaoyu/DIGITS/digits/jobs/20190926-190943-36cc/labels.txt --min=2 --train_file=/home/zhaoyu/DIGITS/digits/jobs/20190926-190943-36cc/train.txt --val_file=/home/zhaoyu/DIGITS/digits/jobs/20190926-190943-36cc/val.txt --percent_val=25.0"
2019-09-26 19:09:45 [20190926-190943-36cc] [WARNING] Parse Folder (train/val) unrecognized output: Tensorflow support disabled.
*** Error in `/usr/bin/python2': free(): invalid pointer: 0x0000000001b45940 ***
======= Backtrace: =========

Let’s follow up on this issue directly in topic 1063876:
https://devtalk.nvidia.com/default/topic/1063876/jetson-agx-xavier/digits-system-setup-installing-nvcaffe-on-the-host-problem/

Thanks