Digits training error

uwdaveh · March 26, 2018, 12:50am

Hi,

I have been following the jetson-inference guide, but training fails right at the start in the DetectNet-COCO-Dog model example. The error is:

ERROR: error code -11
Setting up coverage_loss
Top shape: (1)
with loss weight 1
Memory required for data: 540387272
Creating layer cluster

I installed DIGITS following the guide, and I just ran some tests that show NVCaffe and cuDNN are working. I installed OpenCV and CUDA from the newest JetPack installer, and they seem to be working fine. My PC has an i5-8600K, 32 GB of RAM, and an SSD. Here’s my nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   44C    P8     9W / 151W |    700MiB /  8110MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1004      G   /usr/lib/xorg/Xorg                           498MiB |
|    0      1784      G   compiz                                       108MiB |
|    0      2940      C   python2                                       89MiB |
+-----------------------------------------------------------------------------+

Any help is greatly appreciated!

David Huang

AastaLLL · March 26, 2018, 6:26am

Hi,

We have seen a report of error code 11 before.
Please check this link for information: https://github.com/NVIDIA/DIGITS/issues/1239

If the fix there does not help, could you share your Caffe log with us?

Thanks.

uwdaveh · March 26, 2018, 12:45pm

I tried the method in the link and installed everything, but still no luck :(

The DIGITS output in my terminal looks like this (I think the two IOErrors are related to old jobs):

Tensorflow support disabled.
2018-03-26 08:23:31 [INFO ] Loaded 3 jobs.
2018-03-26 08:23:31 [WARNING] Failed to load 2 jobs.
2018-03-26 08:23:31 [DEBUG] 20180325-191919-ec85 - IOError: [Errno 2] No such file or directory: '/home/shuo/digits/digits/jobs/20180325-191919-ec85/status.pickle'
2018-03-26 08:23:31 [DEBUG] 20180326-073040-332a - IOError: [Errno 2] No such file or directory: '/home/shuo/digits/digits/jobs/20180326-073040-332a/status.pickle'
2018-03-26 08:24:03 [20180326-082402-b889] [WARNING] Ignoring data_param.source ...
2018-03-26 08:24:03 [20180326-082402-b889] [WARNING] Ignoring data_param.backend ...
2018-03-26 08:24:03 [20180326-082402-b889] [WARNING] Ignoring data_param.source ...
2018-03-26 08:24:03 [20180326-082402-b889] [WARNING] Ignoring data_param.backend ...
2018-03-26 08:24:03 [20180326-082402-b889] [WARNING] Ignoring data_param.source ...
2018-03-26 08:24:03 [20180326-082402-b889] [WARNING] Ignoring data_param.backend ...
2018-03-26 08:24:03 [20180326-082402-b889] [WARNING] Ignoring data_param.source ...
2018-03-26 08:24:03 [20180326-082402-b889] [WARNING] Ignoring data_param.backend ...
2018-03-26 08:24:03 [20180326-082402-b889] [DEBUG] Network sanity check - train
2018-03-26 08:24:03 [20180326-082402-b889] [DEBUG] Network sanity check - val
2018-03-26 08:24:03 [20180326-082402-b889] [DEBUG] Network sanity check - deploy
2018-03-26 08:24:03 [20180326-082402-b889] [INFO ] Train Caffe Model task started.
2018-03-26 08:24:03 [20180326-082402-b889] [INFO ] Task subprocess args: "/home/shuo/caffe/build/tools/caffe train --solver=/home/shuo/digits/digits/jobs/20180326-082402-b889/solver.prototxt --gpu=0 --weights=/home/shuo/bvlc_googlenet.caffemodel"
2018-03-26 08:24:05 [20180326-082402-b889] [ERROR] Train Caffe Model task failed with error code -11

And the end of my caffe log:

I0326 07:47:01.789369 14042 net.cpp:144] Setting up coverage_loss
I0326 07:47:01.789372 14042 net.cpp:151] Top shape: (1)
I0326 07:47:01.789374 14042 net.cpp:154]     with loss weight 1
I0326 07:47:01.789376 14042 net.cpp:159] Memory required for data: 1089644936
I0326 07:47:01.789378 14042 layer_factory.hpp:77] Creating layer cluster
*** Aborted at 1522064822 (unix time) try "date -d @1522064822" if you are using GNU date ***
PC: @     0x7ff41ce24873 std::_Hashtable<>::clear()
*** SIGSEGV (@0x9) received by PID 14042 (TID 0x7ff444e4f740) from PID 9; stack trace: ***
@     0x7ff442a524b0 (unknown)
@     0x7ff41ce24873 std::_Hashtable<>::clear()
@     0x7ff41ce16346 google::protobuf::DescriptorPool::FindFileByName()
@     0x7ff41cdf4ac8 google::protobuf::python::cdescriptor_pool::AddSerializedFile()
@     0x7ff44367d9f0 PyEval_EvalFrameEx
@     0x7ff4437b305c PyEval_EvalCodeEx
@     0x7ff44370946d (unknown)
@     0x7ff4436dc273 PyObject_Call
@     0x7ff4436fcb75 (unknown)
@     0x7ff443693173 (unknown)
@     0x7ff4436dc273 PyObject_Call
@     0x7ff44367a35c PyEval_EvalFrameEx
@     0x7ff4437b305c PyEval_EvalCodeEx
@     0x7ff443674da9 PyEval_EvalCode
@     0x7ff443716244 PyImport_ExecCodeModuleEx
@     0x7ff443716c1f (unknown)
@     0x7ff443718390 (unknown)
@     0x7ff443718658 (unknown)
@     0x7ff44371976b PyImport_ImportModuleLevel
@     0x7ff4436838b8 (unknown)
@     0x7ff4436dc273 PyObject_Call
@     0x7ff4437b2487 PyEval_CallObjectWithKeywords
@     0x7ff4436787e6 PyEval_EvalFrameEx
@     0x7ff4437b305c PyEval_EvalCodeEx
@     0x7ff443674da9 PyEval_EvalCode
@     0x7ff443716244 PyImport_ExecCodeModuleEx
@     0x7ff443716c1f (unknown)
@     0x7ff443718390 (unknown)
@     0x7ff443718658 (unknown)
@     0x7ff44371976b PyImport_ImportModuleLevel
@     0x7ff4436838b8 (unknown)
@     0x7ff4436dc273 PyObject_Call

Thanks!
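
A side note on the two logs above: the -11 reported by DIGITS and the SIGSEGV in the Caffe log are most likely the same event. This assumes DIGITS passes through the child's return code the way Python's subprocess module does, where a negative value -N means the process was killed by signal N. A minimal sketch of that decoding follows; `describe_exit_code` is a hypothetical helper for illustration, not part of DIGITS or Caffe.

```python
import signal

# Build a number -> name table from whatever signals this platform defines.
SIGNAL_NAMES = {}
for attr in dir(signal):
    if attr.startswith("SIG") and not attr.startswith("SIG_"):
        value = getattr(signal, attr)
        if isinstance(value, int):
            SIGNAL_NAMES[int(value)] = attr


def describe_exit_code(code):
    """Explain a subprocess-style return code (hypothetical helper)."""
    if code < 0:
        # subprocess reports "killed by signal N" as the return code -N.
        name = SIGNAL_NAMES.get(-code, "unknown signal")
        return "killed by signal {} ({})".format(-code, name)
    if code == 0:
        return "exited normally"
    return "exited with status {}".format(code)


print(describe_exit_code(-11))
print(describe_exit_code(1))
```

On Linux the first call reports SIGSEGV, which matches the segmentation fault in the Caffe stack trace. A positive code such as 1 is an ordinary exit status set by the program itself.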

AastaLLL · March 27, 2018, 7:49am

Hi,

Looks like this issue is from Caffe:
https://github.com/BVLC/caffe/issues/5357

Try this command:

pip install --user --upgrade protobuf==3.1.0.post1
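
Since `--user` installs can be shadowed by another copy on the path, it may be worth confirming which protobuf the interpreter actually picks up after running the command. The sketch below is only a sanity check under that assumption; `in_series` and `PINNED_SERIES` are names made up for this example, and it should be run with the same interpreter that Caffe's Python layers use.

```python
PINNED_SERIES = "3.1"  # taken from the pip command above


def in_series(version, series=PINNED_SERIES):
    """True if `version` belongs to the given major.minor series."""
    return version == series or version.startswith(series + ".")


try:
    import google.protobuf as protobuf
except ImportError:
    protobuf = None

if protobuf is None:
    print("protobuf is not importable from this interpreter")
else:
    print("protobuf {} from {}".format(protobuf.__version__, protobuf.__file__))
    if not in_series(protobuf.__version__):
        print("this is not a {}.x release".format(PINNED_SERIES))
```

If the printed path is not where pip just installed the package, an older or newer copy is probably taking precedence.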

uwdaveh · March 28, 2018, 2:07am

Thanks! That solution worked, but I’m getting errors during training too, always at around 13%…

[ERROR] Train Caffe Model task failed with error code 1

Log file:

OpenCV Error: The function/feature is not implemented (Unknown/unsupported array type) in type, file /home/om/test/opencv/opencv/modules/core/src/matrix.cpp, line 2034
OpenCV Error: The function/feature is not implemented (Unknown/unsupported array type) in type, file /home/om/test/opencv/opencv/modules/core/src/matrix.cpp, line 2034
Traceback (most recent call last):
File "/home/shuo/caffe/python/caffe/layers/detectnet/clustering.py", line 133, in forward
bbox = cluster(self, data0, bottom[1].data)
File "/home/shuo/caffe/python/caffe/layers/detectnet/clustering.py", line 227, in cluster
boxes_cur_image = vote_boxes(propose_boxes, propose_cvgs, mask, self)
File "/home/shuo/caffe/python/caffe/layers/detectnet/clustering.py", line 193, in vote_boxes
self.gridbox_rect_eps)
cv2.error: /home/om/test/opencv/opencv/modules/core/src/matrix.cpp:2034: error: (-213) Unknown/unsupported array type in function type

It looks like an OpenCV problem. I searched on Google but didn’t find any solutions. I installed OpenCV 3 with the JetPack installer, and Python can import cv2 successfully, so I’m not sure where the problem is.

Thanks again for all your help!

uwdaveh · March 28, 2018, 8:07pm

Tried a lot of things but still stuck on this error :(

AastaLLL · March 29, 2018, 7:00am

Hi,

I assume you are running DIGITS on an x86 Linux host.
You need to install OpenCV yourself, since the package included in JetPack is built for aarch64 systems.

If OpenCV is already installed correctly on the host, please try this command:

sudo pip install opencv-python

After that, please rebuild the NvCaffe library.

Thanks.
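
Before and after the reinstall, it can help to see which cv2 binary Python is really loading. The source path in the OpenCV error earlier in the thread (/home/om/test/opencv/...) does not belong to the poster's account, which suggests a prebuilt binary from somewhere else. A minimal sketch, assuming it is run with the same interpreter DIGITS and Caffe use; `describe_module` is a hypothetical helper.

```python
import importlib


def describe_module(name):
    """Return file and version of an importable module, or None if missing."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return {
        "file": getattr(module, "__file__", None),
        "version": getattr(module, "__version__", None),
    }


info = describe_module("cv2")
if info is None:
    print("cv2 is not importable from this interpreter")
else:
    print("cv2 {} loaded from {}".format(info["version"], info["file"]))
```

If the reported file still points at an old location after `pip install opencv-python`, the old copy is earlier on the import path and is the one Caffe's Python layers will get.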

francisdomoney · March 29, 2018, 6:09pm

Thanks. I have the same problem with DetectNet-COCO-Dog. I am trying your solution as we speak.

francisdomoney · March 29, 2018, 6:19pm

Still bombs out with Error 11

francisdomoney · March 29, 2018, 6:51pm

I am using protobuf 3.5

My caffe output file
/home/frank/Pictures/Screenshot from 2018-03-29 19-49-21.png

francisdomoney · March 29, 2018, 6:53pm

frank@UBUNTU-DT:~/DIGITS$ protoc --version
libprotoc 3.5.1

uwdaveh · March 29, 2018, 11:27pm

What worked for me was deleting the protobuf folder completely and then getting version 3.1 with this command:

git clone https://github.com/google/protobuf.git $PROTOBUF_ROOT -b '3.1.x'

After that I repeated the protobuf installation steps, and error 11 was gone.
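
For anyone comparing their setup with this one, a quick check is whether the `protoc` compiler and the Python protobuf package come from the same release series (3.5.1 was reported a few posts up, 3.1 is what worked here). The thread does not confirm that a mismatch is the cause, so treat this as a diagnostic sketch only. The helper names are invented for this example, and the comparison assumes the 3.x numbering used in this thread.

```python
import subprocess


def parse_version(text):
    """Leading numeric parts of '3.1.0.post1' or 'libprotoc 3.5.1'."""
    tokens = text.strip().split()
    if not tokens:
        return ()
    parts = []
    for piece in tokens[-1].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def versions_agree(compiler_text, runtime_text, depth=2):
    """Compare the first `depth` components (major.minor by default)."""
    a = parse_version(compiler_text)[:depth]
    b = parse_version(runtime_text)[:depth]
    return len(a) == depth and a == b


def protoc_version(executable="protoc"):
    """Output of `protoc --version`, or None if protoc cannot be run."""
    try:
        out = subprocess.check_output([executable, "--version"])
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.decode("utf-8", "replace").strip()


def runtime_version():
    """Version of the importable Python protobuf package, or None."""
    try:
        import google.protobuf
    except ImportError:
        return None
    return google.protobuf.__version__


compiler, runtime = protoc_version(), runtime_version()
if compiler is None or runtime is None:
    print("protoc: {}, Python protobuf: {}".format(compiler, runtime))
elif versions_agree(compiler, runtime):
    print("same series: {} / Python protobuf {}".format(compiler, runtime))
else:
    print("mismatch: {} vs Python protobuf {}".format(compiler, runtime))
```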

uwdaveh · March 29, 2018, 11:30pm

AastaLLL:

Hi,

I assume you are running DIGITS on an x86 Linux host.
You need to install OpenCV yourself, since the package included in JetPack is built for aarch64 systems.

If OpenCV is already installed correctly on the host, please try this command:
sudo pip install opencv-python
After that, please rebuild the NvCaffe library.

Thanks.

This finally fixed all my DIGITS problems :D

Thanks so much for all your help! ^^

AastaLLL · March 30, 2018, 2:16am

Good to know this : )

13126678366 · September 26, 2019, 11:43am

Hi AastaLLL, I encountered the same error. I don’t know what’s wrong with my DIGITS system. Could you please help me solve the problem? Thanks.

Tensorflow support disabled.
2019-09-26 19:08:03 [INFO ] Loaded 0 jobs.
2019-09-26 19:08:03 [WARNING] Failed to load 1 jobs.
2019-09-26 19:08:03 [DEBUG] 20190926-164454-2cfb - IOError: [Errno 2] No such file or directory: '/home/zhaoyu/DIGITS/digits/jobs/20190926-164454-2cfb/status.pickle'
2019-09-26 19:09:44 [20190926-190943-36cc] [INFO ] Parse Folder (train/val) task started.
2019-09-26 19:09:44 [20190926-190943-36cc] [INFO ] Task subprocess args: "/usr/bin/python2 /home/zhaoyu/DIGITS/digits/tools/parse_folder.py /home/zhaoyu/mnist/train /home/zhaoyu/DIGITS/digits/jobs/20190926-190943-36cc/labels.txt --min=2 --train_file=/home/zhaoyu/DIGITS/digits/jobs/20190926-190943-36cc/train.txt --val_file=/home/zhaoyu/DIGITS/digits/jobs/20190926-190943-36cc/val.txt --percent_val=25.0"
2019-09-26 19:09:45 [20190926-190943-36cc] [WARNING] Parse Folder (train/val) unrecognized output: Tensorflow support disabled.
*** Error in `/usr/bin/python2': free(): invalid pointer: 0x0000000001b45940 ***
======= Backtrace: =========

AastaLLL · October 16, 2019, 7:12am

Let’s follow up on this issue directly in topic 1063876:
https://devtalk.nvidia.com/default/topic/1063876/jetson-agx-xavier/digits-system-setup-installing-nvcaffe-on-the-host-problem/

Thanks
