Train Caffe Model task failed with error code -9

2018-10-02 19:23:53 [20181002-192352-5c78] [INFO ] Train Caffe Model task started.
2018-10-02 19:23:53 [20181002-192352-5c78] [INFO ] Task subprocess args: “/home/nvidia/caffe/build/tools/caffe train --solver=/home/nvidia/digits/digits/jobs/20181002-192352-5c78/solver.prototxt --gpu=0 --weights=/home/nvidia/jetson-inference/data/networks/bvlc_googlenet.caffemodel”
2018-10-02 19:29:56 [20181002-192352-5c78] [ERROR] Train Caffe Model task failed with error code -9

I’m following the NV Dusty jetson-inference tutorial on a Jetson TX2 and setting up the coco-dog model. Does anybody know what error code -9 means? Thanks!
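
For anyone landing here with the same question: the log shows DIGITS launching caffe as a subprocess, and Python reports a negative return code -N when the child process was terminated by signal N. Assuming that is where the number comes from, -9 would be SIGKILL (on a memory-limited board this is often the kernel's out-of-memory killer, though the log alone does not prove it) and -11 would be SIGSEGV, a segmentation fault. A minimal sketch for decoding such codes; `describe_exit_code` is just an illustrative helper, not part of DIGITS:

```python
import signal


def describe_exit_code(code):
    """Explain a return code in the convention used by Python's subprocess module."""
    if code < 0:
        # On POSIX, a negative return code -N means "terminated by signal N".
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-code} ({name})"
    if code == 0:
        return "exited normally"
    return f"exited with status {code}"


print(describe_exit_code(-9))   # killed by signal 9 (SIGKILL)
print(describe_exit_code(-11))  # killed by signal 11 (SIGSEGV)
```

If SIGKILL is the cause, `dmesg` output from around the time of the failure would normally mention the out-of-memory killer.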

batch size = 5 … I changed this to 2
batch accumulation = 2 … I changed this to 5
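
A quick sanity check on that change, as a sketch: as far as I understand it, batch accumulation sums the gradients of several smaller batches before each weight update, so the number of images per update is the product of the two settings. Swapping 5 x 2 for 2 x 5 should therefore keep the effective batch the same while lowering the memory needed per forward/backward pass. `effective_batch_size` is a made-up helper name:

```python
def effective_batch_size(batch_size, batch_accumulation):
    # Gradients from `batch_accumulation` passes are accumulated before one
    # weight update, so each update is based on this many images in total.
    return batch_size * batch_accumulation


before = effective_batch_size(5, 2)  # original settings
after = effective_batch_size(2, 5)   # settings after the change
print(before, after)  # 10 10
```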

The training ran for about 2.5 hours, 4 epochs, and then crashed with error code -11

Train net output #1: loss_coverage = 23.7879 (* 1 = 23.7879 loss)
Iteration 1392, lr = 2.41098e-05
Iteration 1440 (0.188151 iter/s, 255.114s/48 iter), loss = 13.4069
Train net output #0: loss_bbox = 0 (* 2 = 0 loss)
Train net output #1: loss_coverage = 0.0199222 (* 1 = 0.0199222 loss)
Iteration 1440, lr = 2.40797e-05
Iteration 1488 (0.1882 iter/s, 255.048s/48 iter), loss = 15.3147
Train net output #0: loss_bbox = 3.4747 (* 2 = 6.9494 loss)
Train net output #1: loss_coverage = 10.3565 (* 1 = 10.3565 loss)
Iteration 1488, lr = 2.40496e-05
Iteration 1536 (0.188236 iter/s, 255s/48 iter), loss = 17.0762
Train net output #0: loss_bbox = 1.83626 (* 2 = 3.67252 loss)
Train net output #1: loss_coverage = 10.4931 (* 1 = 10.4931 loss)
Iteration 1536, lr = 2.40195e-05
Snapshotting to binary proto file snapshot_iter_1544.caffemodel
Snapshotting solver state to binary proto file snapshot_iter_1544.solverstate
Iteration 1544, Testing net (#0)
Ignoring source layer train_data
Ignoring source layer train_label
Ignoring source layer train_transform

Train Caffe Model task failed with error code -11

Can I just download an already trained model for dogs instead? And what does the filename look like so I can recognise it? Thanks.

Hi,

Could you share the detailed error log with us?

If you are using the Python interface, could you also check whether this link helps?
https://github.com/NVIDIA/DIGITS/issues/1239

Thanks.

I tried training again, and this time it got to epoch 8 before failing with error 1:

I1005 00:07:17.098057 24253 solver.cpp:261] Train net output #0: loss_bbox = 0.876117 (* 2 = 1.75223 loss)
I1005 00:07:17.098083 24253 solver.cpp:261] Train net output #1: loss_coverage = 15.4955 (* 1 = 15.4955 loss)
I1005 00:07:17.098107 24253 sgd_solver.cpp:106] Iteration 2976, lr = 1.697e-05
I1005 00:11:29.746094 24253 solver.cpp:242] Iteration 3024 (0.189977 iter/s, 252.663s/48 iter), loss = 10.8942
I1005 00:11:29.746258 24253 solver.cpp:261] Train net output #0: loss_bbox = 2.79486 (* 2 = 5.58972 loss)
I1005 00:11:29.746282 24253 solver.cpp:261] Train net output #1: loss_coverage = 1.01521 (* 1 = 1.01521 loss)
I1005 00:11:29.746306 24253 sgd_solver.cpp:106] Iteration 3024, lr = 1.68643e-05
I1005 00:15:42.279007 24253 solver.cpp:242] Iteration 3072 (0.190063 iter/s, 252.547s/48 iter), loss = 11.2451
I1005 00:15:42.279167 24253 solver.cpp:261] Train net output #0: loss_bbox = 2.16051 (* 2 = 4.32101 loss)
I1005 00:15:42.279191 24253 solver.cpp:261] Train net output #1: loss_coverage = 16.4544 (* 1 = 16.4544 loss)
I1005 00:15:42.279214 24253 sgd_solver.cpp:106] Iteration 3072, lr = 1.67592e-05
I1005 00:17:01.308733 24253 solver.cpp:479] Snapshotting to binary proto file snapshot_iter_3088.caffemodel
I1005 00:17:01.877044 24253 sgd_solver.cpp:273] Snapshotting solver state to binary proto file snapshot_iter_3088.solverstate
I1005 00:17:01.971050 24253 solver.cpp:362] Iteration 3088, Testing net (#0)
I1005 00:17:01.971097 24253 net.cpp:723] Ignoring source layer train_data
I1005 00:17:01.971108 24253 net.cpp:723] Ignoring source layer train_label
I1005 00:17:01.971115 24253 net.cpp:723] Ignoring source layer train_transform
OpenCV Error: Assertion failed (mtype == type0 || (((((mtype) & ((512 - 1) << 3)) >> 3) + 1) == 1 && ((1 << type0) & fixedDepthMask) != 0)) in create, file /home/nvidia/build-opencv/opencv/modules/core/src/matrix.cpp, line 2542
OpenCV Error: Assertion failed (mtype == type0 || (((((mtype) & ((512 - 1) << 3)) >> 3) + 1) == 1 && ((1 << type0) & fixedDepthMask) != 0)) in create, file /home/nvidia/build-opencv/opencv/modules/core/src/matrix.cpp, line 2542
Traceback (most recent call last):
  File "/home/nvidia/caffe/python/caffe/layers/detectnet/clustering.py", line 133, in forward
    bbox = cluster(self, data0, bottom[1].data)
  File "/home/nvidia/caffe/python/caffe/layers/detectnet/clustering.py", line 227, in cluster
    boxes_cur_image = vote_boxes(propose_boxes, propose_cvgs, mask, self)
  File "/home/nvidia/caffe/python/caffe/layers/detectnet/clustering.py", line 193, in vote_boxes
    self.gridbox_rect_eps)
cv2.error: /home/nvidia/build-opencv/opencv/modules/core/src/matrix.cpp:2542: error: (-215) mtype == type0 || (((((mtype) & ((512 - 1) << 3)) >> 3) + 1) == 1 && ((1 << type0) & fixedDepthMask) != 0) in function create

Am I getting this error because DIGITS won’t train on the TX2? My other option is to use the NVIDIA cloud system, which should be better, I think?

Could it be due to problems accessing my external HDD? Should I switch to an SSD?

Hi,

Could you check how much free disk space you have first?
We just want to make sure this is not caused by running out of storage.
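
In case it is useful, here is a small sketch for checking this from Python with the standard library. It checks the root filesystem by default; the assumption is that you would also point it at the DIGITS jobs directory and at the mount point of the external drive. `free_gib` is only an illustrative helper name:

```python
import shutil


def free_gib(path="/"):
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30


# Replace "/" with e.g. the jobs directory or the USB drive's mount point.
print(f"{free_gib('/'):.1f} GiB free on /")
```
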

Thanks.

I’ve got 5.2 GB of free space on the Jetson, and the dog photos are on a mechanical USB hard drive.

I found an easy workaround: set the snapshot and validation intervals to the same value as the number of training epochs, e.g.

Training epochs = 16
Snapshot interval (in epochs) = 16
Validation interval (in epochs) = 16

I still get an error, but it’s after the snapshot at epoch 16 has been made :)
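
My guess at why this helps, with the arithmetic as a sketch: in both failed runs the last log lines are a snapshot followed by "Testing net", and the second traceback is inside the clustering layer, so the crash seems to happen during validation. Assuming the epoch counts I quoted are exact (snapshot at iteration 1544 after 4 epochs, 3088 after 8), an epoch is 386 iterations, and with both intervals set to 16 the validation pass only runs once, right after the final snapshot:

```python
# Snapshot iterations taken from the logs above, keyed by the epoch reported
# in the posts (assumed exact, not "about").
snapshots = {4: 1544, 8: 3088}

# Both runs should agree on the number of iterations per epoch.
per_epoch = {iteration // epoch for epoch, iteration in snapshots.items()}
assert len(per_epoch) == 1
iters_per_epoch = per_epoch.pop()

epochs = 16
interval_epochs = 16  # used for both the snapshot and validation intervals
final_iter = iters_per_epoch * epochs
step = iters_per_epoch * interval_epochs
validation_iters = list(range(step, final_iter + 1, step))

print(iters_per_epoch, final_iter, validation_iters)  # 386 6176 [6176]
```

So the run still fails, but only after the weights for iteration 6176 should already be saved.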

A video walkthrough of natively installing NVIDIA DIGITS on Ubuntu 18.04 LTS is available here:

https://cudaeducation.com/nvidiadigits/

-Cuda Education
cudaeducation.com