Train Caffe Model task failed with error code -9

TegwynTwmffat · October 2, 2018, 6:49pm

2018-10-02 19:23:53 [20181002-192352-5c78] [INFO ] Train Caffe Model task started.
2018-10-02 19:23:53 [20181002-192352-5c78] [INFO ] Task subprocess args: "/home/nvidia/caffe/build/tools/caffe train --solver=/home/nvidia/digits/digits/jobs/20181002-192352-5c78/solver.prototxt --gpu=0 --weights=/home/nvidia/jetson-inference/data/networks/bvlc_googlenet.caffemodel"
2018-10-02 19:29:56 [20181002-192352-5c78] [ERROR] Train Caffe Model task failed with error code -9

I’m following the NV Dusty inference tutorial on a Jetson TX2 and setting up the coco-dog model. Does anybody know what error code -9 means? … Thanks!
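
A note on reading these codes: when a Python program launches a child with the `subprocess` module on Linux, a negative return code -N means the child was terminated by signal N. Assuming DIGITS reports the Caffe subprocess return code unchanged (an assumption, not confirmed in this thread), -9 would correspond to SIGKILL and -11 to SIGSEGV. A SIGKILL on a memory-constrained board is often the kernel's out-of-memory killer, but that is only a guess until checked against `dmesg`. A minimal sketch for decoding such codes, where `describe_return_code` is a hypothetical helper name:

```python
import signal


def describe_return_code(code):
    """Translate a subprocess return code into a readable description."""
    if code < 0:
        # Negative codes mean "terminated by signal -code" (POSIX only).
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return "killed by signal {} ({})".format(-code, name)
    if code == 0:
        return "exited normally"
    return "exited with status {}".format(code)


# The three codes that appear in this thread.
for code in (-9, -11, 1):
    print(code, "->", describe_return_code(code))
```

The snippet needs Python 3.5 or later for `signal.Signals`.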

TegwynTwmffat · October 2, 2018, 9:58pm

batch size = 5 … I changed this to 2
batch accumulation = 2 … I changed this to 5
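
Why this swap is a reasonable thing to try: in Caffe, gradients are accumulated over several mini-batches before each weight update, so the number of images contributing to one update is the batch size times the accumulation count. The sketch below assumes the DIGITS "batch accumulation" field maps to that accumulation count; `effective_batch_size` is a hypothetical helper name.

```python
def effective_batch_size(batch_size, batch_accumulation):
    """Images contributing to one weight update under gradient accumulation."""
    return batch_size * batch_accumulation


before = effective_batch_size(5, 2)  # original settings
after = effective_batch_size(2, 5)   # settings after the change
print("before:", before, "after:", after)
```

Both settings give the same effective batch, but only 2 images instead of 5 are held on the GPU at a time, which should lower peak memory use per step.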

The training ran for about 2.5 hours, 4 epochs, and then crashed with error code -11

Train net output #1: loss_coverage = 23.7879 (* 1 = 23.7879 loss)
Iteration 1392, lr = 2.41098e-05
Iteration 1440 (0.188151 iter/s, 255.114s/48 iter), loss = 13.4069
Train net output #0: loss_bbox = 0 (* 2 = 0 loss)
Train net output #1: loss_coverage = 0.0199222 (* 1 = 0.0199222 loss)
Iteration 1440, lr = 2.40797e-05
Iteration 1488 (0.1882 iter/s, 255.048s/48 iter), loss = 15.3147
Train net output #0: loss_bbox = 3.4747 (* 2 = 6.9494 loss)
Train net output #1: loss_coverage = 10.3565 (* 1 = 10.3565 loss)
Iteration 1488, lr = 2.40496e-05
Iteration 1536 (0.188236 iter/s, 255s/48 iter), loss = 17.0762
Train net output #0: loss_bbox = 1.83626 (* 2 = 3.67252 loss)
Train net output #1: loss_coverage = 10.4931 (* 1 = 10.4931 loss)
Iteration 1536, lr = 2.40195e-05
Snapshotting to binary proto file snapshot_iter_1544.caffemodel
Snapshotting solver state to binary proto file snapshot_iter_1544.solverstate
Iteration 1544, Testing net (#0)
Ignoring source layer train_data
Ignoring source layer train_label
Ignoring source layer train_transform

Train Caffe Model task failed with error code -11

Can I just download and use a completed training for dogs? … And what does the filename look like so I can recognise it? … Thanks.

AastaLLL · October 3, 2018, 5:47am

Hi,

Could you share the detailed error log with us?

If you are using the Python interface, could you also check whether this link helps?
https://github.com/NVIDIA/DIGITS/issues/1239

Thanks.

TegwynTwmffat · October 5, 2018, 7:43am

I tried training again and this time it got to epoch 8 before error message 1:

I1005 00:07:17.098057 24253 solver.cpp:261] Train net output #0: loss_bbox = 0.876117 (* 2 = 1.75223 loss)
I1005 00:07:17.098083 24253 solver.cpp:261] Train net output #1: loss_coverage = 15.4955 (* 1 = 15.4955 loss)
I1005 00:07:17.098107 24253 sgd_solver.cpp:106] Iteration 2976, lr = 1.697e-05
I1005 00:11:29.746094 24253 solver.cpp:242] Iteration 3024 (0.189977 iter/s, 252.663s/48 iter), loss = 10.8942
I1005 00:11:29.746258 24253 solver.cpp:261] Train net output #0: loss_bbox = 2.79486 (* 2 = 5.58972 loss)
I1005 00:11:29.746282 24253 solver.cpp:261] Train net output #1: loss_coverage = 1.01521 (* 1 = 1.01521 loss)
I1005 00:11:29.746306 24253 sgd_solver.cpp:106] Iteration 3024, lr = 1.68643e-05
I1005 00:15:42.279007 24253 solver.cpp:242] Iteration 3072 (0.190063 iter/s, 252.547s/48 iter), loss = 11.2451
I1005 00:15:42.279167 24253 solver.cpp:261] Train net output #0: loss_bbox = 2.16051 (* 2 = 4.32101 loss)
I1005 00:15:42.279191 24253 solver.cpp:261] Train net output #1: loss_coverage = 16.4544 (* 1 = 16.4544 loss)
I1005 00:15:42.279214 24253 sgd_solver.cpp:106] Iteration 3072, lr = 1.67592e-05
I1005 00:17:01.308733 24253 solver.cpp:479] Snapshotting to binary proto file snapshot_iter_3088.caffemodel
I1005 00:17:01.877044 24253 sgd_solver.cpp:273] Snapshotting solver state to binary proto file snapshot_iter_3088.solverstate
I1005 00:17:01.971050 24253 solver.cpp:362] Iteration 3088, Testing net (#0)
I1005 00:17:01.971097 24253 net.cpp:723] Ignoring source layer train_data
I1005 00:17:01.971108 24253 net.cpp:723] Ignoring source layer train_label
I1005 00:17:01.971115 24253 net.cpp:723] Ignoring source layer train_transform
OpenCV Error: Assertion failed (mtype == type0 || (((((mtype) & ((512 - 1) << 3)) >> 3) + 1) == 1 && ((1 << type0) & fixedDepthMask) != 0)) in create, file /home/nvidia/build-opencv/opencv/modules/core/src/matrix.cpp, line 2542
OpenCV Error: Assertion failed (mtype == type0 || (((((mtype) & ((512 - 1) << 3)) >> 3) + 1) == 1 && ((1 << type0) & fixedDepthMask) != 0)) in create, file /home/nvidia/build-opencv/opencv/modules/core/src/matrix.cpp, line 2542
Traceback (most recent call last):
  File "/home/nvidia/caffe/python/caffe/layers/detectnet/clustering.py", line 133, in forward
    bbox = cluster(self, data0, bottom[1].data)
  File "/home/nvidia/caffe/python/caffe/layers/detectnet/clustering.py", line 227, in cluster
    boxes_cur_image = vote_boxes(propose_boxes, propose_cvgs, mask, self)
  File "/home/nvidia/caffe/python/caffe/layers/detectnet/clustering.py", line 193, in vote_boxes
    self.gridbox_rect_eps)
cv2.error: /home/nvidia/build-opencv/opencv/modules/core/src/matrix.cpp:2542: error: (-215) mtype == type0 || (((((mtype) & ((512 - 1) << 3)) >> 3) + 1) == 1 && ((1 << type0) & fixedDepthMask) != 0) in function create

TegwynTwmffat · October 5, 2018, 8:11am

Am I getting this error because DIGITS won’t train on the TX2? My other option is to use the Nvidia cloud system, which should be better, I think?

TegwynTwmffat · October 8, 2018, 12:07pm

Could it be due to problems accessing my external HDD? Should I switch to SSD technology?

AastaLLL · October 11, 2018, 8:15am

Hi,

Could you check your free disk space first?
We just want to make sure this is not caused by running out of storage.

Thanks.
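
One way to run the free-space check asked for above is with Python's standard library. This is a minimal sketch: the jobs directory is taken from the log earlier in the thread, `free_gib` is a hypothetical helper name, and the mount point of the external drive is not given in the thread, so it would need to be added by hand.

```python
import os
import shutil


def free_gib(path):
    """Return the free space, in GiB, on the filesystem containing path."""
    return shutil.disk_usage(path).free / 2**30


# "/" always exists; the DIGITS jobs directory comes from the log above.
# Add the mount point of the external drive here as well.
candidates = ["/", "/home/nvidia/digits/digits/jobs"]

for path in candidates:
    if os.path.exists(path):
        print("{}: {:.1f} GiB free".format(path, free_gib(path)))
    else:
        print("{}: not found on this machine".format(path))
```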

TegwynTwmffat · October 11, 2018, 9:10pm

I’ve got 5.2 GB of free space on the Jetson, and the dog photos are on a mechanical USB hard drive.

TegwynTwmffat · October 14, 2018, 9:33pm

I found an easy workaround: set the snapshot and validation intervals to the same value as the number of training epochs, e.g.

Training epochs = 16
Snapshot interval (in epochs) = 16
Validation interval (in epochs) = 16

I still get an error, but it’s after the snapshot at epoch 16 has been made :)
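
To estimate where that final snapshot lands, the earlier logs can be used. This is a rough sketch that assumes the two snapshots in the logs (iteration 1544 at epoch 4, iteration 3088 at epoch 8) fall exactly on epoch boundaries, and that the speed stays near the roughly 0.19 iter/s reported there. The helper names are hypothetical.

```python
# Snapshot iterations from the logs, keyed by the epoch reached.
# Assumption: these snapshots fall exactly on epoch boundaries.
observed = {4: 1544, 8: 3088}

# Both observations should imply the same number of iterations per epoch.
rates = {iteration // epoch for epoch, iteration in observed.items()}
assert len(rates) == 1, "observations disagree on iterations per epoch"
iters_per_epoch = rates.pop()


def final_iteration(epochs):
    """Iteration at which a run of the given number of epochs ends."""
    return epochs * iters_per_epoch


def estimated_hours(iterations, iters_per_second=0.19):
    """Wall-clock estimate from the iteration speed seen in the logs."""
    return iterations / iters_per_second / 3600


last = final_iteration(16)
print("iterations per epoch:", iters_per_epoch)   # 386
print("snapshot expected at iteration:", last)    # 6176
print("estimated hours: {:.1f}".format(estimated_hours(last)))
```

Under these assumptions the single snapshot would be written around iteration 6176, after roughly nine hours of training.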

cudaeducation · April 16, 2019, 8:56am

A video walkthrough of natively installing NVIDIA DIGITS on Ubuntu 18.04 LTS is available here:

-Cuda Education
