Hello!
I’m finally processing my photos on an AWS GPU and it appears to be working: with 2,700 photos it rattles through 300 epochs in just over 2 hours :)
However, the graph only displays loss_bbox (train) and loss_coverage (train) and nothing else. It would be really useful to see some accuracy stats, e.g. mAP.
I’m using nvcr.io/nvidia/digits:18.10. Does version 18.11 fix this problem or is there something else I’m missing here? Is it because the training is just not converging at all?
Here’s the final section of the Caffe log: … Thanks!
I1125 16:04:01.751603 122 solver.cpp:333] [0.0] Iteration 63102 (8.37129 iter/s, 3.10585s/26 iter), 299.2/300.1ep, loss = 62.6671
I1125 16:04:01.751641 122 solver.cpp:361] [0.0] Train net output #0: loss_bbox = 1.33363 (* 2 = 2.66725 loss)
I1125 16:04:01.751652 122 solver.cpp:361] [0.0] Train net output #1: loss_coverage = 28.2475 (* 1 = 28.2475 loss)
I1125 16:04:01.751662 122 sgd_solver.cpp:180] [0.0] Iteration 63102, lr = 0.00133686, m = 0.9, lrm = 0.0133686, wd = 0.0001, gs = 1
I1125 16:04:04.856482 122 solver.cpp:333] [0.0] Iteration 63128 (8.37389 iter/s, 3.10489s/26 iter), 299.3/300.1ep, loss = 47.7302
I1125 16:04:04.856518 122 solver.cpp:361] [0.0] Train net output #0: loss_bbox = 0.612215 (* 2 = 1.22443 loss)
I1125 16:04:04.856529 122 solver.cpp:361] [0.0] Train net output #1: loss_coverage = 14.7535 (* 1 = 14.7535 loss)
I1125 16:04:04.856539 122 sgd_solver.cpp:180] [0.0] Iteration 63128, lr = 0.00133575, m = 0.9, lrm = 0.0133575, wd = 0.0001, gs = 1
I1125 16:04:07.997928 122 solver.cpp:333] [0.0] Iteration 63154 (8.27649 iter/s, 3.14143s/26 iter), 299.4/300.1ep, loss = 65.7498
I1125 16:04:07.998165 122 solver.cpp:361] [0.0] Train net output #0: loss_bbox = 9.23087 (* 2 = 18.4617 loss)
I1125 16:04:07.998178 122 solver.cpp:361] [0.0] Train net output #1: loss_coverage = 15.5358 (* 1 = 15.5358 loss)
I1125 16:04:07.998188 122 sgd_solver.cpp:180] [0.0] Iteration 63154, lr = 0.00133465, m = 0.9, lrm = 0.0133465, wd = 0.0001, gs = 1
I1125 16:04:11.123752 122 solver.cpp:333] [0.0] Iteration 63180 (8.3178 iter/s, 3.12583s/26 iter), 299.6/300.1ep, loss = 86.7913
I1125 16:04:11.123792 122 solver.cpp:361] [0.0] Train net output #0: loss_bbox = 3.1863 (* 2 = 6.3726 loss)
I1125 16:04:11.123802 122 solver.cpp:361] [0.0] Train net output #1: loss_coverage = 48.6664 (* 1 = 48.6664 loss)
I1125 16:04:11.123812 122 sgd_solver.cpp:180] [0.0] Iteration 63180, lr = 0.00133354, m = 0.9, lrm = 0.0133354, wd = 0.0001, gs = 1
I1125 16:04:14.281015 122 solver.cpp:333] [0.0] Iteration 63206 (8.23498 iter/s, 3.15726s/26 iter), 299.7/300.1ep, loss = 48.9389
I1125 16:04:14.281065 122 solver.cpp:361] [0.0] Train net output #0: loss_bbox = 0.586028 (* 2 = 1.17206 loss)
I1125 16:04:14.281081 122 solver.cpp:361] [0.0] Train net output #1: loss_coverage = 16.0145 (* 1 = 16.0145 loss)
I1125 16:04:14.281095 122 sgd_solver.cpp:180] [0.0] Iteration 63206, lr = 0.00133244, m = 0.9, lrm = 0.0133244, wd = 0.0001, gs = 1
I1125 16:04:17.436023 122 solver.cpp:333] [0.0] Iteration 63232 (8.24083 iter/s, 3.15502s/26 iter), 299.8/300.1ep, loss = 62.0538
I1125 16:04:17.436064 122 solver.cpp:361] [0.0] Train net output #0: loss_bbox = 1.52028 (* 2 = 3.04057 loss)
I1125 16:04:17.436074 122 solver.cpp:361] [0.0] Train net output #1: loss_coverage = 27.2609 (* 1 = 27.2609 loss)
I1125 16:04:17.436085 122 sgd_solver.cpp:180] [0.0] Iteration 63232, lr = 0.00133133, m = 0.9, lrm = 0.0133133, wd = 0.0001, gs = 1
I1125 16:04:18.395835 164 data_reader.cpp:321] Restarting data pre-fetching
I1125 16:04:18.546618 166 data_reader.cpp:321] Restarting data pre-fetching
I1125 16:04:20.560073 122 solver.cpp:333] [0.0] Iteration 63258 (8.32254 iter/s, 3.12405s/26 iter), 299.9/300.1ep, loss = 47.1324
I1125 16:04:20.560114 122 solver.cpp:361] [0.0] Train net output #0: loss_bbox = 0.808282 (* 2 = 1.61656 loss)
I1125 16:04:20.560124 122 solver.cpp:361] [0.0] Train net output #1: loss_coverage = 13.7635 (* 1 = 13.7635 loss)
I1125 16:04:20.560135 122 sgd_solver.cpp:180] [0.0] Iteration 63258, lr = 0.00133023, m = 0.9, lrm = 0.0133023, wd = 0.0001, gs = 1
I1125 16:04:23.663071 122 solver.cpp:333] [0.0] Iteration 63284 (8.37897 iter/s, 3.10301s/26 iter), 300.1/300.1ep, loss = 53.6399
I1125 16:04:23.663107 122 solver.cpp:361] [0.0] Train net output #0: loss_bbox = 1.36236 (* 2 = 2.72472 loss)
I1125 16:04:23.663117 122 solver.cpp:361] [0.0] Train net output #1: loss_coverage = 19.1629 (* 1 = 19.1629 loss)
I1125 16:04:23.663127 122 sgd_solver.cpp:180] [0.0] Iteration 63284, lr = 0.00132913, m = 0.9, lrm = 0.0132913, wd = 0.0001, gs = 1
I1125 16:04:25.453852 122 solver.cpp:333] [0.0] Iteration 63300 (8.37628 iter/s, 1.79077s/15 iter), 300.1/300.1ep, loss = 68.1711
I1125 16:04:25.453889 122 solver.cpp:361] [0.0] Train net output #0: loss_bbox = 1.57427 (* 2 = 3.14854 loss)
I1125 16:04:25.453899 122 solver.cpp:361] [0.0] Train net output #1: loss_coverage = 33.2703 (* 1 = 33.2703 loss)
I1125 16:04:25.453910 122 solver.cpp:769] Snapshotting to binary proto file snapshot_iter_63300.caffemodel
I1125 16:04:25.496446 122 sgd_solver.cpp:419] Snapshotting solver state to binary proto file snapshot_iter_63300.solverstate
I1125 16:04:25.533502 122 solver.cpp:501] Iteration 63300, Testing net (#0)
I1125 16:04:33.152792 149 data_reader.cpp:321] Restarting data pre-fetching
I1125 16:04:33.300627 146 data_reader.cpp:321] Restarting data pre-fetching
I1125 16:04:34.276348 122 solver.cpp:588] (0.0) Test net output #0: loss_bbox = 1.08538 (* 2 = 2.17077 loss)
I1125 16:04:34.276383 122 solver.cpp:588] (0.0) Test net output #1: loss_coverage = 26.1989 (* 1 = 26.1989 loss)
I1125 16:04:34.276424 122 solver.cpp:588] (0.0) Test net output #2: mAP = 0
I1125 16:04:34.276433 122 solver.cpp:588] (0.0) Test net output #3: precision = 0
I1125 16:04:34.276438 122 solver.cpp:588] (0.0) Test net output #4: recall = 0
I1125 16:04:34.276461 122 caffe.cpp:265] Solver performance on device 0: 8.144 * 10 = 81.44 img/sec (63300 itr in 7772 sec)
I1125 16:04:34.276476 122 caffe.cpp:269] Optimization Done in 2h 10m 51s
Screenshot of graph: