Empty iterations?

I don’t understand why the training seems to skip some iterations!

I am running:

!tao model faster_rcnn train -e $SPECS_DIR/default_spec_resnet50-1Class.txt \
                             --gpus 4 \
                             -r /workspace/tao-experiments/faster_rcnn

on the attached specs.
specs.txt (4.1 KB)

Please share the full training log. Thanks.

Log - 29c6fc97df784cffafc588bff43663a2.txt (501.3 KB)
Please find attached. Best

From the log, the dense_class_td_loss is available in each epoch.

Indeed, the information is available per epoch.

ClearML gives the plots per iteration (and also by wall time and time from start).
I thought it might be a ClearML issue, so I compared how TensorBoard visualizes it and I am attaching a comparison.

Is there anything I should change in the visualizer configuration to obtain better logs?
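For reference, a minimal sketch of reporting the per-epoch losses to ClearML manually, using the epoch number as the iteration index instead of relying on the automatic per-step capture. The project/task names and loss values below are placeholders, not taken from the actual run:

from clearml import Task

# Placeholder names; substitute the real experiment identifiers.
task = Task.init(project_name="tao-experiments", task_name="faster_rcnn_scalars")
logger = task.get_logger()

# Report one point per epoch so the x-axis stays contiguous and
# validation steps never appear as separate, empty iterations.
for epoch, loss in enumerate([0.9, 0.7, 0.6], start=1):  # dummy loss values
    logger.report_scalar(title="loss", series="train", value=loss, iteration=epoch)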

I’ve zoomed in again to investigate the graph, and it seems that the skipped iterations are always multiples of 68. First there is a jump from X to X + 68, then some reasonable points are plotted, then there is another jump from Y to Y + 2*68, and so on. A bit strange :)

From the log, for example,

1695689795016 b276b83425a3 error INFO: Training loop in progress
1695689795017 b276b83425a3 info Epoch 118/2000
1695689869181 b276b83425a3 info 68/68 [==============================] - 74s 1s/step - loss: 0.5355 - rpn_out_class_loss: 0.0135 - rpn_out_regress_loss: 0.0053 - dense_class_td_loss: 0.1007 - dense_regress_td_loss: 0.0775
1695689873993 b276b83425a3 info Doing validation at epoch 118(1-based index)...
1695689874071 b276b83425a3 error   0%|          | 0/43 [00:00<?, ?it/s]
1695689884309 b276b83425a3 error  60%|██████    | 26/43 [00:10<00:06,  2.55it/s]
1695689890983 b276b83425a3 error 100%|██████████| 43/43 [00:16<00:00,  2.54it/s]
1695689890984 b276b83425a3 info 
Class               AP                  precision           recall              RPN_recall
rumex               0.0294              0.0060              0.3276              0.3966
mAP@0.5 = 0.0294
Validation done!

The training takes 68 steps per epoch.
Validation takes 43 steps. Also, the progress-bar lines are unexpectedly logged at the “error” level.
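As a rough illustration (this is not TAO’s actual logging code), one way a single shared step counter could leave empty stretches in a per-iteration loss plot is if validation batches advance the counter without reporting a loss. The 68/43 step counts come from the log above; everything else is assumed:

# Assumed mechanism: every batch, training or validation, advances
# one global iteration counter, but only training batches report a loss.
STEPS_PER_EPOCH = 68   # from the log: "68/68 [====...]"
VAL_STEPS = 43         # from the log: "43/43 [====...]"

loss_iterations = []   # iterations at which a training loss is reported
global_step = 0
for epoch in range(1, 4):
    for _ in range(STEPS_PER_EPOCH):
        global_step += 1
        loss_iterations.append(global_step)
    # Validation advances the counter but reports no loss, which shows
    # up as an empty stretch in the per-iteration plot.
    global_step += VAL_STEPS

print(loss_iterations[-1], global_step)  # 290 vs. 333 after 3 epochs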

To narrow this down, you can set a larger validation_period_during_training: 500 and rerun; with validation running far less often, the gaps should become much rarer if validation is the cause. Then check the graph again.

I have rerun with validation_period_during_training: 500, but I am still getting the same behavior. Attached is the output log.
Log - c7d06c66bc4247598f3aead3b4726dce.txt (106.1 KB)

Do you mean the same empty iterations?

Could you run with 1 gpu instead of 4 gpus?

Yes, by ‘same behavior’ I meant the same empty or skipped iterations.

I have run the training with 1 GPU:

!tao model faster_rcnn train -e $SPECS_DIR/specs.txt \
                             --gpus 1 \
                             -r /workspace/tao-experiments/faster_rcnn

and I am still getting the same behavior.

I suggest you run the official notebook with the public KITTI dataset to check whether the issue can be reproduced.

Hi Morganh,

I’ve run the official notebook, and the issue is actually reproducible with the notebook and specs as-is.

It seems the behavior is expected, since the evaluation stage is running during these iterations.

Is there any reason why this behavior would be implemented for FasterRCNN but not for other networks?

During evaluation, there is no update to dense_class_td_loss.
During training, dense_class_td_loss is updated.
So, the iterations you observed should be in the evaluation stage.
Other networks also have this functionality of running evaluation during training.
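If the gaps are only a plotting nuisance, one workaround is to export the scalars and drop the evaluation-only rows before plotting. A minimal sketch, assuming a CSV export with “iteration” and “dense_class_td_loss” columns (the file name and column names are assumptions, not a documented export format):

import pandas as pd
import matplotlib.pyplot as plt

# "scalars.csv" and the column names are assumed; adjust to the real export.
df = pd.read_csv("scalars.csv")
train_only = df.dropna(subset=["dense_class_td_loss"])  # drop eval-only rows

plt.plot(train_only["iteration"], train_only["dense_class_td_loss"])
plt.xlabel("iteration")
plt.ylabel("dense_class_td_loss")
plt.show()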
