Empty iterations?

I don’t understand why the training seems to skip some iterations!

I am running:

!tao model faster_rcnn train -e $SPECS_DIR/default_spec_resnet50-1Class.txt \
                             --gpus 4 \
                             -r /workspace/tao-experiments/faster_rcnn

on the attached specs.
specs.txt (4.1 KB)

Please share the full training log. Thanks.

Log - 29c6fc97df784cffafc588bff43663a2.txt (501.3 KB)
Please find attached. Best

From the log, the dense_class_td_loss is available in each epoch.

Indeed, the information is available per epoch.

ClearML gives the plots per iteration (and also by wall time and time from start).
I thought it might be a ClearML issue, so I compared how TensorBoard visualizes it and I am attaching a comparison.

Is there anything I should change in the visualizer configuration to obtain better logs?
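For reference, a minimal sketch of reporting the per-epoch losses to ClearML manually, using the epoch number as the iteration index instead of relying on the automatic per-step capture. The project/task names and loss values below are placeholders, not taken from the actual run:

from clearml import Task

# Placeholder names; substitute the real experiment identifiers.
task = Task.init(project_name="tao-experiments", task_name="faster_rcnn_scalars")
logger = task.get_logger()

# Report one point per epoch so the x-axis stays contiguous and
# validation steps never appear as separate, empty iterations.
for epoch, loss in enumerate([0.9, 0.7, 0.6], start=1):  # dummy loss values
    logger.report_scalar(title="loss", series="train", value=loss, iteration=epoch)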

I’ve zoomed in again to investigate the graph, and it seems that the skipped iterations are always multiples of 68. First there is a jump from X to X + 68, then some reasonable points are plotted, then there is another jump from Y to Y + 2*68, and so on. A bit strange :)

From the log, for example,

1695689795016 b276b83425a3 error INFO: Training loop in progress
1695689795017 b276b83425a3 info Epoch 118/2000
1695689869181 b276b83425a3 info 68/68 [==============================] - 74s 1s/step - loss: 0.5355 - rpn_out_class_loss: 0.0135 - rpn_out_regress_loss: 0.0053 - dense_class_td_loss: 0.1007 - dense_regress_td_loss: 0.0775
1695689873993 b276b83425a3 info Doing validation at epoch 118(1-based index)...
1695689874071 b276b83425a3 error   0%|          | 0/43 [00:00<?, ?it/s]
1695689884309 b276b83425a3 error  60%|██████    | 26/43 [00:10<00:06,  2.55it/s]
1695689890983 b276b83425a3 error 100%|██████████| 43/43 [00:16<00:00,  2.54it/s]
1695689890984 b276b83425a3 info 
Class               AP                  precision           recall              RPN_recall
rumex               0.0294              0.0060              0.3276              0.3966
mAP@0.5 = 0.0294
Validation done!

The training takes 68 steps per epoch.
Validation takes 43 steps. Also, the progress-bar lines are unexpectedly logged at the “error” level.
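As a rough illustration (this is not TAO’s actual logging code), one way a single shared step counter could leave empty stretches in a per-iteration loss plot is if validation batches advance the counter without reporting a loss. The 68/43 step counts come from the log above; everything else is assumed:

# Assumed mechanism: every batch, training or validation, advances
# one global iteration counter, but only training batches report a loss.
STEPS_PER_EPOCH = 68   # from the log: "68/68 [====...]"
VAL_STEPS = 43         # from the log: "43/43 [====...]"

loss_iterations = []   # iterations at which a training loss is reported
global_step = 0
for epoch in range(1, 4):
    for _ in range(STEPS_PER_EPOCH):
        global_step += 1
        loss_iterations.append(global_step)
    # Validation advances the counter but reports no loss, which shows
    # up as an empty stretch in the per-iteration plot.
    global_step += VAL_STEPS

print(loss_iterations[-1], global_step)  # 290 vs. 333 after 3 epochs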

To narrow this down, you can set a larger validation_period_during_training: 500 and rerun; with validation running far less often, the gaps should become much rarer if validation is the cause. Then check the graph again.

I have rerun with validation_period_during_training: 500, but I am still getting the same behavior. Attached is the output log.
Log - c7d06c66bc4247598f3aead3b4726dce.txt (106.1 KB)

Do you mean the same empty iterations?

Could you run with 1 gpu instead of 4 gpus?

Yes, by ‘same behavior’ I meant the same empty or skipped iterations.

I have run the training with 1 GPU:

!tao model faster_rcnn train -e $SPECS_DIR/specs.txt \
                             --gpus 1 \
                             -r /workspace/tao-experiments/faster_rcnn

and I am still getting the same behavior.

I suggest you run the official notebook with the public KITTI dataset to check whether the issue can be reproduced.

Hi Morganh,

I’ve run the official notebook, and the issue is actually reproducible with the notebook and specs as-is.

It seems the behavior is expected, since the evaluation stage is running during these iterations.

Is there any reason why this behavior would be implemented for FasterRCNN but not for other networks?

During evaluation, there is no update to dense_class_td_loss.
During training, dense_class_td_loss is updated.
So, the iterations you observed should be in the evaluation stage.
Other networks also have this functionality of running evaluation during training.
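If the gaps are only a plotting nuisance, one workaround is to export the scalars and drop the evaluation-only rows before plotting. A minimal sketch, assuming a CSV export with “iteration” and “dense_class_td_loss” columns (the file name and column names are assumptions, not a documented export format):

import pandas as pd
import matplotlib.pyplot as plt

# "scalars.csv" and the column names are assumed; adjust to the real export.
df = pd.read_csv("scalars.csv")
train_only = df.dropna(subset=["dense_class_td_loss"])  # drop eval-only rows

plt.plot(train_only["iteration"], train_only["dense_class_td_loss"])
plt.xlabel("iteration")
plt.ylabel("dense_class_td_loss")
plt.show()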
