Empty iterations?

I don’t understand why the training seems to skip some iterations!

I am running:

!tao model faster_rcnn train -e $SPECS_DIR/default_spec_resnet50-1Class.txt \
                             --gpus 4 \
                             -r /workspace/tao-experiments/faster_rcnn

on the attached specs.
specs.txt (4.1 KB)

Please share the full training log. Thanks.

Log - 29c6fc97df784cffafc588bff43663a2.txt (501.3 KB)
Please find attached. Best

From the log, the dense_class_td_loss is available in each epoch.

Indeed, the information is available per epoch.

ClearML gives the plots per iteration (and also by wall time and time from start).
I thought it might be a ClearML issue, so I compared how TensorBoard visualizes it and I am attaching a comparison.

Is there anything I should change in the visualizer configuration to obtain better logs?
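For reference, a minimal sketch of reporting the per-epoch losses to ClearML manually, using the epoch number as the iteration index instead of relying on the automatic per-step capture. The project/task names and loss values below are placeholders, not taken from the actual run:

from clearml import Task

# Placeholder names; substitute the real experiment identifiers.
task = Task.init(project_name="tao-experiments", task_name="faster_rcnn_scalars")
logger = task.get_logger()

# Report one point per epoch so the x-axis stays contiguous and
# validation steps never appear as separate, empty iterations.
for epoch, loss in enumerate([0.9, 0.7, 0.6], start=1):  # dummy loss values
    logger.report_scalar(title="loss", series="train", value=loss, iteration=epoch)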

I’ve zoomed in again to investigate the graph, and it seems that the skipped iterations are always multiples of 68. First there is a jump from X to X + 68, then some reasonable points are plotted, then there is another jump from Y to Y + 2*68, and so on. A bit strange :)

From the log, for example,

1695689795016 b276b83425a3 error INFO: Training loop in progress
1695689795017 b276b83425a3 info Epoch 118/2000
1695689869181 b276b83425a3 info 68/68 [==============================] - 74s 1s/step - loss: 0.5355 - rpn_out_class_loss: 0.0135 - rpn_out_regress_loss: 0.0053 - dense_class_td_loss: 0.1007 - dense_regress_td_loss: 0.0775
1695689873993 b276b83425a3 info Doing validation at epoch 118(1-based index)...
1695689874071 b276b83425a3 error   0%|          | 0/43 [00:00<?, ?it/s]
1695689884309 b276b83425a3 error  60%|██████    | 26/43 [00:10<00:06,  2.55it/s]
1695689890983 b276b83425a3 error 100%|██████████| 43/43 [00:16<00:00,  2.54it/s]
1695689890984 b276b83425a3 info 
Class               AP                  precision           recall              RPN_recall
rumex               0.0294              0.0060              0.3276              0.3966
mAP@0.5 = 0.0294
Validation done!

The training takes 68 steps per epoch.
Validation takes 43 steps. Also, the progress-bar lines are unexpectedly logged at the “error” level.
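As a rough illustration (this is not TAO’s actual logging code), one way a single shared step counter could leave empty stretches in a per-iteration loss plot is if validation batches advance the counter without reporting a loss. The 68/43 step counts come from the log above; everything else is assumed:

# Assumed mechanism: every batch, training or validation, advances
# one global iteration counter, but only training batches report a loss.
STEPS_PER_EPOCH = 68   # from the log: "68/68 [====...]"
VAL_STEPS = 43         # from the log: "43/43 [====...]"

loss_iterations = []   # iterations at which a training loss is reported
global_step = 0
for epoch in range(1, 4):
    for _ in range(STEPS_PER_EPOCH):
        global_step += 1
        loss_iterations.append(global_step)
    # Validation advances the counter but reports no loss, which shows
    # up as an empty stretch in the per-iteration plot.
    global_step += VAL_STEPS

print(loss_iterations[-1], global_step)  # 290 vs. 333 after 3 epochs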

To narrow this down, you can set a larger validation_period_during_training: 500 and rerun; with validation running far less often, the gaps should become much rarer if validation is the cause. Then check the graph again.

I have rerun with validation_period_during_training: 500, but I am still getting the same behavior. Attached is the output log.
Log - c7d06c66bc4247598f3aead3b4726dce.txt (106.1 KB)

Do you mean the same empty iterations?

Could you run with 1 gpu instead of 4 gpus?

Yes, by ‘same behavior’ I meant the same empty or skipped iterations.

I have run the training with 1 GPU:

!tao model faster_rcnn train -e $SPECS_DIR/specs.txt \
                             --gpus 1 \
                             -r /workspace/tao-experiments/faster_rcnn

and I am still getting the same behavior.

I suggest you run the official notebook with the public KITTI dataset to check whether the issue can be reproduced.

Hi Morganh,

I’ve run the official notebook, and the issue is actually reproducible with the notebook and specs as-is.

It seems the behavior is expected, since the evaluation stage is running during these iterations.

Is there any reason why this behavior would be implemented for FasterRCNN but not for other networks?

During evaluation, there is no update to dense_class_td_loss.
During training, dense_class_td_loss is updated.
So, the iterations you observed should be in the evaluation stage.
Other networks also have this functionality of running evaluation during training.
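If the gaps are only a plotting nuisance, one workaround is to export the scalars and drop the evaluation-only rows before plotting. A minimal sketch, assuming a CSV export with “iteration” and “dense_class_td_loss” columns (the file name and column names are assumptions, not a documented export format):

import pandas as pd
import matplotlib.pyplot as plt

# "scalars.csv" and the column names are assumed; adjust to the real export.
df = pd.read_csv("scalars.csv")
train_only = df.dropna(subset=["dense_class_td_loss"])  # drop eval-only rows

plt.plot(train_only["iteration"], train_only["dense_class_td_loss"])
plt.xlabel("iteration")
plt.ylabel("dense_class_td_loss")
plt.show()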
