Re_identification_net in TAO 5.3.0: checkpoint_interval configuration is not respected (no checkpoints / missed checkpoints)

When checkpoint_interval=10 is specified, no checkpoints are saved when training the NVIDIA TAO 5.3 ReIdentification model.
When the checkpoint interval is set to 1, training does generate checkpoints. However, a value of 1 is undesirable because it saves far too many files.

When checkpoint_interval=5 is specified, only some of the expected checkpoints are saved. Following is an example:

In the TAO documentation, checkpoint_interval is defined as "the interval at which the checkpoints are saved," and no other explanation is provided.

Can you please explain how it is determined which epochs are saved, and how we should set checkpoint_interval in the training configuration to achieve predictable checkpoints?
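As a working hypothesis (not confirmed by the TAO docs): TAO's ReIdentificationNet trainer passes checkpoint_interval to PyTorch Lightning's ModelCheckpoint as every_n_epochs, and by default ModelCheckpoint saves at the end of validation. If validation itself only runs every N epochs, a checkpoint is written only on epochs where both schedules line up, which would explain the missing checkpoints. The helper below models that interaction; the function name and the val_interval parameter are mine, for illustration only.

```python
def saved_epochs(total_epochs, checkpoint_interval, val_interval=1):
    """Model which 0-based epochs get a checkpoint, assuming (hypothetically)
    that saving happens at validation end: a checkpoint is written only when
    validation runs that epoch AND the epoch hits the checkpoint interval."""
    saved = []
    for epoch in range(total_epochs):
        runs_validation = (epoch + 1) % val_interval == 0
        interval_hit = (epoch + 1) % checkpoint_interval == 0
        if runs_validation and interval_hit:
            saved.append(epoch)
    return saved

# Aligned schedules: checkpoints appear every 10 epochs as expected.
print(saved_epochs(30, checkpoint_interval=10, val_interval=10))  # [9, 19, 29]

# Misaligned schedules (interval 10, validation every 4 epochs): checkpoints
# appear only where both coincide, i.e. every 20 epochs.
print(saved_epochs(40, checkpoint_interval=10, val_interval=4))   # [19, 39]
```

Under this model, "predictable" checkpoints require the validation interval to divide checkpoint_interval evenly.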

A similar issue with no solution (with checkpoint_interval=10): Re_identification_net in TAO 5.3.0 does not generate checkpoints

May I know the total number of epochs you are going to train? Could you please share the training YAML file?

Can you add one line in train.py as below and retry?

checkpoint_callback = ModelCheckpoint(every_n_epochs=checkpoint_interval,
                                      dirpath=results_dir,
                                      save_on_train_epoch_end=True,   #added this line
                                      monitor=None,
                                      save_top_k=-1,
                                      filename='reid_model_{epoch:03d}')

You can run inside the docker and modify the code. The code above is in tao_pytorch_backend/nvidia_tao_pytorch/cv/re_identification/scripts/train.py in the NVIDIA/tao_pytorch_backend repository on GitHub (commit 9c2d94c0635b1117edfea85a94a6e3d0ead53754).
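My understanding of why the one-line change helps (an assumption, not an official explanation): save_on_train_epoch_end=True moves the save to the end of every training epoch, so whether validation ran that epoch no longer matters; only the interval check gates the save. A minimal model of the resulting schedule, using a hypothetical helper name:

```python
def saved_epochs_train_end(total_epochs, checkpoint_interval):
    """Model the epochs saved when checkpointing runs at train-epoch end:
    only the interval condition applies, independent of validation."""
    return [epoch for epoch in range(total_epochs)
            if (epoch + 1) % checkpoint_interval == 0]

# With save_on_train_epoch_end=True, checkpoint_interval=10 should now yield
# a checkpoint every 10th epoch regardless of the validation schedule.
print(saved_epochs_train_end(40, checkpoint_interval=10))  # [9, 19, 29, 39]
```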

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.