When checkpoint_interval=10 is specified, no checkpoints are saved while training the NVIDIA TAO 5.3 ReIdentification model.
When the checkpoint interval is set to 1, training generates checkpoints. However, it’s undesirable to use 1 because it saves too many files.
When checkpoint_interval=5 is specified, only some checkpoints are saved. Following is an example:
In the TAO documentation, checkpoint_interval is defined only as the interval at which checkpoints are saved; no further explanation is provided.
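Read literally, that definition suggests a checkpoint at every epoch that is a multiple of the interval. The sketch below only illustrates that reading, so the expected epochs can be compared with the files that actually appear. The helper name `expected_checkpoint_epochs` and the 1-based epoch numbering are assumptions made for illustration; this is not TAO code and does not describe how TAO actually decides which epochs to save.

```python
from typing import List


def expected_checkpoint_epochs(num_epochs: int, checkpoint_interval: int) -> List[int]:
    """Return the 1-based epochs at which a checkpoint would be expected
    under the literal reading "save every `checkpoint_interval` epochs".

    Hypothetical helper for comparison only; not part of TAO.
    """
    if checkpoint_interval < 1:
        raise ValueError("checkpoint_interval must be >= 1")
    # Keep every epoch number that is an exact multiple of the interval.
    return [
        epoch
        for epoch in range(1, num_epochs + 1)
        if epoch % checkpoint_interval == 0
    ]


if __name__ == "__main__":
    # Example: a 30-epoch run with the intervals mentioned in this post.
    for interval in (1, 5, 10):
        print(interval, expected_checkpoint_epochs(30, interval))
```

Comparing this list with the checkpoint files present in the results directory shows exactly which expected epochs are missing for a given interval.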
Can you please explain how TAO determines which epochs are saved, and how we should choose checkpoint_interval in the training configuration so that checkpoints are saved predictably?
There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.