Model_final checkpoint clarification

Hello,

I’ve noticed some odd behaviour with the model_final checkpoint files: sometimes they are written and sometimes they are not. Are these files only written when training stops by reaching the max iterations? Should they be written if training is manually stopped as well?

Thanks,

Kyle

Hi Kyle,

Thanks for your interest in Clara Train. You are correct that the model_final checkpoint is written only when training completes successfully. If training is interrupted, this file is not created, although model_ckpt files may exist from intermediate validation.

You can find more on this here, including the different uses for ckpt and final models:
https://docs.nvidia.com/clara/tlt-mi/nvmidl/mmar.html?highlight=model_final#using-model-ckpt-or-model-final-ckpt

Thanks,
Kris

As a follow-up to this: is there a way to save the model checkpoint files every x iterations (in addition to when a new best validation metric is found)? I haven’t been able to find anything like this in the documentation.

Thanks!

Kyle

Hi Kyle,

Sorry for the delay on this. For Clara Train v4.0 MMARs (based on MONAI), you can specify the save interval in the CheckpointSaver handler through its save_interval argument.
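
As a rough sketch only: the entry below shows what a CheckpointSaver handler with a save interval might look like. The argument names save_interval, epoch_level, save_final, save_dir and save_dict come from MONAI's CheckpointSaver handler. The "name"/"args" layout, the list form of save_dict, the save_dir path and the interval of 10 are assumptions for illustration, so check them against the config_train.json in your own MMAR.

```python
import json

# Hypothetical handler entry for the train "handlers" list in config_train.json.
# The name/args layout and all values are illustrative assumptions.
checkpoint_saver = {
    "name": "CheckpointSaver",
    "args": {
        "save_dir": "/path/to/mmar/models",  # placeholder path
        "save_dict": ["model"],               # assumed list form; which objects to save
        "save_final": True,                   # write a final checkpoint when training completes
        "save_interval": 10,                  # save every 10 steps (0 disables interval saving)
        "epoch_level": False,                 # False counts iterations, True counts epochs
    },
}

# Print the entry as JSON so it can be compared against the MMAR config.
print(json.dumps(checkpoint_saver, indent=2))
```

With epoch_level set to true instead, the same save_interval value would be counted in epochs rather than iterations.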

Hope this helps.

-Kris