Specify more frequent checkpoints

Is there a setting to cause more frequent checkpoints (than one per epoch)? I have several million examples in my manifest and would like to be able to preserve training progress and resume before the end of an epoch.

Which network did you run?

The pre-trained model (-m) used was the speechtotext_english_quartznet.tlt file.

I’m not sure if that is what you mean by ‘network’.

Thanks. So you are running tlt speech_to_text.

Ah, yes. tlt speech_to_text finetune

Evaluation on a .ckpt file is not supported yet.
The internal team will check it.

For the tlt speech_to_text finetune task and subtask, I’d like to know if more frequent checkpoints (than once per epoch) can be specified.

Do you mean you want to get more checkpoint files?

Yes, is there a setting for intra-epoch checkpoints? For example, setting something like ckpt_interval_minutes so that checkpoints could be written every N minutes.

From looking at the PyTorch Lightning documentation, there is an option to specify checkpoints every N training steps.
And, looking at the log of the pre-trained model’s experiment configuration, there is a section that could perhaps be modified via the spec YAML, for example:

exp_manager:
  checkpoint_callback_params:
    save_top_k: 5
    every_n_train_steps: 7200 

I’ll test once I have some free GPUs.
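
For context, every_n_train_steps and save_top_k map onto PyTorch Lightning’s ModelCheckpoint callback, which the exp_manager section wraps. Below is a minimal, self-contained Lightning sketch of step-based checkpointing, plus the time-based train_time_interval variant that is closer to the ckpt_interval_minutes idea above. TinyModel and the loader are placeholders, and none of this is confirmed to be exposed through the TLT spec:

import datetime
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

class TinyModel(pl.LightningModule):
    # Stand-in model; the real QuartzNet fine-tuning graph lives inside TLT.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randn(64, 1)),
    batch_size=8,
)

# Step-based: write a checkpoint every 7200 training steps (intra-epoch).
# save_top_k=-1 keeps every checkpoint and avoids needing a monitored metric.
step_ckpt = ModelCheckpoint(
    dirpath="checkpoints/",
    save_top_k=-1,
    every_n_train_steps=7200,
)

# Time-based alternative: write a checkpoint every 30 minutes of training.
time_ckpt = ModelCheckpoint(
    dirpath="checkpoints/",
    save_top_k=-1,
    train_time_interval=datetime.timedelta(minutes=30),
)

trainer = pl.Trainer(max_epochs=1, callbacks=[step_ckpt])  # or [time_ckpt]
trainer.fit(TinyModel(), train_loader)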

Currently, this is not supported.

I gave it a try anyhow; this is the error:

Error merging 'exp_finetune.yaml' with schema                                 
Key 'every_n_train_steps' not in 'CallbackParams'                             
        full_key: exp_manager.checkpoint_callback_params.every_n_train_steps  
        reference_type=Optional[CallbackParams]                               
        object_type=CallbackParams
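
For what it’s worth, that message has the shape of an OmegaConf structured-config merge rejecting a key the schema does not declare (Hydra adds the "Error merging ... with schema" wrapper). A minimal reproduction, where CallbackParams is just an illustrative stand-in for the real TLT schema:

from dataclasses import dataclass
from omegaconf import OmegaConf

@dataclass
class CallbackParams:
    # Only fields declared here are accepted when merging against the schema.
    save_top_k: int = 1

schema = OmegaConf.structured(CallbackParams)
user_cfg = OmegaConf.create({"every_n_train_steps": 7200})

# Raises a ConfigKeyError along the lines of:
#   Key 'every_n_train_steps' not in 'CallbackParams'
OmegaConf.merge(schema, user_cfg)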

The next TLT release will support saving the .tlt model every N training steps.
