ASR Jasper TLT model finetuning stops at epoch 0

I am finetuning the Jasper ASR model with Indian-accented data. Training starts but stops immediately after the first step of epoch 0. There are no errors in the logs. Please help debug the issue so that training can complete.

Hi @shilpa.suresh
Please refer to the link below in case it helps:
https://docs.nvidia.com/metropolis/TLT/tlt-user-guide/text/asr/speech_recognition.html#id10

Thanks

Hi @SunilJB ,
I have followed all the guidelines in the link you gave, and I am using the finetune.yaml as specified in the documentation (I replaced the Russian letters with English letters and set change_vocabulary to false). The training and validation manifests are also set.
I am running this command:

tlt speech_to_text finetune \
    -e /specs/finetune.yaml \
    -g 1 \
    -k $KEY \
    -m /results/speechtotext_english_jasper_vtrainable_v1.2/speechtotext_english_jasper.tlt \
    -r /results/model \
    finetuning_ds.manifest_filepath=/data/train_total.json \
    validation_ds.manifest_filepath=/data/dev/dev_com.json \
    trainer.max_epochs=3 \
    finetuning_ds.num_workers=20 \
    validation_ds.num_workers=20 \
    trainer.gpus=1

There are no errors logged. The training stops at the 0th step showing:

Epoch 0, global step 0: val_loss reached 34.24907 (best 34.24907), saving model to “/results/model/checkpoints/finetuned-model—val_loss=34.25-epoch=0.ckpt” as top 3

Please help us figure out the mistake. I also have a small question: is the model (.tlt) file enough for finetuning, or do you need both the model (.tlt) file and a checkpoint (.ckpt) file?

@SunilJB +1 on this issue

Hi @shilpa.suresh @shiva4
We are looking into it and will keep you posted on any updates.

Thanks

Hi @shilpa.suresh

The message
"Epoch 0, global step 0: val_loss reached 34.24907 (best 34.24907), saving model to “/results/model/checkpoints/finetuned-model—val_loss=34.25-epoch=0.ckpt” as top 3"
doesn’t mean that training ended. It only reports that one of the intermediate checkpoints saved during training has been written.

After training is done, the script will output something like:
“Trained model saved to ‘xxx/checkpoints/trained_model.tlt’”
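One way to tell the two cases apart is to search the captured training output for that completion message. The snippet below is only a rough sketch: LOG_FILE and its default name train.log are hypothetical, and it assumes the message text matches the example quoted above.

```shell
#!/bin/sh
# Sketch: check whether a training log contains the completion message.
# LOG_FILE is a hypothetical path; point it at your own captured output.
training_finished() {
    # -q: print nothing, -F: match the text literally (no regex)
    grep -qF "Trained model saved to" "$1"
}

LOG_FILE="${LOG_FILE:-train.log}"
if [ -f "$LOG_FILE" ] && training_finished "$LOG_FILE"; then
    echo "training completed"
else
    echo "no completion message found"
fi
```

If only the "saving model to ... as top 3" lines appear, the run saved an intermediate checkpoint but did not report a finished model.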

Honestly, this looks weird. I looked at the attached screenshot and see no error logs, just Docker stopping the container.

Can you please attach the full log? Thanks

Hi,
Please find the logs attached.

Thanks
cmd-args.log (498 Bytes)
lightning_logs.txt (41.1 KB)
nemo_error_log.txt (4.4 KB)
nemo_log_globalrank-0_localrank-0.txt (9.3 KB)

Adding
@erlee
@shrinidhi10

Hi @shilpa.suresh ,

If you run the training and it completes, then running it a second time with the same -r (results_dir) argument makes it try to resume. It sees that you have already finished 3 epochs and stops training. So this is expected if the results directory already exists. If you do want to run it again, remove the previous results directory before running TLT training.
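A cautious variant of that cleanup is to rename the old results directory instead of deleting it, so earlier checkpoints stay available. This is only a sketch under assumptions: RESULTS_DIR is a hypothetical variable that should hold the host-side directory corresponding to the -r argument (/results/model in the command above), and the helper name is made up for illustration.

```shell
#!/bin/sh
# Sketch: move a previous results directory aside so the next run starts fresh.
# RESULTS_DIR is an assumption: the host-side directory behind the -r argument.
archive_results_dir() {
    dir="${1%/}"    # drop a trailing slash, if any
    if [ -d "$dir" ]; then
        backup="${dir}.bak.$(date +%Y%m%d%H%M%S)"
        mv "$dir" "$backup" && echo "moved $dir to $backup"
    else
        echo "nothing to archive at $dir"
    fi
}

# Only act when RESULTS_DIR is set explicitly, to avoid moving the wrong directory.
if [ -n "${RESULTS_DIR:-}" ]; then
    archive_results_dir "$RESULTS_DIR"
fi
```

After the old directory has been moved, the same finetune command can be run again and it should start from the first epoch rather than resuming.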