• Hardware (Tesla T4)
• Network Type (speech_to_text)
• TLT Version (tlt_version: 3.0 | docker_tag: v3.0-py3)
• Training spec file (default spec given in the training notebook)
• How to reproduce the issue? (Not able to initiate the finetuning process from the pretrained jasper10x5 .nemo model)
There is no issue when loading a .tlt model. But when loading a .nemo file as the pretrained file, there is a compatibility issue in tlt-pytorch.
As a workaround, please untar the .nemo file, then modify /opt/conda/lib/python3.8/site-packages/nemo/core/classes/modelPT.py as below.
@classmethod
def restore_from(...):
    ...
    if _EFF_PRESENT_:
        try:
            # return cls._eff_restore_from(restore_path, override_config_path, map_location, strict)
            return cls._default_restore_from(restore_path, override_config_path, map_location, strict)
    ...

@classmethod
def _default_restore_from(...):
    ...
    # model_weights = path.join(tmpdir, _MODEL_WEIGHTS)
    model_weights = "/nemo_untar/model_weights.ckpt"
    ...
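If it helps, here is a minimal sketch of the untar step using Python's tarfile module. The .nemo path is an assumption (adjust it to wherever the pretrained checkpoint was downloaded); the /nemo_untar target matches the path hard-coded above.

import tarfile

# Assumed location of the downloaded pretrained .nemo checkpoint; adjust to your setup.
nemo_path = "/data/speechtotext_english_jasper.nemo"

# A .nemo file is a tar archive containing model_config.yaml and model_weights.ckpt;
# extract it into /nemo_untar so the hard-coded path above resolves.
with tarfile.open(nemo_path, "r:*") as tar:
    tar.extractall(path="/nemo_untar")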
Thanks, will try this out!
I tried this, but untarring the .nemo file only gave me the model_config.yaml and model_weights.ckpt files and nothing else. So I am not sure where to make the above-mentioned changes. Please help! Thanks
Yes, replace the model_weights path in the code with the path to that untarred model_weights.ckpt.
I will have to run the tlt docker container to access the modelPT.py file, right? I am not able to find it on my host machine.
Yes, you can log in to the tlt docker directly and run training/evaluation/etc.
$ tlt speech_to_text_citrinet run /bin/bash
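Once inside the container, the modelPT.py file from the workaround above can be edited in place, assuming an editor such as vi is available there:
$ vi /opt/conda/lib/python3.8/site-packages/nemo/core/classes/modelPT.py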
Thanks, this worked for me! Is it possible to visualize the val_loss, epochs, and val_wer metrics on TensorBoard? Currently I am able to get TensorBoard working with only these 4 metrics: hp_metric, learning_rate, train_loss, and training_batch_wer.
The val_loss, epochs, and val_wer are shown in the training log.
Yes, I was able to see them in the logs, but I just wanted to visualize them better on TensorBoard! Anyway, do let me know if there is any workaround for this! Thanks
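As a rough stopgap (not an official feature), those values can be scraped from the training log and printed or plotted separately. A minimal Python sketch, assuming the log sits under the results directory and matching the "val_loss reached ..." lines that the trainer prints:

import re

# Assumed path to the finetune training log; adjust to wherever your log is written.
log_path = "/results/jasper/finetune/finetune.log"

# Matches lines such as:
#   Epoch 0, global step 1637: val_loss reached 7.74545 (best 7.74545), saving model to ...
pattern = re.compile(r"Epoch (\d+), global step (\d+): (val_\w+) reached ([\d.]+)")

with open(log_path) as f:
    for line in f:
        m = pattern.search(line)
        if m:
            epoch, step, name, value = m.groups()
            print(f"epoch={epoch} step={step} {name}={value}")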
Strangely, it's shutting down the container after epoch 0, printing the log line:
Epoch 0, global step 1637: val_loss reached 7.74545 (best 7.74545), saving model to "/results/jasper/finetune/checkpoints/finetuned-model---val_loss=7.75-epoch=0.ckpt" as top 3
but I wasn't able to locate the finetuned model in that path.
The tlt model should be available in your /results/jasper/finetune/ directory.
Now I am able to locate the finetuned model! But the finetuning is not moving past epoch 0, even after setting max_epochs: 50 in finetune.yaml.
Please resume training from the existing checkpoints. Just run the same command.
I have tried running the same command multiple times, but it's not going past epoch 0, and strangely the finetuned model is also not getting saved into the checkpoints folder.
Epoch 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 6066/6066 [3:28:17<00:00, 2.06s/it, loss=48.9, v_num=]
Epoch 0, global step 4548: val_wer reached 0.32157 (best 0.32157), saving model to "/results/jasper/finetune/checkpoints/finetuned-model---val_wer=0.32-epoch=0.ckpt" as top 3
Epoch 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 6066/6066 [3:28:40<00:00, 2.06s/it, loss=48.9, v_num=]
2021-08-25 17:09:17,865 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
Can you please explain why this could happen?
PS: This is happening only with Jasper. I have tried reducing my training and validation workers to 1 each, but it still does not go past epoch 0.
Can you share your command and spec file?
CMD: tlt speech_to_text finetune -e $SPECS_DIR/speech_to_text/finetune.yaml -g 1 -k $KEY -m $RESULTS_DIR/jasper/train/checkpoints/trained-model.tlt -r $RESULTS_DIR/jasper/finetune finetuning_ds.manifest_filepath=$DATA_DIR/small_train.json validation_ds.manifest_filepath=$DATA_DIR/small_test.json finetuning_ds.num_workers=1 validation_ds.num_workers=1
SPEC FILE:
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
# TLT spec file for fine-tuning a previously trained ASR model (Jasper or QuartzNet) on the MCV Russian dataset.

exp_manager:
  create_tensorboard_logger: true

trainer:
  max_epochs: 50  # This is low for demo purposes

# Whether or not to change the decoder vocabulary.
# Note that this MUST be set if the labels change, e.g. to a different language's character set
# or if additional punctuation characters are added.
change_vocabulary: false

# Fine-tuning settings: training dataset
finetuning_ds:
  manifest_filepath: ???
  sample_rate: 16000
  labels: [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
           'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', "'"]
  batch_size: 8
  trim_silence: true
  max_duration: 16.7
  shuffle: true
  is_tarred: false
  tarred_audio_filepaths: null

# Fine-tuning settings: validation dataset
validation_ds:
  manifest_filepath: ???
  sample_rate: 16000
  labels: [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
           'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', "'"]
  batch_size: 8
  shuffle: false
  max_duration: 16.7

# Fine-tuning settings: optimizer
optim:
  name: novograd
  lr: 0.001
PS: model used for fine tuning - https://api.ngc.nvidia.com/v2/models/nvidia/tlt-riva/speechtotext_english_jasper/versions/trainable_v1.2/files/speechtotext_english_jasper.tlt
Please add the below to your command and retry.
trainer.max_epochs=50
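For example, appending the override to your earlier command would look something like this (same paths and flags as before):
tlt speech_to_text finetune -e $SPECS_DIR/speech_to_text/finetune.yaml -g 1 -k $KEY -m $RESULTS_DIR/jasper/train/checkpoints/trained-model.tlt -r $RESULTS_DIR/jasper/finetune finetuning_ds.manifest_filepath=$DATA_DIR/small_train.json validation_ds.manifest_filepath=$DATA_DIR/small_test.json finetuning_ds.num_workers=1 validation_ds.num_workers=1 trainer.max_epochs=50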
I have added trainer.min_epochs=3 trainer.max_epochs=5 to my command and tried multiple times, but it still shuts down the container immediately after epoch 0, and the finetuned model after the first epoch is also not getting saved into the checkpoints folder, even though there are no errors.
The tlt model is not saved in the checkpoints folder. It is saved in the parent folder of checkpoints. Just add trainer.max_epochs=5; it is not needed to set trainer.min_epochs=3.