[TLT3.0][RIVA][Jasper] KeyError manifest.yaml not found

@SunilJB @Morganh

• Hardware (Tesla T4)
• Network Type (speech_to_text)
• TLT Version (tlt_version: 3.0 | docker_tag: v3.0-py3)
• Training spec file (default spec given in the training notebook)
• How to reproduce the issue? (Not able to initiate the fine-tuning process from the pretrained jasper10x5 .nemo model)

There is no issue when loading a tlt model. But when loading a .nemo file as the pretrained file, there is a compatibility issue in tlt-pytorch.
As a workaround, please untar the .nemo file, then modify /opt/conda/lib/python3.8/site-packages/nemo/core/classes/modelPT.py as below.

@classmethod
def restore_from(
    ...
    if _EFF_PRESENT_:
        try:
            # Change 1: bypass the EFF restore path and always fall back to the default restore.
            # return cls._eff_restore_from(restore_path, override_config_path, map_location, strict)
            return cls._default_restore_from(restore_path, override_config_path, map_location, strict)

@classmethod
def _default_restore_from(
    ...
    # Change 2: point model_weights at the checkpoint extracted from the untarred .nemo file.
    # model_weights = path.join(tmpdir, _MODEL_WEIGHTS)
    model_weights = "/nemo_untar/model_weights.ckpt"

Thanks, will try this out!

I tried this, but untarring the .nemo file only gave me the model_config.yaml and model_weights.ckpt files and nothing else, so I am not sure where to make the above-mentioned changes. Please help! Thanks

Yes, in the code, replace the model_weights path with the path to your extracted model_weights.ckpt.

I will have to run the tlt docker container to access the modelPT.py file, right? I am not able to find it on my host machine.

Yes, you can log in to the tlt docker directly and run training/evaluation/etc.

$ tlt speech_to_text_citrinet run /bin/bash


Thanks, this worked for me! Is it possible to visualize the val_loss, epochs and val_wer metrics on TensorBoard? Currently I am able to get TensorBoard working with only these four metrics: hp_metric, learning_rate, train_loss, training_batch_wer.

The val_loss, epochs and val_wer are shown in the training log.


Yes, I was able to see them in the logs, but I just wanted to visualize them better on TensorBoard! Anyway, do let me know if there is any workaround for this! Thanks
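
I am not aware of a built-in TLT option for this, but a rough, unofficial workaround is to scrape the val_loss / val_wer lines out of the training log and re-log them with torch.utils.tensorboard. The log path, output directory and regex below are assumptions based on the log lines posted in this thread.

import re
from torch.utils.tensorboard import SummaryWriter

LOG_FILE = "/results/jasper/finetune/log.txt"  # assumed location of the training log
PATTERN = re.compile(r"Epoch \d+, global step (\d+): (val_\w+) reached ([\d.]+)")

writer = SummaryWriter("/results/jasper/finetune/tb_val_metrics")  # hypothetical output dir
with open(LOG_FILE) as f:
    for line in f:
        m = PATTERN.search(line)
        if m:
            step, metric, value = int(m.group(1)), m.group(2), float(m.group(3))
            writer.add_scalar(metric, value, global_step=step)
writer.close()

Point TensorBoard at that output directory afterwards to see the scraped curves alongside the built-in metrics.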

Strangely, it is shutting down the container after epoch 0, printing the log line:
Epoch 0, global step 1637: val_loss reached 7.74545 (best 7.74545), saving model to "/results/jasper/finetune/checkpoints/finetuned-model---val_loss=7.75-epoch=0.ckpt" as top 3
However, I wasn't able to locate the fine-tuned model at that path.

The tlt model should be available in your /results/jasper/finetune/ directory.

Now I am able to locate the fine-tuned model! But the fine-tuning is not moving past epoch 0, even after setting max_epochs: 50 in finetune.yaml.

Please resume training from the existing checkpoints. Just run the same command.


I have tried running the same command multiple times, but it is not going past epoch 0, and strangely the fine-tuned model is also not getting saved into the checkpoints folder.

Epoch 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 6066/6066 [3:28:17<00:00,  2.06s/it, loss=48.9, v_num=]
Epoch 0, global step 4548: val_wer reached 0.32157 (best 0.32157), saving model to "/results/jasper/finetune/checkpoints/finetuned-model---val_wer=0.32-epoch=0.ckpt" as top 3
Epoch 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 6066/6066 [3:28:40<00:00,  2.06s/it, loss=48.9, v_num=]
2021-08-25 17:09:17,865 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Can you please explain why this could happen?

PS: This is happening only with Jasper. I have tried reducing my training and validation workers to 1 each, but it still does not go past epoch 0.

Can you share your command and spec file?

CMD: tlt speech_to_text finetune -e $SPECS_DIR/speech_to_text/finetune.yaml -g 1 -k $KEY -m $RESULTS_DIR/jasper/train/checkpoints/trained-model.tlt -r $RESULTS_DIR/jasper/finetune finetuning_ds.manifest_filepath=$DATA_DIR/small_train.json validation_ds.manifest_filepath=$DATA_DIR/small_test.json finetuning_ds.num_workers=1 validation_ds.num_workers=1

SPEC FILE:

# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
# TLT spec file for fine-tuning a previously trained ASR model (Jasper or QuartzNet) on the MCV Russian dataset.

exp_manager:
  create_tensorboard_logger: true

trainer:
  max_epochs: 50   # This is low for demo purposes

# Whether or not to change the decoder vocabulary.
# Note that this MUST be set if the labels change, e.g. to a different language's character set
# or if additional punctuation characters are added.
change_vocabulary: false

# Fine-tuning settings: training dataset
finetuning_ds:
  manifest_filepath: ???
  sample_rate: 16000
  labels: [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
           'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', "'"]
  batch_size: 8
  trim_silence: true
  max_duration: 16.7
  shuffle: true
  is_tarred: false
  tarred_audio_filepaths: null

# Fine-tuning settings: validation dataset
validation_ds:
  manifest_filepath: ???
  sample_rate: 16000
  labels: [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
           'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', "'"]
  batch_size: 8
  shuffle: false
  max_duration: 16.7

# Fine-tuning settings: optimizer
optim:
  name: novograd
  lr: 0.001

PS: Model used for fine-tuning: https://api.ngc.nvidia.com/v2/models/nvidia/tlt-riva/speechtotext_english_jasper/versions/trainable_v1.2/files/speechtotext_english_jasper.tlt

Please add the below to your command and retry.
trainer.max_epochs=50
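
For example, appending it to the command you posted above, the full fine-tuning command becomes:

tlt speech_to_text finetune -e $SPECS_DIR/speech_to_text/finetune.yaml -g 1 -k $KEY -m $RESULTS_DIR/jasper/train/checkpoints/trained-model.tlt -r $RESULTS_DIR/jasper/finetune finetuning_ds.manifest_filepath=$DATA_DIR/small_train.json validation_ds.manifest_filepath=$DATA_DIR/small_test.json finetuning_ds.num_workers=1 validation_ds.num_workers=1 trainer.max_epochs=50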

I have added trainer.min_epochs=3 trainer.max_epochs=5 to my command and tried multiple times, but it still shuts down the container immediately after epoch 0, and the fine-tuned model from the first epoch is also not getting saved into the checkpoints folder, even though there are no errors.

The tlt model is not saved in the checkpoints folder; it is saved in the parent folder of checkpoints.
Just add trainer.max_epochs=5; it is not necessary to set trainer.min_epochs=3.