[TLT3.0][RIVA][Jasper] KeyError manifest.yaml not found

@SunilJB @Morganh

• Hardware (Tesla T4)
• Network Type (speech_to_text)
• TLT Version (tlt_version: 3.0 | docker_tag: v3.0-py3)
• Training spec file (default spec given in the training notebook)
• How to reproduce the issue? (Not able to initiate the fine-tuning process from the pretrained jasper10x5 .nemo model)

There is no issue when loading a tlt model. But when loading a .nemo file as the pretrained file, there is a compatibility issue in tlt-pytorch.
As a workaround, please untar the .nemo file, then modify /opt/conda/lib/python3.8/site-packages/nemo/core/classes/modelPT.py as below.

@classmethod
def restore_from(
    ...
    if _EFF_PRESENT_:
        try:
            # Change 1: bypass the EFF restore path and always fall back to the default restore.
            # return cls._eff_restore_from(restore_path, override_config_path, map_location, strict)
            return cls._default_restore_from(restore_path, override_config_path, map_location, strict)

@classmethod
def _default_restore_from(
    ...
    # Change 2: point model_weights at the checkpoint extracted from the untarred .nemo file.
    # model_weights = path.join(tmpdir, _MODEL_WEIGHTS)
    model_weights = "/nemo_untar/model_weights.ckpt"

Thanks, will try this out!

I tried this, but untarring the .nemo file only gave me the model_config.yaml and model_weights.ckpt files and nothing else, so I am not sure where to make the above-mentioned changes. Please help! Thanks

Yes, in the code, replace the model_weights path with the path to your extracted model_weights.ckpt.

I will have to run the tlt docker container to access the modelPT.py file, right? I am not able to find it on my host machine.

Yes, you can log in to the tlt docker directly and run training/evaluation/etc.

$ tlt speech_to_text_citrinet run /bin/bash


Thanks, this worked for me! Is it possible to visualize the val_loss, epochs and val_wer metrics on TensorBoard? Currently I am able to get TensorBoard working with only these four metrics: hp_metric, learning_rate, train_loss, training_batch_wer.

The val_loss, epochs and val_wer are shown in the training log.


Yes, I was able to see them in the logs, but I just wanted to visualize them better on TensorBoard! Anyway, do let me know if there is any workaround for this! Thanks
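
I am not aware of a built-in TLT option for this, but a rough, unofficial workaround is to scrape the val_loss / val_wer lines out of the training log and re-log them with torch.utils.tensorboard. The log path, output directory and regex below are assumptions based on the log lines posted in this thread.

import re
from torch.utils.tensorboard import SummaryWriter

LOG_FILE = "/results/jasper/finetune/log.txt"  # assumed location of the training log
PATTERN = re.compile(r"Epoch \d+, global step (\d+): (val_\w+) reached ([\d.]+)")

writer = SummaryWriter("/results/jasper/finetune/tb_val_metrics")  # hypothetical output dir
with open(LOG_FILE) as f:
    for line in f:
        m = PATTERN.search(line)
        if m:
            step, metric, value = int(m.group(1)), m.group(2), float(m.group(3))
            writer.add_scalar(metric, value, global_step=step)
writer.close()

Point TensorBoard at that output directory afterwards to see the scraped curves alongside the built-in metrics.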

Strangely, it is shutting down the container after epoch 0, printing the log line:
Epoch 0, global step 1637: val_loss reached 7.74545 (best 7.74545), saving model to "/results/jasper/finetune/checkpoints/finetuned-model---val_loss=7.75-epoch=0.ckpt" as top 3
However, I wasn't able to locate the fine-tuned model at that path.

The tlt model should be available in your /results/jasper/finetune/ directory.

Now I am able to locate the fine-tuned model! But the fine-tuning is not moving past epoch 0, even after setting max_epochs: 50 in finetune.yaml.

Please resume training from the existing checkpoints. Just run the same command.


I have tried running the same command multiple times, but it is not going past epoch 0, and strangely the fine-tuned model is also not getting saved into the checkpoints folder.

Epoch 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 6066/6066 [3:28:17<00:00,  2.06s/it, loss=48.9, v_num=]
Epoch 0, global step 4548: val_wer reached 0.32157 (best 0.32157), saving model to "/results/jasper/finetune/checkpoints/finetuned-model---val_wer=0.32-epoch=0.ckpt" as top 3
Epoch 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 6066/6066 [3:28:40<00:00,  2.06s/it, loss=48.9, v_num=]
2021-08-25 17:09:17,865 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Can you please explain why this could happen?

PS: This is happening only with Jasper. I have tried reducing my training and validation workers to 1 each, but it still does not go past epoch 0.

Can you share your command and spec file?

CMD: tlt speech_to_text finetune -e $SPECS_DIR/speech_to_text/finetune.yaml -g 1 -k $KEY -m $RESULTS_DIR/jasper/train/checkpoints/trained-model.tlt -r $RESULTS_DIR/jasper/finetune finetuning_ds.manifest_filepath=$DATA_DIR/small_train.json validation_ds.manifest_filepath=$DATA_DIR/small_test.json finetuning_ds.num_workers=1 validation_ds.num_workers=1

SPEC FILE:

# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
# TLT spec file for fine-tuning a previously trained ASR model (Jasper or QuartzNet) on the MCV Russian dataset.

exp_manager:
  create_tensorboard_logger: true

trainer:
  max_epochs: 50   # This is low for demo purposes

# Whether or not to change the decoder vocabulary.
# Note that this MUST be set if the labels change, e.g. to a different language's character set
# or if additional punctuation characters are added.
change_vocabulary: false

# Fine-tuning settings: training dataset
finetuning_ds:
  manifest_filepath: ???
  sample_rate: 16000
  labels: [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
           'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', "'"]
  batch_size: 8
  trim_silence: true
  max_duration: 16.7
  shuffle: true
  is_tarred: false
  tarred_audio_filepaths: null

# Fine-tuning settings: validation dataset
validation_ds:
  manifest_filepath: ???
  sample_rate: 16000
  labels: [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
           'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', "'"]
  batch_size: 8
  shuffle: false
  max_duration: 16.7

# Fine-tuning settings: optimizer
optim:
  name: novograd
  lr: 0.001

PS: Model used for fine-tuning: https://api.ngc.nvidia.com/v2/models/nvidia/tlt-riva/speechtotext_english_jasper/versions/trainable_v1.2/files/speechtotext_english_jasper.tlt

Please add the below to your command and retry.
trainer.max_epochs=50
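
For example, appending it to the command you posted above, the full fine-tuning command becomes:

tlt speech_to_text finetune -e $SPECS_DIR/speech_to_text/finetune.yaml -g 1 -k $KEY -m $RESULTS_DIR/jasper/train/checkpoints/trained-model.tlt -r $RESULTS_DIR/jasper/finetune finetuning_ds.manifest_filepath=$DATA_DIR/small_train.json validation_ds.manifest_filepath=$DATA_DIR/small_test.json finetuning_ds.num_workers=1 validation_ds.num_workers=1 trainer.max_epochs=50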

I have added trainer.min_epochs=3 trainer.max_epochs=5 to my command and tried multiple times, but it still shuts down the container immediately after epoch 0, and the fine-tuned model from the first epoch is also not getting saved into the checkpoints folder, even though there are no errors.

The tlt model is not saved in the checkpoints folder; it is saved in the parent folder of checkpoints.
Just add trainer.max_epochs=5; it is not necessary to set trainer.min_epochs=3.