TAO speech_to_text evaluate and infer show very weak results

I thought the QuartzNet model loaded at the very beginning of the notebook is already a pre-trained model, that additional training is then performed on the an4 training set
(tao speech_to_text train …/an4_converted/train_manifest.json)
and that after that a finetuning can be done with the same an4 training set (or with a different one).
Was I wrong here?

When I use your link in
‘.nemo model here.’
I end up here:

and clicking on ‘ASR with NeMo’ I get
‘Hmm. We’re having trouble finding that site.’
Has the model been moved somewhere else?
Could you recommend which pre-trained model is best to use?

This one, fetched via:
ngc registry model download-version "nvidia/tao/speechtotext_en_us_quartznet:deployable_v1.2"

does not seem to be the right one, since when I use it with the -m option in the finetune step I get the following errors:

[NeMo I 2022-01-31 15:02:34 features:252] PADDING: 16
[NeMo I 2022-01-31 15:02:34 features:269] STFT using torch
Error executing job with overrides: [‘exp_manager.explicit_log_dir=/results/quartznet/finetuneQNus’, ‘trainer.gpus=1’, ‘restore_from=/results/PreTrainedModels/speechtotext_en_us_quartznet_vdeployable_v1.2/quartznet_asr_set_1pt2.riva’, ‘encryption_key=tlt_encode’, ‘finetuning_ds.manifest_filepath=/data/an4_converted/train_manifest.json’, ‘validation_ds.manifest_filepath=/data/an4_converted/test_manifest.json’, ‘trainer.max_epochs=51’, ‘finetuning_ds.num_workers=20’, ‘validation_ds.num_workers=20’, ‘trainer.gpus=1’]
Traceback (most recent call last):
File “/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/tlt_utils/connectors/save_restore_connector.py”, line 77, in restore_from
File “/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/tlt_utils/cookbooks/nemo_cookbook.py”, line 396, in restore_from
TypeError: Archive doesn’t have the required runtime, format, version or object class type

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py”, line 211, in run_and_report
return func()
File “/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py”, line 368, in
lambda: hydra.run(
File “/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py”, line 110, in run
_ = ret.return_value
File “/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py”, line 233, in return_value
raise self._return_value
File “/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py”, line 160, in run_job
ret.return_value = task_function(task_cfg)
File “/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/asr/speech_to_text/scripts/finetune.py”, line 120, in main
File “/opt/conda/lib/python3.8/site-packages/nemo/core/classes/modelPT.py”, line 270, in restore_from
instance = cls._save_restore_connector.restore_from(
File “/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/tlt_utils/connectors/save_restore_connector.py”, line 87, in restore_from
File “/opt/conda/lib/python3.8/site-packages/nemo/core/connectors/save_restore_connector.py”, line 140, in restore_from
self._load_state_dict_from_disk(model_weights, map_location=map_location), strict=strict
File “/opt/conda/lib/python3.8/site-packages/nemo/core/connectors/save_restore_connector.py”, line 390, in _load_state_dict_from_disk
return torch.load(model_weights, map_location=map_location)
File “/opt/conda/lib/python3.8/site-packages/torch/serialization.py”, line 594, in load
with _open_file_like(f, ‘rb’) as opened_file:
File “/opt/conda/lib/python3.8/site-packages/torch/serialization.py”, line 230, in _open_file_like
return _open_file(name_or_buffer, mode)
File “/opt/conda/lib/python3.8/site-packages/torch/serialization.py”, line 211, in init
super(open_file, self).init(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpyx9ge45

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/asr/speech_to_text/scripts/finetune.py”, line 149, in
File “/opt/conda/lib/python3.8/site-packages/nemo/core/config/hydra_runner.py”, line 101, in wrapper
File “/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py”, line 367, in _run_hydra
File “/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py”, line 251, in run_and_report
assert mdl is not None
2022-01-31 16:02:38,213 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I could make it run by first using the pre-trained model in a training step instead of immediately in a finetune step (as done above, which gave the errors above). So I trained for 400 max_epochs on an4_train, but evaluation again showed poor results:
{‘test_loss’: 42.914344787597656, ‘test_wer’: 0.8331177234649658}
and finetuning did not improve it either.

So speechtotext_en_us_quartznet_vdeployable_v1.2/quartznet_asr_set_1pt2.riva does not seem to be the right ‘pre-trained’ model for obtaining better results.
Which one could you recommend?
(or could you tell which models and training data were used in Speech Recognition — TAO Toolkit 3.21.11 documentation to achieve the reported, good inference results:
[NeMo I 2021-01-21 00:22:00 infer:67] The prediction results:
[NeMo I 2021-01-21 00:22:00 infer:69] File: /data/an4/wav/an4test_clstk/fcaw/an406-fcaw-b.wav
[NeMo I 2021-01-21 00:22:00 infer:70] Predicted transcript: rubout g m e f three nine
[NeMo I 2021-01-21 00:22:00 infer:69] File: /data/an4/wav/an4test_clstk/fcaw/an407-fcaw-b.wav
[NeMo I 2021-01-21 00:22:00 infer:70] Predicted transcript: erase c q q f seven

(What I also don’t understand is how these good inference results fit with the reported weak WER:
{‘test_loss’: tensor(68.1998, device=‘cuda:0’), ‘test_wer’: tensor(0.9987, device=‘cuda:0’)} ???

many thanks!
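
For readers comparing the test_wer values in this thread: below is a minimal sketch of the usual word error rate definition (word-level edit distance divided by the number of reference words). It is not the TAO/NeMo implementation, and the function name word_error_rate and the example hypothesis are made up for illustration; only the reference string is taken from the log above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words (classic row-by-row DP).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Reference taken from the log above; the hypothesis is a hypothetical
# recognizer output that drops one word.
reference = "erase c q q f seven"
hypothesis = "erase c q f seven"
print(f"WER = {word_error_rate(reference, hypothesis):.3f}")
```

Under this definition a test_wer of about 0.83 means the number of word edits is roughly 83% of the number of reference words, while about 0.05 means roughly one wrong word in twenty.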

Sorry for the late reply. Could you please use STT En Citrinet 512 | NVIDIA NGC?

Thanks for coming back and for the link. I tried stt_en_citrinet_512.nemo and it seems to work much better; loss and WER on the an4 test set are much lower!
{‘test_loss’: 1.211700439453125, ‘test_wer’: 0.051746442914009094}
But this is just for the original pre-trained model as is. After any training or finetuning step to adapt the model a bit more to the
an4 training set (in most of the training epochs the log says: "val_loss was not in top 3"), the WER gets much worse again:
{‘test_loss’: 45.33321762084961, ‘test_wer’: 0.8331177234649658}
i.e. the training with an4 does more harm than good. Is there something wrong with it? Have you seen that as well? Or do you have any other idea?
many thanks!

Could you create a new result folder (mkdir) and double-check? Thanks.

What exactly do you mean?
I ran all experiments based on stt_en_citrinet_512.nemo in a fresh result directory. These were the 3 eval runs:

UseModel="citrinet512/TrainedWithAn4train/checkpoints/trained-model.tlt" ### after 100 training epochs
OutD="evaluateCN512" ## {'test_loss': 101.45941162109375, 'test_wer': 0.8576973080635071}
UseModel="citrinet512/TrainedWithAn4train/trained-model_epoch_69.tlt" ### after 69 training epochs (loss had its smallest value at epoch 69)
OutD="evaluateCN512_Ep69" ## {'test_loss': 45.33321762084961, 'test_wer': 0.8331177234649658}
UseModel="PreTrainedModels/stt_en_citrinet_512.nemo" ### the original model without any training; this was the best!
OutD="evaluateCN512_Org" ## {'test_loss': 1.211700439453125, 'test_wer': 0.051746442914009094}

!tao speech_to_text evaluate \
    -e $SPECS_DIR/speech_to_text/evaluate.yaml \
    -g 1 \
    -k $KEY \
    -m $RESULTS_DIR/$UseModel

OK, thanks for the confirmation.

Please run the command below to train on the an4 dataset with the .nemo file as the pretrained model.
$ tao speech_to_text finetune xxx -m pretrained_model

Running finetune with the original pre-trained model did NOT work, i.e.
the following


!tao speech_to_text finetune \
    -e $SPECS_DIR/speech_to_text/finetune.yaml \
    -g 1 \
    -k $KEY \
    -m $RESULTS_DIR/$PreTrainedModel

showed an error: TypeError: change_vocabulary() got an unexpected keyword argument ‘new_vocabulary’

But when using this
instead (which was the outcome of the ‘training’ step before) the finetuning runs, producing

when doing evaluation using this model

!tao speech_to_text evaluate \
    -e $SPECS_DIR/speech_to_text/evaluate.yaml \
    -g 1 \
    -k $KEY \
    -m $RESULTS_DIR/$UseModel

shows a WER
{‘test_loss’: 829.0150756835938, ‘test_wer’: 0.7878395915031433}
which is a few points better, but still much worse than the original pre-trained model. I’m currently running more
finetuning epochs, but I always see these messages: Epoch xxxx, global step 1859: val_loss was not in top 3
I fear there will not be much improvement…

Yes, more finetuning epochs did not help; the checkpoint from epoch 2 already has the smallest loss (or am I reading this wrong?):
├── finetuned-model--val_loss=50.38-epoch=2.ckpt
├── finetuned-model--val_loss=51.63-epoch=5.ckpt
├── finetuned-model--val_loss=51.64-epoch=6.ckpt
├── finetuned-model--val_loss=859.96-epoch=99-last.ckpt
└── finetuned-model.tlt
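
To double-check a listing like this without reading it by eye, here is a small sketch that parses val_loss and epoch out of the checkpoint file names and picks the lowest loss. The helper name best_checkpoint and the regular expression are my own; the file names are the ones from the listing, and the pattern only assumes the "val_loss=<number>-epoch=<number>" part of each name.

```python
import re

# Matches the "val_loss=<float>-epoch=<int>" part of a checkpoint file name.
CKPT_PATTERN = re.compile(r"val_loss=(\d+(?:\.\d+)?)-epoch=(\d+)")


def best_checkpoint(filenames):
    """Return (val_loss, epoch, filename) for the lowest validation loss."""
    scored = []
    for name in filenames:
        match = CKPT_PATTERN.search(name)
        if match:  # skip files without a val_loss in the name, e.g. the .tlt
            scored.append((float(match.group(1)), int(match.group(2)), name))
    if not scored:
        raise ValueError("no checkpoint names with a val_loss were found")
    return min(scored)


names = [
    "finetuned-model--val_loss=50.38-epoch=2.ckpt",
    "finetuned-model--val_loss=51.63-epoch=5.ckpt",
    "finetuned-model--val_loss=51.64-epoch=6.ckpt",
    "finetuned-model--val_loss=859.96-epoch=99-last.ckpt",
    "finetuned-model.tlt",
]
loss, epoch, name = best_checkpoint(names)
print(f"lowest val_loss {loss} at epoch {epoch}: {name}")
```

For this listing it confirms the reading above: the epoch 2 checkpoint has the lowest validation loss, and the last checkpoint (epoch 99) is far worse.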

Should I change any of the (default) training/finetuning parameters?

Oops, I wanted to run evaluate with this newly finetuned model, but an error occurs (also when rerunning the models that were OK yesterday):
Docker instantiation failed with error: 500 Server Error: Internal Server Error (“OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown”)

??? I tried a Jupyter restart etc., but it didn’t help.
Any suggestion what to do?

How about running
$ nvidia-container-cli --load-kmods info

Found some hints on the web. After running
sudo apt purge nvidia* libnvidia*
sudo apt install nvidia-driver-470 nvidia-container-toolkit
and a reboot (this seems important!), it works again:

nvidia-container-cli --load-kmods info
NVRM version: 470.103.01
CUDA version: 11.4
Device Index: 0
Device Minor: 0
Model: Tesla V100-PCIE-32GB
Brand: Tesla
GPU UUID: GPU-9aba4cc1-7f31-6888-6c3e-7198186f0848
Bus Location: 00000000:03:00.0
Architecture: 7.0

Running eval on this newly finetuned model (after 100 epochs), i.e.

!tao speech_to_text evaluate \
    -e $SPECS_DIR/speech_to_text/evaluate.yaml \
    -g 1 \
    -k $KEY \
    -m $RESULTS_DIR/$UseModel

{‘test_loss’: 859.9560546875, ‘test_wer’: 0.8137128353118896}
i.e. a higher WER than before (after 56 epochs), which was:
{‘test_loss’: 829.0150756835938, ‘test_wer’: 0.7878395915031433}

Should I change any of the (default) training/finetuning parameters (which ones, and how)?
Or what else could be the reason for the deterioration after finetuning compared to the original pre-trained model?

Could you share your finetune.yaml?

Please use the finetune.yaml below and the pretrained model (Speech to Text English QuartzNet | NVIDIA NGC).

# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
# TLT spec file for fine-tuning a previously trained ASR model (Jasper or QuartzNet) on the MCV Russian dataset.

trainer:
  max_epochs: 3   # This is low for demo purposes

tlt_checkpoint_interval: 1

# Whether or not to change the decoder vocabulary.
# Note that this MUST be set if the labels change, e.g. to a different language's character set
# or if additional punctuation characters are added.
change_vocabulary: false
#change_vocabulary: true

# Fine-tuning settings: training dataset
finetuning_ds:
  manifest_filepath: ???
  sample_rate: 16000
  labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
           "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
  batch_size: 32
  trim_silence: true
  max_duration: 16.7
  shuffle: true
  is_tarred: false
  tarred_audio_filepaths: null

# Fine-tuning settings: validation dataset
validation_ds:
  manifest_filepath: ???
  sample_rate: 16000
  labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
           "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
  batch_size: 32
  shuffle: false

# Fine-tuning settings: optimizer
optim:
  name: novograd
  lr: 0.001
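
Since this spec keeps change_vocabulary: false, the transcripts in the manifests should only use characters from the labels list. A rough sketch of how one might check that is below. It assumes the usual NeMo-style manifest (one JSON object per line with a "text" field); the helper name uncovered_characters and the two sample manifest lines are invented for illustration and are not taken from the an4 data. Whether a mismatch like this played any role in the results reported earlier in the thread is not known.

```python
import json
import string

# Same 28 labels as in the spec: space, a-z, apostrophe.
LABELS = [" "] + list(string.ascii_lowercase) + ["'"]


def uncovered_characters(manifest_lines, labels):
    """Collect transcript characters that are not in the label set."""
    vocab = set(labels)
    missing = set()
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the manifest
        entry = json.loads(line)
        missing.update(ch for ch in entry["text"] if ch not in vocab)
    return missing


# Hypothetical manifest lines; in practice iterate over open(manifest_path).
sample_manifest = [
    '{"audio_filepath": "a.wav", "duration": 1.0, "text": "erase c q q f seven"}',
    '{"audio_filepath": "b.wav", "duration": 1.2, "text": "Rubout G M 39"}',
]
print(sorted(uncovered_characters(sample_manifest, LABELS)))
```

An empty result means every transcript character is covered by the labels. Anything else (uppercase letters or digits in the made-up second line here) would suggest normalizing the transcripts or revisiting the labels and change_vocabulary settings.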

I ran evaluation before finetuning.

# speech_to_text evaluate -e specs/speech_to_text/evaluate.yaml -k tlt_encode -m speechtotext_english_quartznet.tlt -r evalution_result_quartznet_tlt test_ds.manifest_filepath=data/an4_converted/test_manifest.json

{'test_loss': 1.9822663068771362, 'test_wer': 0.08408796787261963}

Then I ran finetuning for 100 epochs.

# speech_to_text finetune -e specs/speech_to_text/finetune.yaml -k tlt_encode -m speechtotext_english_quartznet.tlt -r result_finetune_again_2 finetuning_ds.manifest_filepath=data/an4_converted/train_manifest.json validation_ds.manifest_filepath=data/an4_converted/test_manifest.json trainer.max_epochs=100

I got the result below.

# speech_to_text evaluate -e specs/speech_to_text/evaluate.yaml -k tlt_encode -m result_finetune_again_2/checkpoints/finetuned-model.tlt -r evalution_result_quartznet_tlt_finetune test_ds.manifest_filepath=data/an4_converted/test_manifest.json

{'test_loss': 1.7699419260025024, 'test_wer': 0.05304010212421417}

That means finetuning takes effect and gives a better result.
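
To put a number on the improvement, here is a short sketch that computes the relative WER reduction from the two evaluation results quoted above. The helper name relative_wer_reduction is made up; the two dictionaries are copied from the evaluate outputs before and after finetuning.

```python
import ast

# Evaluation results copied from the two evaluate runs above.
before = ast.literal_eval(
    "{'test_loss': 1.9822663068771362, 'test_wer': 0.08408796787261963}"
)
after = ast.literal_eval(
    "{'test_loss': 1.7699419260025024, 'test_wer': 0.05304010212421417}"
)


def relative_wer_reduction(before_wer: float, after_wer: float) -> float:
    """Fraction of the original WER that was removed by finetuning."""
    return (before_wer - after_wer) / before_wer


reduction = relative_wer_reduction(before["test_wer"], after["test_wer"])
print(
    f"WER {before['test_wer']:.4f} -> {after['test_wer']:.4f} "
    f"({reduction:.1%} relative reduction)"
)
```

With these figures the WER goes from about 8.4% to about 5.3%, a relative reduction of roughly 37%.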

Dear Morgan, many thanks!
I tried to reproduce your figures. However, downloading the model via
ngc registry model download-version "nvidia/tao/speechtotext_english_quartznet:deployable_v1.2"
seems to give only a riva file
…/ speechtotext_english_quartznet_vdeployable_v1.2/quartznet_asr_set_1pt2.riva
but a riva file cannot be used for finetuning (it gives TypeError: Archive doesn’t have the required runtime, format, version or object class type)

Could you please tell me exactly which QuartzNet model file you used/downloaded?

many thanks

Please download the trainable tlt file.

wget https://api.ngc.nvidia.com/v2/models/nvidia/tao/speechtotext_english_quartznet/versions/trainable_v1.2/files/speechtotext_english_quartznet.tlt
