Tao speech_to_text evaluate+infer show very weak results

Following the description in
https://docs.nvidia.com/tao/tao-toolkit/text/asr/speech_recognition.html
(i.e. using "nvidia/tao/speechtotext_notebook:v1.3")
I trained QuartzNet for 400 epochs on the an4 training set.

Evaluation shows
{'test_loss': 58.62380599975586, 'test_wer': 0.8576973080635071}
A WER of 85% is very weak, I suppose!

When running inference on the test set, the results also don't look great, e.g.:
[NeMo I 2022-01-19 16:03:52 infer:70] The prediction results:
[NeMo I 2022-01-19 16:03:52 infer:72] File: /data/an4_converted/wavs/an406-fcaw-b.wav
[NeMo I 2022-01-19 16:03:52 infer:73] Predicted transcript: rubout sey nine
[NeMo I 2022-01-19 16:03:52 infer:72] File: /data/an4_converted/wavs/an407-fcaw-b.wav
[NeMo I 2022-01-19 16:03:52 infer:73] Predicted transcript: erase o tt ie
[NeMo I 2022-01-19 16:03:52 infer:72] File: /data/an4_converted/wavs/an408-fcaw-b.wav
[NeMo I 2022-01-19 16:03:52 infer:73] Predicted transcript: o t t fe thre

At least compared to the recognition results you give in
https://docs.nvidia.com/tao/tao-toolkit/text/asr/speech_recognition.html

(I also tried fine-tuning on the training set; no improvement.)

What might I have done wrong (I didn't change much besides
trainer.max_epochs=400)? Did you use other training data as well,
or other parameter values (trainer.max_epochs, …) than the ones given in the docs?

Or did you use some kind of N-gram (or other) language model to improve the recognized character sequences (is that described somewhere)?

Is it also possible to output more detailed results (per utterance: insertions, deletions, substitutions) when running !tao speech_to_text evaluate …?
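For reference, such a per-utterance breakdown can be computed outside of TAO from the reference and predicted transcripts (e.g. taken from the test manifest and the infer log). A minimal sketch in plain Python — the function name and the example pair are mine, and no TAO/NeMo API is used:

```python
# Per-utterance WER breakdown via Levenshtein alignment over words.

def wer_breakdown(reference: str, hypothesis: str):
    """Return (substitutions, deletions, insertions) between word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = (cost, subs, dels, ins) for ref[:i] vs hyp[:j]
    d = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = (i, 0, i, 0)              # delete all reference words so far
    for j in range(1, len(hyp) + 1):
        d[0][j] = (j, 0, 0, j)              # insert all hypothesis words so far
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]   # match, no cost
            else:
                c, s, dl, ins = d[i - 1][j - 1]
                sub = (c + 1, s + 1, dl, ins)
                c, s, dl, ins = d[i - 1][j]
                dele = (c + 1, s, dl + 1, ins)
                c, s, dl, ins = d[i][j - 1]
                inse = (c + 1, s, dl, ins + 1)
                d[i][j] = min(sub, dele, inse)  # min compares cost first
    _, s, dl, ins = d[len(ref)][len(hyp)]
    return s, dl, ins

# Example pair taken from this thread (the reference is assumed, not from an4):
ref = "rubout g m e f three nine"
hyp = "rubout sey nine"
s, dl, ins = wer_breakdown(ref, hyp)
wer = (s + dl + ins) / len(ref.split())     # per-utterance WER
```

Running this over every line of the infer output would give the per-utterance table that `evaluate` itself does not print.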

Thanks for any hint!


Having now run several trainings/fine-tunings and tested the following three models on the 130 utterances of an4_converted/test_manifest.json, I see a maximum word accuracy of only 18.2%:

grep "Percent Word Accuracy" InferResults/*.summary | sed 's|^|##> |'
##> InferResults/an4_test.res.summary:Percent Word Accuracy = 13.2%
##> InferResults/an4_test_FT20220118.res.summary:Percent Word Accuracy = 18.2%
##> InferResults/an4_test_X.res.summary:Percent Word Accuracy = 14.2%

Not surprising, because letters are recognized as individual words here; how can that be improved? Is there a language model for the letter sequences?
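To make the letters-as-words effect concrete: in an4 the spelled letters are separate words, so one wrong letter costs a whole word, and WER looks far worse than the character error rate (CER) on the same pair. A small sketch in plain Python (the transcript pair is adapted from the examples above, with an assumed reference; actually improving the letter sequences would typically mean beam-search decoding with an n-gram language model instead of the greedy decoding presumably used here):

```python
# WER vs. CER on one utterance: Levenshtein distance over words vs. characters.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution / match
        prev = cur
    return prev[-1]

# Pair adapted from the examples above (reference is assumed):
ref, hyp = "erase c q q f seven", "erase o t t ie seven"
wer = edit_distance(ref.split(), hyp.split()) / len(ref.split())  # word level
cer = edit_distance(ref, hyp) / len(ref)                          # character level
# wer comes out far higher than cer: each wrong letter is a whole "word"
```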

What am I doing wrong?
Any hints?

Sorry for the late reply. May I know if you are running with the officially released Jupyter notebook?

https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_quick_start_guide.html#use-the-examples

No prob - thanks for coming back.
I followed this: Speech to Text Notebook | NVIDIA NGC
which is the link “Speech to Text” in
TAO Toolkit Quick Start Guide — TAO Toolkit 3.21.11 documentation
i.e. exactly what you mentioned.
(BTW, it's rather confusing with the many different places, like https://catalog.ngc… , NVIDIA TAO Documentation, …; it's hard to figure out which is the relevant one. Sometimes I felt lost in space…)

However, I used
ngc registry resource download-version "nvidia/tao/speechtotext_notebook:v1.3"
since I got: Error: 'nvidia/tao/speechtotext_notebook:v1.0' could not be found.

As described, I trained and tested with the an4 data. Of course I had to adapt some things, like paths or "trainer.max_epochs=400".
I only tried QuartzNet, not Jasper.

During training I got some warnings, though I think they are not crucial (I append some at the end), since models are created and written and inference was also possible; however, recognition is poor.
kind regards,
Andi

2022-01-24 11:19:38,677 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/akiessling/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
[NeMo W 2022-01-24 10:19:42 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/torchaudio-0.7.0a0+42d447d-py3.8-linux-x86_64.egg/torchaudio/backend/utils.py:53: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to [Announcement] Improving I/O for correct and consistent experience · Issue #903 · pytorch/audio · GitHub for the detail.
warnings.warn(

[NeMo W 2022-01-24 10:19:42 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_text_dali._AudioTextDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-01-24 10:19:45 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/torchaudio-0.7.0a0+42d447d-py3.8-linux-x86_64.egg/torchaudio/backend/utils.py:53: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to [Announcement] Improving I/O for correct and consistent experience · Issue #903 · pytorch/audio · GitHub for the detail.
warnings.warn(

[NeMo W 2022-01-24 10:19:45 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_text_dali._AudioTextDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-01-24 10:19:46 nemo_logging:349] /home/jenkins/agent/workspace/tlt-pytorch-main-nightly/asr/speech_to_text/scripts/infer.py:80: UserWarning:
'infer.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See Automatic schema-matching | Hydra for migration instructions.

Can you share the .ipynb file?

Received offline.

Could you upload a new .ipynb file? I cannot find the training log.

Hi,
As mentioned in the notebook, please run "speech_to_text finetune" with a pretrained model.

Note: If you wish to proceed with a pre-trained model for better inference results, you can find a .nemo model here.
Simply rename the .nemo file to .tlt and pass it through the finetune pipeline.

! tao speech_to_text finetune xxx -m pretrained_model

Oops!
I thought the QuartzNet model which is loaded at the very beginning of the notebook IS already a pre-trained model, then an additional training is performed with the an4 training set
(tao speech_to_text train …/an4_converted/train_manifest.json)
and after that a fine-tuning can be done with the same an4 training set (or with a different one).
Was I wrong here?

When using your link in
'.nemo model here.'
I end up here:

and clicking on 'ASR with NeMo' I get
'Hmm. We're having trouble finding that site.'
Has the model been moved somewhere else?
Could you recommend which pre-trained model is best to use?

This one, fetched via:
ngc registry model download-version "nvidia/tao/speechtotext_en_us_quartznet:deployable_v1.2"

does not seem to be the right one, since when I use it with the -m option in the finetune step, I get the following errors:

[NeMo I 2022-01-31 15:02:34 features:252] PADDING: 16
[NeMo I 2022-01-31 15:02:34 features:269] STFT using torch
Error executing job with overrides: ['exp_manager.explicit_log_dir=/results/quartznet/finetuneQNus', 'trainer.gpus=1', 'restore_from=/results/PreTrainedModels/speechtotext_en_us_quartznet_vdeployable_v1.2/quartznet_asr_set_1pt2.riva', 'encryption_key=tlt_encode', 'finetuning_ds.manifest_filepath=/data/an4_converted/train_manifest.json', 'validation_ds.manifest_filepath=/data/an4_converted/test_manifest.json', 'trainer.max_epochs=51', 'finetuning_ds.num_workers=20', 'validation_ds.num_workers=20', 'trainer.gpus=1']
Traceback (most recent call last):
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/tlt_utils/connectors/save_restore_connector.py", line 77, in restore_from
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/tlt_utils/cookbooks/nemo_cookbook.py", line 396, in restore_from
TypeError: Archive doesn't have the required runtime, format, version or object class type

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 368, in <lambda>
lambda: hydra.run(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 110, in run
_ = ret.return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/asr/speech_to_text/scripts/finetune.py", line 120, in main
File "/opt/conda/lib/python3.8/site-packages/nemo/core/classes/modelPT.py", line 270, in restore_from
instance = cls._save_restore_connector.restore_from(
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/tlt_utils/connectors/save_restore_connector.py", line 87, in restore_from
File "/opt/conda/lib/python3.8/site-packages/nemo/core/connectors/save_restore_connector.py", line 140, in restore_from
self._load_state_dict_from_disk(model_weights, map_location=map_location), strict=strict
File "/opt/conda/lib/python3.8/site-packages/nemo/core/connectors/save_restore_connector.py", line 390, in _load_state_dict_from_disk
return torch.load(model_weights, map_location=map_location)
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 594, in load
with _open_file_like(f, 'rb') as opened_file:
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 230, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 211, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpyx9ge45/model_weights.ckpt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/asr/speech_to_text/scripts/finetune.py", line 149, in <module>
File "/opt/conda/lib/python3.8/site-packages/nemo/core/config/hydra_runner.py", line 101, in wrapper
_run_hydra(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 367, in _run_hydra
run_and_report(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 251, in run_and_report
assert mdl is not None
AssertionError
2022-01-31 16:02:38,213 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I could make it run by first using the pre-trained model in a training step instead of immediately in a finetune step (as done above, which gave the errors above). So I trained 400 max_epochs on an4_train, but evaluation again showed poor results:
{'test_loss': 42.914344787597656, 'test_wer': 0.8331177234649658}
and also finetuning did not improve.

So speechtotext_en_us_quartznet_vdeployable_v1.2/quartznet_asr_set_1pt2.riva does not seem to be the right 'pre-trained' model for obtaining better results.
Which one could you recommend?
(Or could you tell which models and training data were used in Speech Recognition — TAO Toolkit 3.21.11 documentation to achieve the reported, good inference results:
[NeMo I 2021-01-21 00:22:00 infer:67] The prediction results:
[NeMo I 2021-01-21 00:22:00 infer:69] File: /data/an4/wav/an4test_clstk/fcaw/an406-fcaw-b.wav
[NeMo I 2021-01-21 00:22:00 infer:70] Predicted transcript: rubout g m e f three nine
[NeMo I 2021-01-21 00:22:00 infer:69] File: /data/an4/wav/an4test_clstk/fcaw/an407-fcaw-b.wav
[NeMo I 2021-01-21 00:22:00 infer:70] Predicted transcript: erase c q q f seven

(What I also don't understand is how these good inference results fit with the reported weak WER:
{'test_loss': tensor(68.1998, device='cuda:0'), 'test_wer': tensor(0.9987, device='cuda:0')}?)

many thanks!

Sorry for the late reply. Could you please use STT En Citrinet 512 | NVIDIA NGC?

Thanks for coming back and for the link. I tried stt_en_citrinet_512.nemo and it seems to work much better; loss and WER on the an4 test set are much lower!
{'test_loss': 1.211700439453125, 'test_wer': 0.051746442914009094}
But this is just for the as-is, original, pre-trained model. When trying any training or fine-tuning step to adapt the model a bit more to the
an4 training set (in most of the training epochs the log says: "val_loss was not in top 3"), the WER degrades again to
{'test_loss': 45.33321762084961, 'test_wer': 0.8331177234649658}
i.e. the training with an4 harms more than it helps. Is there something wrong with it? Have you seen this too? Or do you have any other idea?
many thanks!

Could you mkdir a new result folder and double check? Thanks.

What exactly do you mean?
I did all experiments based on stt_en_citrinet_512.nemo in a fresh result directory. These were the 3 eval runs:

UseModel="citrinet512/TrainedWithAn4train/checkpoints/trained-model.tlt" ### after 100 training epochs
OutD="evaluateCN512" ## {'test_loss': 101.45941162109375, 'test_wer': 0.8576973080635071}
#-------
UseModel="citrinet512/TrainedWithAn4train/trained-model_epoch_69.tlt" ### after 69 training epochs (loss had its smallest value at epoch 69)
OutD="evaluateCN512_Ep69" ## {'test_loss': 45.33321762084961, 'test_wer': 0.8331177234649658}
#-------
UseModel="PreTrainedModels/stt_en_citrinet_512.nemo" ### without any training, on the original model - this was the best!!!
OutD="evaluateCN512_Org" ## {'test_loss': 1.211700439453125, 'test_wer': 0.051746442914009094}

!tao speech_to_text evaluate \
 -e $SPECS_DIR/speech_to_text/evaluate.yaml \
 -g 1 \
 -k $KEY \
 -m $RESULTS_DIR/$UseModel \
 -r $RESULTS_DIR/$OutD \
 test_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json

OK, thanks for the confirmation.

Please run the command below to train on the an4 dataset with the .nemo file as the pretrained model.
$ tao speech_to_text finetune xxx -m pretrained_model

Running finetune with the original pretrained model did *not* work, i.e.
the following

PreTrainedModel="PreTrainedModels/stt_en_citrinet_512.nemo"
OutD="finetuneCNorgWithAn4"

!tao speech_to_text finetune \
 -e $SPECS_DIR/speech_to_text/finetune.yaml \
 -g 1 \
 -k $KEY \
 -m $RESULTS_DIR/$PreTrainedModel \
 -r $RESULTS_DIR/$OutD \
 finetuning_ds.manifest_filepath=$DATA_DIR/an4_converted/train_manifest.json \
 validation_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json \
 trainer.max_epochs=57 \
 finetuning_ds.num_workers=20 \
 validation_ds.num_workers=20 \
 trainer.gpus=1

showed an error: TypeError: change_vocabulary() got an unexpected keyword argument 'new_vocabulary'

But when using this
PreTrainedModel="citrinet512/TrainedWithAn4train/checkpoints/trained-model.tlt"
instead (which was the outcome of the 'training' step before), the finetuning runs, producing
finetuneCNorgWithAn4/finetuned-model_epoch_56.tlt

When doing evaluation using this model,
UseModel="finetuneCNorgWithAn4/run_4/finetuned-model_epoch_56.tlt"
OutD="evaluateCN512_finetuneWithAn4"

!tao speech_to_text evaluate \
 -e $SPECS_DIR/speech_to_text/evaluate.yaml \
 -g 1 \
 -k $KEY \
 -m $RESULTS_DIR/$UseModel \
 -r $RESULTS_DIR/$OutD \
 test_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json

shows a WER of
{'test_loss': 829.0150756835938, 'test_wer': 0.7878395915031433}
which is a few points better, but still poor. I'm currently running more
fine-tuning epochs; however, always seeing these "Epoch xxxx, global step 1859: val_loss was not in top 3" messages,
I fear there will not be much improvement…

Yes, more fine-tuning epochs did not help; already epoch 2 has the smallest loss (or am I reading this wrong?):
finetuneCNorgWithAn4/checkpoints/
├── finetuned-model--val_loss=50.38-epoch=2.ckpt
├── finetuned-model--val_loss=51.63-epoch=5.ckpt
├── finetuned-model--val_loss=51.64-epoch=6.ckpt
├── finetuned-model--val_loss=859.96-epoch=99-last.ckpt
└── finetuned-model.tlt

Should I change any of the (default) training/finetuning parameters?

Oops, I wanted to run evaluate with this newly finetuned model, but an error occurs (also when rerunning the OK models from yesterday):
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown")

I tried a Jupyter restart etc.; it didn't help.
Any proposal what to do?

How about running
$ nvidia-container-cli --load-kmods info