Following the description in
https://docs.nvidia.com/tao/tao-toolkit/text/asr/speech_recognition.html
(i.e. the "nvidia/tao/speechtotext_notebook:v1.3" notebook),
I trained QuartzNet for 400 epochs on the AN4 training set.
Evaluation shows
{'test_loss': 58.62380599975586, 'test_wer': 0.8576973080635071}
A WER of roughly 86% seems very poor!
Also, when running inference on the test set, the results don't look great, e.g.:
[NeMo I 2022-01-19 16:03:52 infer:70] The prediction results:
[NeMo I 2022-01-19 16:03:52 infer:72] File: /data/an4_converted/wavs/an406-fcaw-b.wav
[NeMo I 2022-01-19 16:03:52 infer:73] Predicted transcript: rubout sey nine
[NeMo I 2022-01-19 16:03:52 infer:72] File: /data/an4_converted/wavs/an407-fcaw-b.wav
[NeMo I 2022-01-19 16:03:52 infer:73] Predicted transcript: erase o tt ie
[NeMo I 2022-01-19 16:03:52 infer:72] File: /data/an4_converted/wavs/an408-fcaw-b.wav
[NeMo I 2022-01-19 16:03:52 infer:73] Predicted transcript: o t t fe thre
…
At least, that is weak compared to the recognition results given in
https://docs.nvidia.com/tao/tao-toolkit/text/asr/speech_recognition.html
(I also tried fine-tuning on the training set, with no improvement.)
What might I have done wrong (I didn't change much besides
trainer.max_epochs=400)? Did you use other training data as well?
Or other parameter values (trainer.max_epochs…) than the ones given in the doc?
Or did you use some kind of n-gram (or other) language model to improve the recognized character sequences (is that described somewhere)?
Is it also possible to output more detailed results (per utterance: insertions, deletions, substitutions) when running !tao speech_to_text evaluate …?
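Until such an option exists, the kind of per-utterance breakdown I mean can be computed by hand with a Levenshtein alignment. A minimal sketch (the reference transcripts below are made-up placeholders, not the real AN4 references):

```python
# Per-utterance WER breakdown (substitutions/deletions/insertions)
# via word-level Levenshtein alignment.

def align_counts(ref_words, hyp_words):
    """Return (substitutions, deletions, insertions) for one utterance."""
    R, H = len(ref_words), len(hyp_words)
    # dp[i][j] = (cost, subs, dels, ins) for ref[:i] vs hyp[:j]
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        dp[i][0] = (i, 0, i, 0)          # all deletions
    for j in range(1, H + 1):
        dp[0][j] = (j, 0, 0, j)          # all insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref_words[i - 1] == hyp_words[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match, no extra cost
            else:
                sub = dp[i - 1][j - 1]
                dele = dp[i - 1][j]
                ins = dp[i][j - 1]
                dp[i][j] = min(
                    (sub[0] + 1, sub[1] + 1, sub[2], sub[3]),
                    (dele[0] + 1, dele[1], dele[2] + 1, dele[3]),
                    (ins[0] + 1, ins[1], ins[2], ins[3] + 1),
                )
    _, s, d, n = dp[R][H]
    return s, d, n

# (reference, prediction) pairs -- hypothetical examples, not real AN4 data
pairs = [
    ("rubout g m e f three nine", "rubout sey nine"),
    ("erase o two four four two", "erase o tt ie"),
]
total_errs, total_ref = 0, 0
for ref, hyp in pairs:
    s, d, n = align_counts(ref.split(), hyp.split())
    total_errs += s + d + n
    total_ref += len(ref.split())
    print(f"{hyp!r}: S={s} D={d} I={n} ref_len={len(ref.split())}")
print(f"overall WER = {total_errs / total_ref:.3f}")
```

In practice the pairs would come from the test manifest and the inference output.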
Thanks for any hint!
Update: having now run several trainings/fine-tunings, testing the following three on the 130 utterances of an4_converted/test_manifest.json showed a maximum word accuracy of only 18.2%:
grep "Percent Word Accuracy" InferResults/*.summary | sed 's|^|##> |'
##> InferResults/an4_test.res.summary:Percent Word Accuracy = 13.2%
##> InferResults/an4_test_FT20220118.res.summary:Percent Word Accuracy = 18.2%
##> InferResults/an4_test_X.res.summary:Percent Word Accuracy = 14.2%
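For comparison: if "Percent Word Accuracy" is simply 100·(1−WER), which is an assumption on my part, the test_wer from the evaluation above would correspond to:

```python
# Convert the reported test_wer into a word-accuracy percentage,
# assuming word accuracy = 100 * (1 - WER).
test_wer = 0.8576973080635071  # from the tao evaluate output above
word_acc = 100 * (1 - test_wer)
print(f"{word_acc:.1f}%")  # prints "14.2%"
```

Interestingly, that is close to the 14.2% reported for an4_test_X above.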
Not surprising, because individual letters are recognized as words here. How can that be improved? Is there a language model for the letter sequences?
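To illustrate what I mean by a language model over the letter sequences: candidate transcripts could be rescored with a character-level n-gram model trained on the training transcripts. A generic sketch (this is not a TAO/NeMo feature; the training sentences and candidates below are made up):

```python
# Rescoring candidate transcripts with a character-level bigram LM.
import math
from collections import Counter

def train_bigram(sentences):
    """Count character unigrams and bigrams (with a start marker '^')."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        chars = ["^"] + list(s)
        uni.update(chars)
        bi.update(zip(chars, chars[1:]))
    return uni, bi

def logprob(s, uni, bi, vocab_size, alpha=1.0):
    """Add-alpha smoothed log-probability of a string under the bigram LM."""
    chars = ["^"] + list(s)
    lp = 0.0
    for a, b in zip(chars, chars[1:]):
        lp += math.log((bi[(a, b)] + alpha) / (uni[a] + alpha * vocab_size))
    return lp

# Hypothetical training transcripts (in reality: the AN4 training set)
train = ["erase o two four four two", "rubout g m e f three nine"]
uni, bi = train_bigram(train)
vocab = len(set("".join(train)) | {"^"})

# Candidates as they might come out of a beam search
candidates = ["erase o tt ie", "erase o two ie"]
best = max(candidates, key=lambda c: logprob(c, uni, bi, vocab))
print("best candidate:", best)
```

A real setup would use a proper n-gram toolkit and integrate the LM into beam-search decoding rather than rescoring whole hypotheses, but the idea is the same: prefer letter sequences that are plausible under the training data.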
What am I doing wrong? Any hint would be appreciated!