Speech_to_text_citrinet infer yields random transcription results

The problem is described in the topic "Speech_to_text_citrinet infer yields random transcription results", but it was not solved there.

Every file I run through inference produces text like “individual case return sc case transform them transform return sc return sc return case return case sc case sc does sc individual return case return still scie still transformie transform return case w”. I also tried files from the AN4 dataset, but that didn’t help either; the recognized text was about the same. The only difference from the standard setup is that I downloaded the dataset from another source and converted it from SPH format to 16 kHz WAV using Audacity.
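
Since the conversion step is the one deviation from the standard data preparation, it may be worth confirming that the converted files have the expected format. Below is a minimal sketch using Python's standard `wave` module. It assumes the model wants 16 kHz, mono, 16-bit PCM input (an assumption; check the model card), and `describe_wav` and `is_16k_mono_pcm16` are hypothetical helper names, not part of TAO:

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, bits_per_sample) from a WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels(), wav.getsampwidth() * 8


def is_16k_mono_pcm16(path):
    """True if the file is 16 kHz, mono, 16-bit PCM (assumed target format)."""
    return describe_wav(path) == (16000, 1, 16)
```

For example, `is_16k_mono_pcm16("/data/an268-mbmg-b.wav")` returning `False` would point to the export settings rather than the model. Note that `wave` only reads uncompressed PCM WAV files and raises `wave.Error` on other encodings.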

I also tried the Russian model; the recognized text is always different from what is spoken in the audio file.

Could you please refer to Tao speech_to_text evaluate+infer show very weak results - #26 by Morganh and run some experiments?
In that topic, I ran speech_to_text and the result was fine.

In your case, when running speech_to_text_citrinet, you can use the model from Speech to Text English Citrinet | NVIDIA NGC

But I don’t need to run evaluate. I just want to check recognition quality using infer. I ran the command

tao speech_to_text_citrinet infer -e /specs/speech_to_text_citrinet/infer.yaml -g 1 -k tlt_encode -m /results/citrinet/speechtotext_english_citrinet.tlt -r /results/citrinet/infer file_paths=[/data/an268-mbmg-b.wav ]

Using the checkpoint you sent me, the result was “university was university one university one”, which is completely different from what is spoken in the file.

I ran all the commands from the notebook in the console; all folders are mounted for the TAO Docker container.

I have now also completed all the steps to prepare AN4 using the official NVIDIA notebook, Speech to Text Citrinet Notebook | NVIDIA NGC. The result of recognizing files in the notebook using the “ASR Inference” cell is identical to the result in the console: a set of random words. All I did was download the model and follow the instructions on the site. I think there may be a mistake on your side; perhaps you updated something recently.

Hi,
There might be something wrong with that version of the NGC pretrained model.
Please use the one below instead.

wget https://api.ngc.nvidia.com/v2/models/nvidia/tao/speechtotext_english_citrinet/versions/trainable_v1.7/files/speechtotext_english_citrinet_1024.tlt

I ran inference against it, and the previous issue is gone.
I also ran evaluation; the WER is only about 2.4579%.

speech_to_text_citrinet evaluate -e specs/speech_to_text_citrinet/evaluate.yaml -k tlt_encode -m speechtotext_english_citrinet_1024.tlt -r evalution_speech_to_text_citrinet_ngc_tlt test_ds.manifest_filepath=data/an4_converted/test_manifest.json
DATALOADER:0 TEST RESULTS
{'test_loss': 0.520318329334259, 'test_wer': 0.02457956038415432}
speech_to_text_citrinet infer -e specs/speech_to_text/infer.yaml -k tlt_encode -m speechtotext_english_citrinet_1024.tlt -r infer_result file_paths=[data/an4_converted/wavs/an406-fcaw-b.wav]
[NeMo I 2022-04-14 03:41:32 infer:72] Predicted transcript: rabout g m e f three nine
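
For reference, `test_wer` is reported as a fraction, so 0.0246 corresponds to about 2.46%. Below is a minimal sketch of how a word error rate can be computed for one utterance: the word-level edit distance divided by the number of reference words. This is an illustration only, not the TAO/NeMo implementation, and `word_error_rate` is a hypothetical helper name:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

With a hypothetical reference of "rubout g m e f three nine" and the predicted transcript above, one of seven words differs, so the function returns 1/7, roughly 0.143. A corpus-level figure such as `test_wer` is usually obtained by summing errors and reference words over all utterances rather than averaging per-file values.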

Indeed, using this checkpoint I was able to get great results on different audio files, even outside the AN4 dataset. But why do the checkpoints for the Russian and English versions available at RIVA Citrinet ASR Russian | NVIDIA NGC and RIVA Citrinet ASR English | NVIDIA NGC not produce correct transcriptions? I also want to check the quality of models for other languages.

Still checking. I am not sure whether there is a mismatch somewhere.
Could you try running speech_to_text instead of speech_to_text_citrinet for the two models you mentioned?

For both models, when running speech_to_text instead of speech_to_text_citrinet, I get the error:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpjatcuk3h/model_weights.ckpt'

Oh, OK, please ignore my request. These models should only run with speech_to_text_citrinet.

Should I do something else? Or should I wait for your answer?

The internal team is looking into it. There is no result yet for those two models. As mentioned above, for the English version, please use
$ wget https://api.ngc.nvidia.com/v2/models/nvidia/tao/speechtotext_english_citrinet/versions/trainable_v1.7/files/speechtotext_english_citrinet_1024.tlt

Hi, Morganh. Is there any news for these two models?

Yes, the issue has been addressed. The internal team is working on new ones.

The new ones are available at RIVA Citrinet ASR English | NVIDIA NGC
and RIVA Citrinet ASR Russian | NVIDIA NGC