ngc registry model download-version "nvidia/tao/speechtotext_en_us_citrinet:trainable_v3.0"
I tried different .wav files (mono, 16 kHz sample rate); the infer call runs without any errors, but the transcriptions I get are essentially random, e.g. “case return individual who individual return sc return …”
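Since bad transcriptions often come from a format mismatch rather than the model itself, here is a minimal sketch for double-checking that a file really is what Citrinet expects (16 kHz, mono, 16-bit PCM). This uses only the Python standard library; the helper name `check_wav` is mine, not part of TAO or NeMo:

```python
import wave

def check_wav(path):
    """Report channel count, sample width, and sample rate of a WAV file.

    Citrinet expects 16 kHz mono 16-bit PCM input; this helper flags
    files that would still need resampling or downmixing.
    (check_wav is an illustrative helper, not a TAO/NeMo API.)
    """
    with wave.open(path, "rb") as w:
        info = {
            "channels": w.getnchannels(),
            "sample_width_bytes": w.getsampwidth(),
            "sample_rate_hz": w.getframerate(),
        }
    info["ok_for_citrinet"] = (
        info["channels"] == 1
        and info["sample_width_bytes"] == 2
        and info["sample_rate_hz"] == 16000
    )
    return info
```

If `ok_for_citrinet` comes back False, re-encode the file (e.g. with sox or ffmpeg) before running infer.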
Yes of course.
The hello world sample gives the following result:
[NeMo I 2022-03-02 22:11:19 infer:71] File: /data/hello_world.wav
[NeMo I 2022-03-02 22:11:19 infer:72] Predicted transcript: sc which them seven sc return
And the second sample gives something similar: “case return individual who individual return sc return …” (the word “return” appears constantly in the transcription, while it hardly occurs in the original audio)
So, on your side, the inference results are not correct at all.
Could you try running inference against some of the AN4 audio files mentioned in the Jupyter notebook?
I saw in the training notebook that they preprocess the audio files using tao speech_to_text_citrinet dataset_convert, so I wanted to apply that to my audio files:
That part I understand, and I put my .wav files in $DATA_DIR. However, dataset_convert seems to look for a train.tsv file, and I can’t find a spec for that file’s format in the documentation. (I imagine it should contain the paths of the audio files etc…)
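In case it helps while the exact spec is unclear: a minimal sketch of generating a train.tsv from a directory of .wav files, assuming dataset_convert expects a Mozilla Common Voice-style tab-separated file with a header row and per-file path/transcript columns. The column names (`path`, `sentence`) and the helper `write_train_tsv` are my assumptions, not taken from the TAO docs:

```python
import csv
from pathlib import Path

def write_train_tsv(data_dir, transcripts, out_path="train.tsv"):
    """Write a tab-separated manifest for the .wav files in data_dir.

    ASSUMPTION: a Common Voice-style layout with 'path' and 'sentence'
    columns; the real schema expected by
    `tao speech_to_text_citrinet dataset_convert` may differ.
    `transcripts` maps wav filename -> reference transcript ("" if none).
    """
    data_dir = Path(data_dir)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["path", "sentence"])
        for wav in sorted(data_dir.glob("*.wav")):
            writer.writerow([str(wav), transcripts.get(wav.name, "")])
    return out_path
```

Comparing the output of this against the train.tsv that the AN4 notebook produces would confirm or correct the assumed columns.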