Speech_to_text_citrinet infer yields random transcription results

I’m trying to transcribe audio files using tao speech_to_text_citrinet infer with the following pre-trained citrinet model:
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_en_us_citrinet

ngc registry model download-version "nvidia/tao/speechtotext_en_us_citrinet:trainable_v3.0"

I have tried several different .wav files (mono, 16 kHz sample rate); the infer call runs without any errors, but the transcription results I get are essentially random, e.g. “case return individual who individual return sc return …”
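For reference, I produce the mono 16 kHz inputs with a standard conversion along these lines (file names are just placeholders; -ac 1 forces a single channel and -ar 16000 resamples to 16 kHz):

ffmpeg -i original_recording.wav -ac 1 -ar 16000 sample_16k_mono.wav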

Here is the notebook I’m using:
tao-inference.ipynb (20.0 KB)
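The infer cell in it boils down to roughly the following (the encryption key, the spec file name, and the model/result paths are placeholders from my setup):

! tao speech_to_text_citrinet infer \
    -e $SPECS_DIR/speech_to_text_citrinet/infer.yaml \
    -g 1 \
    -k $KEY \
    -m $DATA_DIR/speechtotext_en_us_citrinet.tlt \
    -r $RESULTS_DIR/citrinet/infer \
    file_paths=[$DATA_DIR/hello_world.wav]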

System Configuration
• OS: Ubuntu 18.04
• Hardware: Tesla V100-SXM2-16GB, CUDA version 11.1
• TLT version: 3.0
• TAO version: 3.21.11

Could you share a link to the audio files, or attach several of them?

Sure. Here are two sample files:

Could you share the results you get when you run the above two files? You mentioned that you get random results.

Yes, of course.
The hello world sample gives the following result:

[NeMo I 2022-03-02 22:11:19 infer:71] File: /data/hello_world.wav
[NeMo I 2022-03-02 22:11:19 infer:72] Predicted transcript: sc which them seven sc return

And the second sample gives something similar: “case return individual who individual return sc return …” (the word “return” appears constantly in the transcription, while hardly occurring in the original audio).

So, on your side, the inference results are not correct at all.
Could you try running inference against some of the an4 audio files mentioned in the Jupyter notebook?

https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/resources/speechtotext_citrinet_notebook/version/v1.3/files/speech-to-text-training.ipynb

I just tried to download an4, but the link seems to be down: http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz
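For reference, the download step I ran boils down to something like this (simply fetching and unpacking that archive into my data directory):

! wget http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz -P $DATA_DIR
! tar -xf $DATA_DIR/an4_sphere.tar.gz -C $DATA_DIR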

I saw in the training notebook that they preprocess the audio files using tao speech_to_text_citrinet dataset_convert, so I wanted to apply that to my audio files:

! tao speech_to_text_citrinet dataset_convert \
    -e $SPECS_DIR/speech_to_text_citrinet/dataset_convert_en.yaml \
    -r $RESULTS_DIR/citrinet/dataset_convert \
    source_data_dir=$DATA_DIR/ \
    target_data_dir=$DATA_DIR/test_converted

This, however, yields the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/data/train.tsv'

I can’t find anything about that .tsv file in the documentation, nor how dataset_convert is supposed to be run.

Please note that $DATA_DIR should be a path inside the docker container. The mapping is defined in the tao_mounts.json file.
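For example, the mapping file looks roughly like this (the local source paths below are only an example; the destinations are the in-container paths such as /data):

{
    "Mounts": [
        {"source": "/home/<username>/tao/data", "destination": "/data"},
        {"source": "/home/<username>/tao/specs", "destination": "/specs"},
        {"source": "/home/<username>/tao/results", "destination": "/results"}
    ]
}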

That I understand, and I did put my .wav files in $DATA_DIR. However, dataset_convert seems to be looking for a train.tsv file, and I can’t find a specification of its format anywhere in the documentation. (I imagine it should contain the paths of the audio files etc…)
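Purely as a guess on my side, I would expect a tab-separated layout along these lines, perhaps similar to the Mozilla Common Voice .tsv files (column names and values below are made up):

path	sentence
clips/sample_0001.wav	hello world
clips/sample_0002.wav	this is a test recording

But I would like to confirm what columns are actually expected.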

I do not understand. Could you try running the official notebook first?

What I’m trying to do is convert my audio files using the dataset_convert command, but this yields the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/data/train.tsv'

…meaning that it expects a train.tsv file in $DATA_DIR, and I don’t know what the content of this .tsv file should be.

The original notebook doesn’t work because the an4 dataset seems to be offline.

Could you follow the notebook and download the an4 dataset?

