ngc registry model download-version "nvidia/tao/speechtotext_en_us_citrinet:trainable_v3.0"
I tried different .wav files (mono, 16 kHz sample rate); the infer call runs without any errors, but the transcriptions I get are essentially random, e.g. “case return individual who individual return sc return …”
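Since bad transcriptions often come from a format mismatch rather than the model itself, here is a minimal sketch for double-checking that a file really is what Citrinet expects (16 kHz, mono, 16-bit PCM). This uses only the Python standard library; the helper name `check_wav` is mine, not part of TAO or NeMo:

```python
import wave

def check_wav(path):
    """Report channel count, sample width, and sample rate of a WAV file.

    Citrinet expects 16 kHz mono 16-bit PCM input; this helper flags
    files that would still need resampling or downmixing.
    (check_wav is an illustrative helper, not a TAO/NeMo API.)
    """
    with wave.open(path, "rb") as w:
        info = {
            "channels": w.getnchannels(),
            "sample_width_bytes": w.getsampwidth(),
            "sample_rate_hz": w.getframerate(),
        }
    info["ok_for_citrinet"] = (
        info["channels"] == 1
        and info["sample_width_bytes"] == 2
        and info["sample_rate_hz"] == 16000
    )
    return info
```

If `ok_for_citrinet` comes back False, re-encode the file (e.g. with sox or ffmpeg) before running infer.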
Yes of course.
The hello world sample gives the following result:
[NeMo I 2022-03-02 22:11:19 infer:71] File: /data/hello_world.wav
[NeMo I 2022-03-02 22:11:19 infer:72] Predicted transcript: sc which them seven sc return
And the second sample gives something similar: “case return individual who individual return sc return …” (the word “return” appears constantly in the transcription, while it hardly occurs in the original audio)
So, on your side, the inference results are not correct at all.
Could you try running inference against some of the AN4 audio files mentioned in the Jupyter notebook?
I saw in the training notebook that they preprocess the audio files using tao speech_to_text_citrinet dataset_convert, so I wanted to apply that to my audio files:
That part I understand, and I put my .wav files in $DATA_DIR. However, dataset_convert seems to look for a train.tsv file, and I can’t find a spec for that file’s format in the documentation. (I imagine it should contain the paths of the audio files etc…)
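In case it helps while the exact spec is unclear: a minimal sketch of generating a train.tsv from a directory of .wav files, assuming dataset_convert expects a Mozilla Common Voice-style tab-separated file with a header row and per-file path/transcript columns. The column names (`path`, `sentence`) and the helper `write_train_tsv` are my assumptions, not taken from the TAO docs:

```python
import csv
from pathlib import Path

def write_train_tsv(data_dir, transcripts, out_path="train.tsv"):
    """Write a tab-separated manifest for the .wav files in data_dir.

    ASSUMPTION: a Common Voice-style layout with 'path' and 'sentence'
    columns; the real schema expected by
    `tao speech_to_text_citrinet dataset_convert` may differ.
    `transcripts` maps wav filename -> reference transcript ("" if none).
    """
    data_dir = Path(data_dir)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["path", "sentence"])
        for wav in sorted(data_dir.glob("*.wav")):
            writer.writerow([str(wav), transcripts.get(wav.name, "")])
    return out_path
```

Comparing the output of this against the train.tsv that the AN4 notebook produces would confirm or correct the assumed columns.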