[BUG] Conformer CTC streaming ASR with timestamps enabled returns weird start time of first word

Please provide the following information when requesting support.

Hardware - GPU (T4)
Hardware - CPU 4
Operating System ubuntu:20.04
Riva Version 2.0.0

A simple test gives me the following timestamps:

Result: alternatives {
  transcript: "good morning"
  words {
    start_time: 1306600
    end_time: 200
    word: "good"
  }
  words {
    start_time: 440
    end_time: 520
    word: "morning"
  }
}

The start_time of the first word is always 1306600. This seems to be a bug; I would expect the true start_time of the word.


Hi @ilb

Thanks for your interest in Riva,

Thanks for letting us know about this; I will check further with the team and provide an update.

Hi, can you please share the audio file in question?

We are also facing the same issue. Currently it's a blocker on our live systems.

Hi @shahin.konadath

Thanks for your interest in Riva,

Can you please share the audio file/sample that leads to the issue (the weird start time)?

Thanks

Unfortunately this is not possible. Even if I shared the file, you would not be able to test it: I am using a custom Conformer-CTC model for a non-English language, which I cannot share. The model was built with nemo:1.8.2 following the standard training procedures, then converted first to .riva, then to an .rmir, and finally to the deployed models, following the steps described in the Riva 2.2.0 documentation. I have no problems running the model with NeMo alone; I only see issues once I deploy it to Riva.

I have tried --nn.use_onnx_runtime as well as converting to a TensorRT plan, which by the way takes a very long time (~34 min for TRT vs. ~2 min for ONNX). I am running Riva on a T4 GPU. I have tried disabling VAD, and I am planning to try a greedy decoder to rule out any issue caused by the language model and lexicon. Regardless of the approach, I always get final results where the first word in the transcript has an invalid, constant timestamp, e.g.:

{'results': [{'alternatives': [{'transcript': '....',
                                'confidence': 1.0,
                                'words': [{'startTime': 1303680, 'endTime': 22400, 'word': '...'},
                                          {'startTime': 22640, 'endTime': 22720, 'word': '...'},
                                          ...,
                                          {'startTime': 27920, 'endTime': 28000, 'word': '...'}]}],
              'isFinal': True,
              'channelTag': 1,
              'audioProcessed': 29.599977}]}

The value is slightly different depending on whether I run Riva 2.0.0 or 2.2.0, but it is always the first word of a transcript: either the first word of the entire file, or mid-file, immediately following a previous VAD segment (is_final).

Update: conversion to rmir with a greedy decoder succeeds with no errors, and so does deploying to Riva, i.e. the model loads successfully, but when I try to transcribe anything I get no transcripts back. I am running examples/transcribe_file_verbose.py with additional printouts of every response, but with the greedy decoder there are simply no responses.
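For reference, a greedy build of this kind looks roughly like the following sketch. This is only a sketch: --decoder_type=greedy is assumed here to be an accepted riva-build option, and the featurizer and chunk flags mirror the ones I use for my flashlight builds; the language model, lexicon, and flashlight_decoder flags are simply dropped.

# Sketch: greedy CTC decoding, no language model or lexicon
riva-build speech_recognition \
   <path_to_rmir> <path_to_riva> \
   --name=<model_name> \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=40 \
   --chunk_size=0.16 \
   --left_padding_size=1.92 \
   --right_padding_size=1.92 \
   --decoder_type=greedy \
   --language_code=<language_code>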

We are facing exactly the same issue. We built the rmir file with the following command on a V100 GPU:

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=conformer-en-US-asr-streaming-throughput \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=40 \
   --nn.fp16_needs_obey_precision_pass \
   --vad.vad_start_history=200 \
   --vad.residue_blanks_at_start=-2 \
   --chunk_size=0.8 \
   --left_padding_size=1.6 \
   --right_padding_size=1.6 \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=-1 \
   --decoding_language_model_binary=<lm_binary> \
   --decoding_vocab=<decoder_vocab_file> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --language_code=en-US

The build took almost 20 minutes to convert the model to a TRT plan. We tried the servicemaker and riva-server from the 2.2.0 release. Still, most of the time the transcripts get the weird start_time (start_time is constant and much higher than end_time), and the confidence value is always 1.0.

@rvinobha any updates on this?

Hi @ilb

Thanks for your interest in Riva

Currently our internal team is unable to reproduce the issue, because we do not have the audio file in question (the one that leads to the weird start time).

Could you please share the audio file (even if it is non-English)?

Thanks

I have uploaded an example audio file. In my case the invalid timestamp arises after a final that finishes at 21240 ms: the first word of the next final ends at 22400 ms but has an invalid startTime of 1302720 (roughly 21.7 minutes). All interim results prior to this final had this same startTime for that word. There are of course other examples, but I am not listing them all.

I have also tried transcribing with the en-US Conformer-CTC model, but there I see no words where startTime >> endTime.

Hi @ilb

Thanks for your interest in Riva

Thank you so much for sharing the audio file, we really appreciate it.
I will share it with the team and provide updates on the issue soon.

Thanks

Hi @shahin.konadath and @ilb

Thanks for your interest in Riva

I have some updates from the team,

We will fix this bug in an upcoming release.

As a workaround, to overcome the issue please set the CTC decoder configuration parameter asr_model_delay to 0 (zero).

For example, if you are using the flashlight_decoder, pass:
--flashlight_decoder.asr_model_delay=0
The same applies to other decoders.
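Concretely, applied to the riva-build command shared earlier in this thread, only asr_model_delay changes from -1 to 0. This is a sketch; every other flag is copied verbatim from that post.

# Same build as before, with asr_model_delay changed from -1 to 0; rebuild the rmir and redeploy
riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=conformer-en-US-asr-streaming-throughput \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=40 \
   --nn.fp16_needs_obey_precision_pass \
   --vad.vad_start_history=200 \
   --vad.residue_blanks_at_start=-2 \
   --chunk_size=0.8 \
   --left_padding_size=1.6 \
   --right_padding_size=1.6 \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=0 \
   --decoding_language_model_binary=<lm_binary> \
   --decoding_vocab=<decoder_vocab_file> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --language_code=en-US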

Let us know if you need any other information or help with this.

Thanks for your patience and apologies for the delay

@rvinobha Unfortunately, in my case this did not help. I have tested with pytorch:22.06-py3 + nemo:1.11.0rc0 and Riva 2.3.0. I built the rmir with the following command:

riva-build speech_recognition \
   <path_to_rmir> <path_to_riva> \
   --force \
   --name=<model_name> \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=40 \
   --nn.use_onnx_runtime \
   --nn.fp16_needs_obey_precision_pass \
   --vad.vad_start_history=200 \
   --vad.residue_blanks_at_start=-2 \
   --chunk_size=0.16 \
   --left_padding_size=1.92 \
   --right_padding_size=1.92 \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=0 \
   --decoding_language_model_arpa=<pruned_arpa_lm> \
   --rescoring_language_model_carpa=<carpa_lm> \
   --decoding_lexicon=<bpe_tokenized_lexicon> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --language_code=<language_code>

Hi @ilb

Apologies that it is not working for you.

Thanks for your feedback, we appreciate it. I will check further with the team and update soon.

Thanks