[BUG] Conformer CTC streaming ASR with timestamps enabled returns weird start time of first word

Please provide the following information when requesting support.

Hardware - GPU (T4)
Hardware - CPU 4
Operating System ubuntu:20.04
Riva Version 2.0.0

A simple test gives me the following timestamps:

Result: alternatives {
  transcript: "good morning"
  words {
    start_time: 1306600
    end_time: 200
    word: "good"
  }
  words {
    start_time: 440
    end_time: 520
    word: "morning"
  }
}

The start_time of the first word is always 1306600. This seems to be a bug; I would expect the true start_time of the word.


Hi @ilb

Thanks for your interest in Riva,

Thanks for letting us know about this; I will check further with the team and provide an update.

Hi, can you please share the audio file in question?

We are also facing the same issue. Currently it's a blocker on our live systems.

Hi @shahin.konadath

Thanks for your interest in Riva,

Can you please share the audio file/sample that leads to the issue (the weird start time)?

Thanks

Unfortunately this is not possible. Even if I shared the file, you would not be able to test it: I am using a custom Conformer-CTC model for a non-English language, which I cannot share. The model was built with nemo:1.8.2 following the standard training procedures, then converted first to .riva, then to an .rmir, and finally to the deployed models, following the steps described in the Riva 2.2.0 documentation. I have no problems running the model with NeMo alone; I only see issues once I deploy it to Riva.

I have tried --nn.use_onnx_runtime as well as converting to a TensorRT plan, which by the way takes a very long time (~34 min for TRT vs. ~2 min for ONNX). I am running Riva on a T4 GPU. I have tried disabling VAD, and I am planning to try a greedy decoder to rule out any issue caused by the language model and lexicon. Regardless of the approach, I always get final results where the first word in the transcript has an invalid, constant timestamp, e.g.:

{'results': [{'alternatives': [{'transcript': '....',
                                'confidence': 1.0,
                                'words': [{'startTime': 1303680, 'endTime': 22400, 'word': '...'},
                                          {'startTime': 22640, 'endTime': 22720, 'word': '...'},
                                          ...,
                                          {'startTime': 27920, 'endTime': 28000, 'word': '...'}]}],
              'isFinal': True,
              'channelTag': 1,
              'audioProcessed': 29.599977}]}

The value is slightly different depending on whether I run Riva 2.0.0 or 2.2.0, but it is always the first word of a transcript: either the first word of the entire file, or mid-file, immediately following a previous VAD segment (is_final).

Update: conversion to rmir with a greedy decoder succeeds with no errors, and so does deploying to Riva, i.e. the model loads successfully, but when I try to transcribe anything I get no transcripts back. I am running examples/transcribe_file_verbose.py with additional printouts of every response, but with the greedy decoder there are simply no responses.
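For reference, a greedy build of this kind looks roughly like the following sketch. This is only a sketch: --decoder_type=greedy is assumed here to be an accepted riva-build option, and the featurizer and chunk flags mirror the ones I use for my flashlight builds; the language model, lexicon, and flashlight_decoder flags are simply dropped.

# Sketch: greedy CTC decoding, no language model or lexicon
riva-build speech_recognition \
   <path_to_rmir> <path_to_riva> \
   --name=<model_name> \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=40 \
   --chunk_size=0.16 \
   --left_padding_size=1.92 \
   --right_padding_size=1.92 \
   --decoder_type=greedy \
   --language_code=<language_code>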

We are facing exactly the same issue. We built the rmir file with the following command on a V100 GPU:

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=conformer-en-US-asr-streaming-throughput \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=40 \
   --nn.fp16_needs_obey_precision_pass \
   --vad.vad_start_history=200 \
   --vad.residue_blanks_at_start=-2 \
   --chunk_size=0.8 \
   --left_padding_size=1.6 \
   --right_padding_size=1.6 \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=-1 \
   --decoding_language_model_binary=<lm_binary> \
   --decoding_vocab=<decoder_vocab_file> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --language_code=en-US

The build took almost 20 minutes to convert the model to a TRT plan. We tried the servicemaker and riva-server from the 2.2.0 release. Still, most of the time the transcripts get the weird start_time (start_time is constant and much higher than end_time), and the confidence value is always 1.0.

@rvinobha any updates on this?

Hi @ilb

Thanks for your interest in Riva

Currently our internal team is unable to reproduce the issue, because we do not have the audio file in question (the one that leads to the weird start time).

Could you please share the audio file (even if it is non-English)?

Thanks

I have uploaded an example audio file. In my case the invalid timestamp arises after a final that finishes at 21240 ms: the first word of the next final ends at 22400 ms but has an invalid startTime of 1302720 (roughly 21.7 minutes). All interim results prior to this final had this same startTime for that word. There are of course other examples, but I am not listing them all.

I have also tried transcribing with the en-US Conformer-CTC model, but there I see no words where startTime >> endTime.

Hi @ilb

Thanks for your interest in Riva

Thank you so much for sharing the audio file, we really appreciate it.
I will share it with the team and provide updates on the issue soon.

Thanks

Hi @shahin.konadath and @ilb

Thanks for your interest in Riva

I have some updates from the team,

We will fix this bug in an upcoming release.

As a workaround, to overcome the issue please set the CTC decoder configuration parameter asr_model_delay to 0 (zero).

For example, if you are using the flashlight_decoder, pass:
--flashlight_decoder.asr_model_delay=0
The same applies to other decoders.
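Concretely, applied to the riva-build command shared earlier in this thread, only asr_model_delay changes from -1 to 0. This is a sketch; every other flag is copied verbatim from that post.

# Same build as before, with asr_model_delay changed from -1 to 0; rebuild the rmir and redeploy
riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=conformer-en-US-asr-streaming-throughput \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=40 \
   --nn.fp16_needs_obey_precision_pass \
   --vad.vad_start_history=200 \
   --vad.residue_blanks_at_start=-2 \
   --chunk_size=0.8 \
   --left_padding_size=1.6 \
   --right_padding_size=1.6 \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=0 \
   --decoding_language_model_binary=<lm_binary> \
   --decoding_vocab=<decoder_vocab_file> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --language_code=en-US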

Let us know if you need any other information or help with this.

Thanks for your patience and apologies for the delay

@rvinobha Unfortunately, in my case this did not help. I have tested with pytorch:22.06-py3 + nemo:1.11.0rc0 and Riva 2.3.0. I built the rmir with the following command:

riva-build speech_recognition \
   <path_to_rmir> <path_to_riva> \
   --force \
   --name=<model_name> \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=40 \
   --nn.use_onnx_runtime \
   --nn.fp16_needs_obey_precision_pass \
   --vad.vad_start_history=200 \
   --vad.residue_blanks_at_start=-2 \
   --chunk_size=0.16 \
   --left_padding_size=1.92 \
   --right_padding_size=1.92 \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=0 \
   --decoding_language_model_arpa=<pruned_arpa_lm> \
   --rescoring_language_model_carpa=<carpa_lm> \
   --decoding_lexicon=<bpe_tokenized_lexicon> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --language_code=<language_code>

Hi @ilb

Apologies that it is not working for you.

Thanks for your feedback, we appreciate it. I will check further with the team and update soon.

Thanks