Riva en-US with LM: when stability changes, interim results drop already-predicted but less stable words

Environment:

  • NeMo 1.12.0 (nemo:22.08)
  • Riva 2.7.0
  • T4
  • English Conformer CTC model + LM + Neural-based VAD all downloaded from links in Riva documentation.

The model was deployed with the following pipeline configuration:

riva-build speech_recognition \
  <RMIR> <RIVA> <VAD> \
   --name=<name> \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=40 \
   --nn.use_onnx_runtime \
   --nn.fp16_needs_obey_precision_pass \
   --chunk_size=0.16 \
   --left_padding_size=1.92 \
   --right_padding_size=1.92 \
   --decoder_type=flashlight \
   --decoding_vocab=<vocab> \
   --decoding_language_model_arpa=<arpa> \
   --rescoring_language_model_carpa=<carpa> \
   --flashlight_decoder.lm_weight=0.8 \
   --flashlight_decoder.word_insertion_score=1 \
   --flashlight_decoder.beam_threshold=20.0 \
   --flashlight_decoder.beam_size=32 \
   --flashlight_decoder.beam_size_token=10 \
   --flashlight_decoder.num_tokenization=1 \
   --flashlight_decoder.asr_model_delay=-1 \
   --endpointing.residue_blanks_at_start=-2 \
   --vad_type=neural \
   --neural_vad_nn.optimization_graph_level=-1

The test file was a section of Bill Gates’ Harvard Commencement Speech.

In the trace below there is a noticeable discontinuity between the end of the segment that has reached stability 0.9 and the start of the segment with stability 0.1 (the boundary is marked with the pipe symbol):

...
idx: 214
response: {'results': [{'alternatives': [{'transcript': 'members of the faculty', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32920, 'endTime': 32960, 'word': 'of'}, {'startTime': 33080, 'endTime': 33120, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.239975}]}

members of the faculty
----

idx: 215
response: {'results': [{'alternatives': [{'transcript': 'members of the faculty', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32920, 'endTime': 32960, 'word': 'of'}, {'startTime': 33080, 'endTime': 33120, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.399975}]}

members of the faculty
----

idx: 216
response: {'results': [{'alternatives': [{'transcript': 'of the faculty parents', 'words': [{'startTime': 32920, 'endTime': 32960, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}, {'startTime': 34240, 'endTime': 34560, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.559975}]}

of the faculty parents
----

idx: 217
response: {'results': [{'alternatives': [{'transcript': 'of the faculty parents', 'words': [{'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}, {'startTime': 34280, 'endTime': 34640, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.719975}]}

of the faculty parents
----

idx: 218
response: {'results': [{'alternatives': [{'transcript': 'members ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 34.879974}, {'alternatives': [{'transcript': 'the faculty parents', 'words': [{'startTime': 33040, 'endTime': 33080, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}, {'startTime': 34280, 'endTime': 34680, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.879974}]}

members |the faculty parents
----

idx: 219
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.039974}, {'alternatives': [{'transcript': 'faculty parents', 'words': [{'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}, {'startTime': 34280, 'endTime': 34680, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.039974}]}

members of the |faculty parents
----

idx: 220
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.199974}, {'alternatives': [{'transcript': 'cult parents', 'words': [{'startTime': 33360, 'endTime': 33520, 'word': 'cult'}, {'startTime': 34280, 'endTime': 34680, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.199974}]}

members of the |cult parents
----

idx: 221
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.359974}, {'alternatives': [{'transcript': 'parents', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.359974}]}

members of the |parents
----

idx: 222
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.519974}, {'alternatives': [{'transcript': 'parents and', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35280, 'endTime': 35320, 'word': 'and'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.519974}]}

members of the |parents and
----

idx: 223
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.679974}, {'alternatives': [{'transcript': 'parents and especially', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35320, 'endTime': 35360, 'word': 'and'}, {'startTime': 35400, 'endTime': 35760, 'word': 'especially'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.679974}]}

members of the |parents and especially
----

idx: 224
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.839973}, {'alternatives': [{'transcript': 'parents and especially', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35320, 'endTime': 35360, 'word': 'and'}, {'startTime': 35440, 'endTime': 35880, 'word': 'especially'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.839973}]}

members of the |parents and especially
----

idx: 225
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.999973}, {'alternatives': [{'transcript': 'parents and especially the', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35320, 'endTime': 35360, 'word': 'and'}, {'startTime': 35480, 'endTime': 35920, 'word': 'especially'}, {'startTime': 36000, 'endTime': 36040, 'word': 'the'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.999973}]}

members of the |parents and especially the
----

idx: 226
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 36.159973}, {'alternatives': [{'transcript': 'parents and especially the great', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35360, 'endTime': 35400, 'word': 'and'}, {'startTime': 35480, 'endTime': 35920, 'word': 'especially'}, {'startTime': 35960, 'endTime': 36000, 'word': 'the'}, {'startTime': 36000, 'endTime': 36160, 'word': 'great'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 36.159973}]}

members of the |parents and especially the great
----

idx: 227
response: {'results': [{'alternatives': [{'transcript': 'members of the faculty ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 36.319973}, {'alternatives': [{'transcript': 'and especially the gradu', 'words': [{'startTime': 35360, 'endTime': 35400, 'word': 'and'}, {'startTime': 35480, 'endTime': 35960, 'word': 'especially'}, {'startTime': 36000, 'endTime': 36040, 'word': 'the'}, {'startTime': 36080, 'endTime': 36320, 'word': 'gradu'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 36.319973}]}

members of the faculty |and especially the gradu
----
...
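The dropped word can also be found mechanically. Below is a minimal sketch, assuming the responses are available as Python dicts shaped like the ones in the trace. The helper names (`flatten_words`, `dropped_words`, `_result`) and the 80 ms tolerance are my own choices for illustration and are not part of the Riva client. It only flags words that vanish, not words that change into something else (such as 'faculty' becoming 'cult' at idx 220).

```python
def flatten_words(response):
    """Return the word dicts of all results in a response, in order."""
    words = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            words.extend(alternatives[0].get("words", []))
    return words


def dropped_words(prev, curr, tolerance_ms=80):
    """Words of `prev` whose time span is not covered by any word of `curr`."""
    curr_words = flatten_words(curr)
    dropped = []
    for w in flatten_words(prev):
        # A word counts as covered if some current word overlaps it in time,
        # allowing a small tolerance for timestamp jitter between responses.
        covered = any(
            c["startTime"] - tolerance_ms < w["endTime"]
            and w["startTime"] < c["endTime"] + tolerance_ms
            for c in curr_words
        )
        if not covered:
            dropped.append(w["word"])
    return dropped


def _result(stability, words):
    """Build a reduced result dict from (word, startTime, endTime) tuples."""
    return {
        "alternatives": [{"words": [
            {"word": t, "startTime": s, "endTime": e} for t, s, e in words
        ]}],
        "stability": stability,
    }


# Reduced copies of idx 219 and idx 221 from the trace above.
stable = [("members", 32560, 32880), ("of", 32960, 33000), ("the", 33040, 33080)]
idx_219 = {"results": [
    _result(0.9, stable),
    _result(0.1, [("faculty", 33200, 33600), ("parents", 34280, 34680)]),
]}
idx_221 = {"results": [
    _result(0.9, stable),
    _result(0.1, [("parents", 34280, 34680)]),
]}

print(dropped_words(idx_219, idx_221))  # ['faculty']
```

Running this over consecutive responses of a whole file gives a rough count of how often the problem occurs for a given configuration.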

The discontinuity appears only with the flashlight decoder; with a greedy decoder there is no such problem. In fact, with a greedy decoder no stability change occurs at all: stability stays at 0.1 for the entire duration of the interim results.

How should I change the pipeline configuration parameters to reduce or remove this discontinuity (i.e. not lose any words) and to decrease the latency of the 0.1/0.9 split? In the trace above, approximately the most recent 1.5 s of the transcript has stability 0.1, while older parts have stability 0.9. I would like to reduce this number as much as possible.
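One way to put a number on that latency is to measure how far the stability-0.9 portion trails the processed audio. This is a rough sketch under my own assumptions: `stable_lag_seconds` and its 0.5 threshold are made up for illustration, and the word `endTime` values are taken to be milliseconds because that is what the trace suggests when compared with `audioProcessed`.

```python
def stable_lag_seconds(response, stable_threshold=0.5):
    """Seconds between the last high-stability word and the processed audio.

    Returns None if the response has no high-stability words.
    """
    results = response.get("results", [])
    stable_ends_ms = [
        w["endTime"]
        for r in results
        if r.get("stability", 0.0) >= stable_threshold
        for w in r["alternatives"][0]["words"]
    ]
    if not stable_ends_ms:
        return None
    audio_processed_s = max(r["audioProcessed"] for r in results)
    # Assumption: endTime is in milliseconds, audioProcessed in seconds.
    return audio_processed_s - max(stable_ends_ms) / 1000.0


# Reduced copy of idx 219 from the trace above.
idx_219 = {"results": [
    {"alternatives": [{"words": [
        {"word": "members", "startTime": 32560, "endTime": 32880},
        {"word": "of", "startTime": 32960, "endTime": 33000},
        {"word": "the", "startTime": 33040, "endTime": 33080},
    ]}], "stability": 0.9, "audioProcessed": 35.039974},
    {"alternatives": [{"words": [
        {"word": "faculty", "startTime": 33200, "endTime": 33600},
        {"word": "parents", "startTime": 34280, "endTime": 34680},
    ]}], "stability": 0.1, "audioProcessed": 35.039974},
]}

print(round(stable_lag_seconds(idx_219), 2))  # 1.96
```

This includes any silence after the last stable word, so it is an upper bound on the amount of low-stability speech rather than an exact measure.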

Hi @ilb

Thanks for your interest in Riva

I will check regarding this issue with the internal team

Thanks

@rvinobha any news on this? For my use case it is really problematic.

Hi @ilb

Apologies for the delay.

The internal team is still debugging the issue. Once I have an update, I will share it with you.

Thanks

@rvinobha any updates?

Hi @ilb

Sincere apologies, I still don’t have any updates on this thread. I have asked the concerned team again to check and provide feedback at the earliest.

Thanks

@rvinobha any progress? Actually any info would be really welcome, as this is quite a major drawback for my use case.

Hi @ilb

Sincere apologies. I have pushed for updates and will push again.
I am still waiting for an update on the internal ticket that was created.

Thanks

@rvinobha I’ve tested with the latest Riva version (2.9.0) and I still have the same issue. Any news from the team? If a different model would be more suitable, please let me know.

With the transcribe_file.py example script, the issue can be seen as follows. Assume an intermediate transcript in which one part has stability 0.9 and the rest has stability 0.1. As time progresses and new intermediate results are displayed, the portion with stability 0.9 grows: individual words are removed from the start of the stability-0.1 portion and appended to the stability-0.9 portion. However, these words (even when they were correct in the first place) often first change to something else or disappear completely before they reappear in the correct form at the end of the stability-0.9 portion.
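For reference, the "stable |unstable" display lines in the trace can be rebuilt from a response dict roughly as follows. This is only a sketch, not the actual code of transcribe_file.py: the function name `render` and the 0.5 threshold are assumptions of mine, and only the dict layout shown in the trace is relied on.

```python
def render(response, stable_threshold=0.5):
    """Join the transcripts of a response, with '|' before the low-stability part."""
    stable, unstable = [], []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        text = alternatives[0].get("transcript", "").strip()
        if not text:
            continue
        if result.get("stability", 0.0) >= stable_threshold:
            stable.append(text)
        else:
            unstable.append(text)
    if stable:
        return " ".join(stable) + " |" + " ".join(unstable)
    return " ".join(unstable)


# Transcripts and stabilities copied from idx 218 and idx 214 of the trace.
idx_218 = {"results": [
    {"alternatives": [{"transcript": "members "}], "stability": 0.9},
    {"alternatives": [{"transcript": "the faculty parents"}], "stability": 0.1},
]}
idx_214 = {"results": [
    {"alternatives": [{"transcript": "members of the faculty"}], "stability": 0.1},
]}

print(render(idx_218))  # members |the faculty parents
print(render(idx_214))  # members of the faculty
```

Printing one such line per response makes it easy to see a word leave the right-hand side before it shows up on the left-hand side.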

Based on experiments, there seems to be a relation with the parameters left_padding_size, right_padding_size and chunk_size: the stability-0.1 portion seems to contain right_padding_size of audio data, followed by chunk_size of audio data (0.16 s by default). Could this lead to the “disappearing words”, given that all but the shortest words take more than 0.16 s?
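To make the hypothesis above concrete, here is the arithmetic with the values from the riva-build command in the first post. It only restates the guess that the stability-0.1 portion spans right_padding_size plus chunk_size; it is not a description of how Riva works internally. The word durations are simply endTime minus startTime from idx 219 of the trace.

```python
# Values taken from the riva-build command in the first post.
chunk_size = 0.16          # seconds
right_padding_size = 1.92  # seconds
ms_per_timestep = 40       # milliseconds

# Hypothesis only: low-stability audio = right padding + one chunk.
hypothesised_window_s = round(right_padding_size + chunk_size, 2)

# Number of acoustic-model timesteps that fit into one chunk.
timesteps_per_chunk = round(chunk_size * 1000 / ms_per_timestep)

# Word durations in ms, computed from the timestamps at idx 219 of the trace.
durations_ms = {
    "members": 32880 - 32560,
    "of": 33000 - 32960,
    "the": 33080 - 33040,
    "faculty": 33600 - 33200,
    "parents": 34680 - 34280,
}
longer_than_chunk = [w for w, d in durations_ms.items() if d > chunk_size * 1000]

print(hypothesised_window_s)  # 2.08
print(timesteps_per_chunk)    # 4
print(longer_than_chunk)      # ['members', 'faculty', 'parents']
```

In this small sample, the words that get dropped or garbled in the trace ('members', 'faculty') are among those whose reported duration exceeds one chunk, which is consistent with the guess but does not prove it.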

Hi @ilb

Apologies for the long delay

I have feedback on this issue from the internal team

“Yes we expect stability to improve with cache-aware. The decoder could be tuned to give better stability e.g. reducing beam-size, and beam-size-token. partial transcripts are being returned here also. We could just look at final transcripts which are stable. Using the streaming-throughput configuration should also produce more stable results given the longer chunk size.”

Let me know if you have further questions, and I will raise them with the team.
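The suggestion to look only at final transcripts could be sketched on the client side as below. This assumes the responses are dicts like those in the trace and that a final result carries an `isFinal` key (the JSON form of the proto field `is_final`). That key does not appear in the trace above, so please treat it as an assumption. The function name and the sample responses are hypothetical.

```python
def final_transcripts(responses):
    """Yield the top transcript of every result that is marked final."""
    for response in responses:
        for result in response.get("results", []):
            alternatives = result.get("alternatives", [])
            # Assumption: final results are flagged with 'isFinal': True.
            if result.get("isFinal") and alternatives:
                yield alternatives[0].get("transcript", "").strip()


# Hypothetical sample: one interim response followed by one final response.
sample_responses = [
    {"results": [
        {"alternatives": [{"transcript": "members of the "}], "stability": 0.9},
        {"alternatives": [{"transcript": "parents"}], "stability": 0.1},
    ]},
    {"results": [
        {"alternatives": [{"transcript": "members of the faculty parents "}],
         "isFinal": True},
    ]},
]

print(list(final_transcripts(sample_responses)))
# ['members of the faculty parents']
```

This avoids the dropped words entirely, at the cost of waiting for the end of each utterance, so it does not help when low-latency interim text is required.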