Riva en-US with LM: interim results drop already predicted, less stable words when stability changes

Environment:

  • NeMo 1.12.0 (nemo:22.08)
  • Riva 2.7.0
  • T4
  • English Conformer CTC model, LM, and neural VAD, all downloaded from the links in the Riva documentation.

The model was deployed with the following pipeline configuration:

riva-build speech_recognition \
   <RMIR> <RIVA> <VAD> \
   --name=<name> \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=40 \
   --nn.use_onnx_runtime \
   --nn.fp16_needs_obey_precision_pass \
   --chunk_size=0.16 \
   --left_padding_size=1.92 \
   --right_padding_size=1.92 \
   --decoder_type=flashlight \
   --decoding_vocab=<vocab> \
   --decoding_language_model_arpa=<arpa> \
   --rescoring_language_model_carpa=<carpa> \
   --flashlight_decoder.lm_weight=0.8 \
   --flashlight_decoder.word_insertion_score=1 \
   --flashlight_decoder.beam_threshold=20.0 \
   --flashlight_decoder.beam_size=32 \
   --flashlight_decoder.beam_size_token=10 \
   --flashlight_decoder.num_tokenization=1 \
   --flashlight_decoder.asr_model_delay=-1 \
   --endpointing.residue_blanks_at_start=-2 \
   --vad_type=neural \
   --neural_vad_nn.optimization_graph_level=-1

The test file was a section of Bill Gates’ Harvard Commencement Speech.

In the trace below there is a noticeable discontinuity between the end of the segment that has reached stability 0.9 and the start of the segment that still has stability 0.1. The boundary between the two is marked with a pipe symbol in the plain-text line under each response:

...
idx: 214
response: {'results': [{'alternatives': [{'transcript': 'members of the faculty', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32920, 'endTime': 32960, 'word': 'of'}, {'startTime': 33080, 'endTime': 33120, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.239975}]}

members of the faculty
----

idx: 215
response: {'results': [{'alternatives': [{'transcript': 'members of the faculty', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32920, 'endTime': 32960, 'word': 'of'}, {'startTime': 33080, 'endTime': 33120, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.399975}]}

members of the faculty
----

idx: 216
response: {'results': [{'alternatives': [{'transcript': 'of the faculty parents', 'words': [{'startTime': 32920, 'endTime': 32960, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}, {'startTime': 34240, 'endTime': 34560, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.559975}]}

of the faculty parents
----

idx: 217
response: {'results': [{'alternatives': [{'transcript': 'of the faculty parents', 'words': [{'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}, {'startTime': 34280, 'endTime': 34640, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.719975}]}

of the faculty parents
----

idx: 218
response: {'results': [{'alternatives': [{'transcript': 'members ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 34.879974}, {'alternatives': [{'transcript': 'the faculty parents', 'words': [{'startTime': 33040, 'endTime': 33080, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}, {'startTime': 34280, 'endTime': 34680, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.879974}]}

members |the faculty parents
----

idx: 219
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.039974}, {'alternatives': [{'transcript': 'faculty parents', 'words': [{'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}, {'startTime': 34280, 'endTime': 34680, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.039974}]}

members of the |faculty parents
----

idx: 220
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.199974}, {'alternatives': [{'transcript': 'cult parents', 'words': [{'startTime': 33360, 'endTime': 33520, 'word': 'cult'}, {'startTime': 34280, 'endTime': 34680, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.199974}]}

members of the |cult parents
----

idx: 221
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.359974}, {'alternatives': [{'transcript': 'parents', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.359974}]}

members of the |parents
----

idx: 222
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.519974}, {'alternatives': [{'transcript': 'parents and', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35280, 'endTime': 35320, 'word': 'and'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.519974}]}

members of the |parents and
----

idx: 223
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.679974}, {'alternatives': [{'transcript': 'parents and especially', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35320, 'endTime': 35360, 'word': 'and'}, {'startTime': 35400, 'endTime': 35760, 'word': 'especially'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.679974}]}

members of the |parents and especially
----

idx: 224
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.839973}, {'alternatives': [{'transcript': 'parents and especially', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35320, 'endTime': 35360, 'word': 'and'}, {'startTime': 35440, 'endTime': 35880, 'word': 'especially'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.839973}]}

members of the |parents and especially
----

idx: 225
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.999973}, {'alternatives': [{'transcript': 'parents and especially the', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35320, 'endTime': 35360, 'word': 'and'}, {'startTime': 35480, 'endTime': 35920, 'word': 'especially'}, {'startTime': 36000, 'endTime': 36040, 'word': 'the'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.999973}]}

members of the |parents and especially the
----

idx: 226
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 36.159973}, {'alternatives': [{'transcript': 'parents and especially the great', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35360, 'endTime': 35400, 'word': 'and'}, {'startTime': 35480, 'endTime': 35920, 'word': 'especially'}, {'startTime': 35960, 'endTime': 36000, 'word': 'the'}, {'startTime': 36000, 'endTime': 36160, 'word': 'great'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 36.159973}]}

members of the |parents and especially the great
----

idx: 227
response: {'results': [{'alternatives': [{'transcript': 'members of the faculty ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 36.319973}, {'alternatives': [{'transcript': 'and especially the gradu', 'words': [{'startTime': 35360, 'endTime': 35400, 'word': 'and'}, {'startTime': 35480, 'endTime': 35960, 'word': 'especially'}, {'startTime': 36000, 'endTime': 36040, 'word': 'the'}, {'startTime': 36080, 'endTime': 36320, 'word': 'gradu'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 36.319973}]}

members of the faculty |and especially the gradu
----
...
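
To make the gap easier to spot automatically, here is a minimal Python sketch that compares two consecutive interim responses and lists the words from the earlier one that fall strictly between the 0.9 and 0.1 results of the later one. It assumes the responses are plain dicts shaped like the ones printed in the trace, and that the first result is the stable one and the second the unstable one, as in the trace. The helper names `flatten_words` and `dropped_words` are my own and not part of the Riva client API.

```python
def flatten_words(response):
    """Return every word entry of a response, across all its results, in order."""
    return [
        w
        for result in response["results"]
        for w in result["alternatives"][0].get("words", [])
    ]


def dropped_words(prev_response, response):
    """List words of prev_response lying in the time gap between the stable
    (first) and unstable (second) result of response.

    Assumption: results[0] is the 0.9 segment and results[1] the 0.1 segment,
    which matches the trace but is not guaranteed in general.
    """
    results = response["results"]
    if len(results) < 2:
        return []  # no stability split yet, so nothing can fall in a gap
    stable = results[0]["alternatives"][0].get("words", [])
    unstable = results[1]["alternatives"][0].get("words", [])
    if not stable or not unstable:
        return []
    gap_start = stable[-1]["endTime"]      # end of last stable word (ms)
    gap_end = unstable[0]["startTime"]     # start of first unstable word (ms)
    return [
        w["word"]
        for w in flatten_words(prev_response)
        if gap_start <= w["startTime"] and w["endTime"] <= gap_end
    ]


# Abbreviated copies of idx 217 and idx 218 from the trace above.
idx_217 = {"results": [{"alternatives": [{"words": [
    {"startTime": 32960, "endTime": 33000, "word": "of"},
    {"startTime": 33040, "endTime": 33080, "word": "the"},
    {"startTime": 33200, "endTime": 33600, "word": "faculty"},
    {"startTime": 34280, "endTime": 34640, "word": "parents"}]}],
    "stability": 0.1}]}
idx_218 = {"results": [
    {"alternatives": [{"words": [
        {"startTime": 32560, "endTime": 32880, "word": "members"}]}],
     "stability": 0.9},
    {"alternatives": [{"words": [
        {"startTime": 33040, "endTime": 33080, "word": "the"},
        {"startTime": 33200, "endTime": 33600, "word": "faculty"},
        {"startTime": 34280, "endTime": 34680, "word": "parents"}]}],
     "stability": 0.1}]}

print(dropped_words(idx_217, idx_218))  # ['of']
```

This only flags words that sit entirely inside the gap. A word that gets replaced by an overlapping hypothesis (such as "faculty" becoming "cult" at idx 220) is not reported until the replacement itself disappears.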

The discontinuity appears only with the flashlight decoder; with a greedy decoder there is no such problem. In fact, with a greedy decoder no stability change occurs at all: stability stays at 0.1 for the entire duration of the interim results.

How should I change the pipeline configuration parameters to reduce or remove this discontinuity (i.e. not lose words), and to decrease the latency of the 0.1/0.9 split? In the trace above, approximately 1.5 s of the most recent transcript has stability 0.1, while older transcripts have stability 0.9. I would like to reduce this number as much as possible.
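
For reference, this is roughly how I measure the length of the 0.1 region for a single response: the processed audio time minus the start time of the first word in the low-stability result. It is a sketch under assumptions inferred from the trace: `startTime` is in milliseconds, `audioProcessed` is in seconds, and the response is a plain dict as printed above. The function name `unstable_span_seconds` and the `threshold` parameter are hypothetical.

```python
def unstable_span_seconds(response, threshold=0.5):
    """Seconds of most recent audio covered only by low-stability results.

    Assumes startTime is in milliseconds and audioProcessed in seconds,
    as the values in the trace suggest.
    """
    results = response["results"]
    audio_processed = max(r["audioProcessed"] for r in results)
    # Start time (ms) of the first word of each low-stability result.
    starts = [
        r["alternatives"][0]["words"][0]["startTime"]
        for r in results
        if r["stability"] < threshold and r["alternatives"][0].get("words")
    ]
    if not starts:
        return 0.0  # everything returned so far is already stable
    return audio_processed - min(starts) / 1000.0


# Abbreviated copy of idx 223 from the trace above.
idx_223 = {"results": [
    {"alternatives": [{"words": [
        {"startTime": 32560, "endTime": 32880, "word": "members"},
        {"startTime": 32960, "endTime": 33000, "word": "of"},
        {"startTime": 33040, "endTime": 33080, "word": "the"}]}],
     "stability": 0.9, "audioProcessed": 35.679974},
    {"alternatives": [{"words": [
        {"startTime": 34280, "endTime": 34680, "word": "parents"},
        {"startTime": 35320, "endTime": 35360, "word": "and"},
        {"startTime": 35400, "endTime": 35760, "word": "especially"}]}],
     "stability": 0.1, "audioProcessed": 35.679974}]}

print(round(unstable_span_seconds(idx_223), 2))  # 1.4
```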

Hi @ilb

Thanks for your interest in Riva.

I will check on this issue with the internal team.

Thanks