Riva en-US with LM: interim results drop already-predicted, less stable words when stability changes

ilb · November 19, 2022, 9:47pm

Environment:

Nemo 1.12.0 (nemo:22.08)
Riva 2.7.0
T4
English Conformer CTC model + LM + Neural-based VAD all downloaded from links in Riva documentation.

The model was deployed with the following pipeline configuration

riva-build speech_recognition \
  <RMIR> <RIVA> <VAD> \
   --name=<name> \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --ms_per_timestep=40 \
   --nn.use_onnx_runtime \
   --nn.fp16_needs_obey_precision_pass \
   --chunk_size=0.16 \
   --left_padding_size=1.92 \
   --right_padding_size=1.92 \
   --decoder_type=flashlight \
   --decoding_vocab=<vocab> \
   --decoding_language_model_arpa=<arpa> \
   --rescoring_language_model_carpa=<carpa> \
   --flashlight_decoder.lm_weight=0.8 \
   --flashlight_decoder.word_insertion_score=1 \
   --flashlight_decoder.beam_threshold=20.0 \
   --flashlight_decoder.beam_size=32 \
   --flashlight_decoder.beam_size_token=10 \
   --flashlight_decoder.num_tokenization=1 \
   --flashlight_decoder.asr_model_delay=-1 \
   --endpointing.residue_blanks_at_start=-2 \
   --vad_type=neural \
   --neural_vad_nn.optimization_graph_level=-1

The test file was a section of Bill Gates’ Harvard Commencement Speech.

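In the trace below there is a noticeable discontinuity between the end of the segment that reached stability 0.9 and the start of the segment with stability 0.1 (the boundary is marked with the pipe symbol). For example, "of" is missing at idx 218, and "faculty" turns into "cult" at idx 220 and is absent from idx 221 to idx 226 before it reappears in the stable segment at idx 227: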

...
idx: 214
response: {'results': [{'alternatives': [{'transcript': 'members of the faculty', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32920, 'endTime': 32960, 'word': 'of'}, {'startTime': 33080, 'endTime': 33120, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.239975}]}

members of the faculty
----

idx: 215
response: {'results': [{'alternatives': [{'transcript': 'members of the faculty', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32920, 'endTime': 32960, 'word': 'of'}, {'startTime': 33080, 'endTime': 33120, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.399975}]}

members of the faculty
----

idx: 216
response: {'results': [{'alternatives': [{'transcript': 'of the faculty parents', 'words': [{'startTime': 32920, 'endTime': 32960, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}, {'startTime': 34240, 'endTime': 34560, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.559975}]}

of the faculty parents
----

idx: 217
response: {'results': [{'alternatives': [{'transcript': 'of the faculty parents', 'words': [{'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}, {'startTime': 34280, 'endTime': 34640, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.719975}]}

of the faculty parents
----

idx: 218
response: {'results': [{'alternatives': [{'transcript': 'members ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 34.879974}, {'alternatives': [{'transcript': 'the faculty parents', 'words': [{'startTime': 33040, 'endTime': 33080, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}, {'startTime': 34280, 'endTime': 34680, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.879974}]}

members |the faculty parents
----

idx: 219
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.039974}, {'alternatives': [{'transcript': 'faculty parents', 'words': [{'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}, {'startTime': 34280, 'endTime': 34680, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.039974}]}

members of the |faculty parents
----

idx: 220
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.199974}, {'alternatives': [{'transcript': 'cult parents', 'words': [{'startTime': 33360, 'endTime': 33520, 'word': 'cult'}, {'startTime': 34280, 'endTime': 34680, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.199974}]}

members of the |cult parents
----

idx: 221
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.359974}, {'alternatives': [{'transcript': 'parents', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.359974}]}

members of the |parents
----

idx: 222
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.519974}, {'alternatives': [{'transcript': 'parents and', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35280, 'endTime': 35320, 'word': 'and'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.519974}]}

members of the |parents and
----

idx: 223
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.679974}, {'alternatives': [{'transcript': 'parents and especially', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35320, 'endTime': 35360, 'word': 'and'}, {'startTime': 35400, 'endTime': 35760, 'word': 'especially'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.679974}]}

members of the |parents and especially
----

idx: 224
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.839973}, {'alternatives': [{'transcript': 'parents and especially', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35320, 'endTime': 35360, 'word': 'and'}, {'startTime': 35440, 'endTime': 35880, 'word': 'especially'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.839973}]}

members of the |parents and especially
----

idx: 225
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.999973}, {'alternatives': [{'transcript': 'parents and especially the', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35320, 'endTime': 35360, 'word': 'and'}, {'startTime': 35480, 'endTime': 35920, 'word': 'especially'}, {'startTime': 36000, 'endTime': 36040, 'word': 'the'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.999973}]}

members of the |parents and especially the
----

idx: 226
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 36.159973}, {'alternatives': [{'transcript': 'parents and especially the great', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35360, 'endTime': 35400, 'word': 'and'}, {'startTime': 35480, 'endTime': 35920, 'word': 'especially'}, {'startTime': 35960, 'endTime': 36000, 'word': 'the'}, {'startTime': 36000, 'endTime': 36160, 'word': 'great'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 36.159973}]}

members of the |parents and especially the great
----

idx: 227
response: {'results': [{'alternatives': [{'transcript': 'members of the faculty ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 36.319973}, {'alternatives': [{'transcript': 'and especially the gradu', 'words': [{'startTime': 35360, 'endTime': 35400, 'word': 'and'}, {'startTime': 35480, 'endTime': 35960, 'word': 'especially'}, {'startTime': 36000, 'endTime': 36040, 'word': 'the'}, {'startTime': 36080, 'endTime': 36320, 'word': 'gradu'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 36.319973}]}

members of the faculty |and especially the gradu
----
...

The discontinuity appears only with the flashlight decoder; with a greedy decoder there is no such problem. In fact, with a greedy decoder no stability change occurs at all: stability stays at 0.1 for the entire duration of the interim results.

How should I change the pipeline configuration parameters to reduce or remove this discontinuity (i.e. not lose words), and to decrease the latency of the 0.1/0.9 split? In the trace above, approximately the most recent 1.5 s of the transcript has stability 0.1, while older parts have stability 0.9. I would like to reduce this lag as much as possible.
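The dropped words described above can be flagged automatically by comparing consecutive interim responses. The following is a minimal sketch, assuming each response is a dict shaped like the ones in the trace; the helper names and the `tolerance_ms` value are hypothetical and not part of any Riva API.

```python
# Sketch: flag words that vanish between two consecutive interim responses.
# Assumes each response is a dict shaped like the trace above.

def flatten_words(response):
    """Collect (start_ms, word) pairs across all results of one response."""
    words = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            for w in alternatives[0].get("words", []):
                words.append((w["startTime"], w["word"]))
    return sorted(words)


def dropped_words(prev, curr, tolerance_ms=80):
    """Words in prev with no same-text word near the same start time in curr.

    tolerance_ms is a hypothetical knob; 80 ms is two 40 ms timesteps.
    """
    curr_words = flatten_words(curr)
    return [
        word
        for start, word in flatten_words(prev)
        if not any(word == w and abs(start - s) <= tolerance_ms
                   for s, w in curr_words)
    ]


def _w(start, end, word):
    """Build one word entry in the same shape as the trace."""
    return {"startTime": start, "endTime": end, "word": word}


# idx 217 and idx 218 from the trace, reduced to the fields used here.
resp_217 = {"results": [
    {"alternatives": [{"words": [
        _w(32960, 33000, "of"), _w(33040, 33080, "the"),
        _w(33200, 33600, "faculty"), _w(34280, 34640, "parents")]}],
     "stability": 0.1},
]}
resp_218 = {"results": [
    {"alternatives": [{"words": [_w(32560, 32880, "members")]}],
     "stability": 0.9},
    {"alternatives": [{"words": [
        _w(33040, 33080, "the"), _w(33200, 33600, "faculty"),
        _w(34280, 34680, "parents")]}],
     "stability": 0.1},
]}

print(dropped_words(resp_217, resp_218))  # ['of']
```

Note that this also reports words that simply slide out of the interim window, so the output needs to be read alongside the stable segment of the following responses.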

rvinobha · November 21, 2022, 5:13pm

Hi @ilb

Thanks for your interest in Riva

I will check regarding this issue with the internal team

Thanks

ilb · December 5, 2022, 8:57pm

@rvinobha any news on this? For my use case it is really problematic.

rvinobha · December 6, 2022, 6:04pm

Hi @ilb

Apologies for the delay.

The internal team is still debugging the issue. Once I have an update, I will share it with you.

Thanks

ilb · January 16, 2023, 4:46pm

@rvinobha any updates?

rvinobha · January 17, 2023, 6:41am

Hi @ilb

Sincere apologies, I still don’t have an update on this thread. I have asked the relevant team again to check and provide feedback at the earliest.

Thanks

ilb · February 3, 2023, 7:39pm

@rvinobha any progress? Actually any info would be really welcome, as this is quite a major drawback for my use case.

rvinobha · February 7, 2023, 6:54am

Hi @ilb

Sincere apologies. I have pushed for updates and will push again.
I am still waiting for an update on the internal ticket that was created.

Thanks

ilb · March 2, 2023, 9:42am

@rvinobha I’ve tested with the latest Riva version (2.9.0) and I still have the same issue. Any news from the team? If a different model would be more suitable, please let me know.

With the transcribe_file.py example script, the issue can be seen as follows. Take an intermediate transcript in which one portion has stability 0.9 and the rest has stability 0.1. As time progresses and new intermediate results are displayed, the stability-0.9 portion grows: words are removed from the start of the stability-0.1 portion and appended to the stability-0.9 portion. However, these words often change to something else or disappear completely (even when they were correct in the first place) before they reappear in the correct form at the end of the stability-0.9 portion.

Based on experiments, there seems to be a relation with the parameters left_padding_size, right_padding_size and chunk_size: the stability-0.1 portion seems to contain right_padding_size of audio data, followed by chunk_size of data. Could the chunk_size (a default value of 0.16) lead to the “disappearing words”, given that all but the shortest words take more than 0.16 s?
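One way to check this hypothesis against a trace is to measure, for each response, how far the processed audio runs ahead of the last stable word, and compare that with right_padding_size + chunk_size. The sketch below assumes the response dict layout shown in the trace in the first post; the function name and the 0.9 threshold are illustrative choices, not Riva API.

```python
# Sketch: measure the span of audio not yet covered by stable (0.9) words.

def unstable_span_ms(response, stable_threshold=0.9):
    """Processed audio minus the end time of the last stable word, in ms.

    Returns None when the response has no stable words.
    """
    last_stable_end = None
    audio_ms = None
    for result in response.get("results", []):
        # audioProcessed is in seconds; word times are in milliseconds.
        audio_ms = result["audioProcessed"] * 1000.0
        if result.get("stability", 0.0) >= stable_threshold:
            for w in result["alternatives"][0].get("words", []):
                if last_stable_end is None or w["endTime"] > last_stable_end:
                    last_stable_end = w["endTime"]
    if audio_ms is None or last_stable_end is None:
        return None
    return audio_ms - last_stable_end


# idx 219 from the trace, reduced to the fields used here.
resp_219 = {"results": [
    {"alternatives": [{"words": [
        {"startTime": 32560, "endTime": 32880, "word": "members"},
        {"startTime": 32960, "endTime": 33000, "word": "of"},
        {"startTime": 33040, "endTime": 33080, "word": "the"}]}],
     "stability": 0.9, "audioProcessed": 35.039974},
    {"alternatives": [{"words": [
        {"startTime": 33200, "endTime": 33600, "word": "faculty"},
        {"startTime": 34280, "endTime": 34680, "word": "parents"}]}],
     "stability": 0.1, "audioProcessed": 35.039974},
]}

# Values taken from the riva-build command in the first post.
right_padding_size = 1.92
chunk_size = 0.16
hypothesis_ms = (right_padding_size + chunk_size) * 1000.0

print(round(unstable_span_ms(resp_219)), round(hypothesis_ms))
```

The measured span includes any silence after the last stable word, so it only gives an upper bound on the decoder's lag; running it over many responses would show whether the value tracks the padding and chunk settings.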

rvinobha · March 28, 2023, 2:25pm

Hi @ilb

Apologies for the long delay

I have feedback on this issue from the internal team

“Yes we expect stability to improve with cache-aware. The decoder could be tuned to give better stability e.g. reducing beam-size, and beam-size-token. partial transcripts are being returned here also. We could just look at final transcripts which are stable. Using the streaming-throughput configuration should also produce more stable results given the longer chunk size.”

Let me know if you have further questions; I will raise them with the team.
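The suggestion to rely only on stable output can be applied on the client side. The following is a rough sketch, assuming responses are dicts shaped like the trace in the first post; `stable_text` is a hypothetical helper, and the `isFinal` key is an assumption about how the final-result flag is rendered in these dicts, since the trace does not show it.

```python
# Sketch: keep only the stable portion of each streaming response.

def stable_text(response, min_stability=0.9):
    """Join transcripts of results that are final or meet min_stability."""
    parts = []
    for result in response.get("results", []):
        # "isFinal" is an assumed key name; it does not appear in the trace.
        is_final = result.get("isFinal", False)
        if is_final or result.get("stability", 0.0) >= min_stability:
            alternatives = result.get("alternatives", [])
            if alternatives:
                parts.append(alternatives[0]["transcript"].strip())
    return " ".join(p for p in parts if p)


# idx 217 and idx 218 from the trace, reduced to the fields used here.
resp_217 = {"results": [
    {"alternatives": [{"transcript": "of the faculty parents"}],
     "stability": 0.1},
]}
resp_218 = {"results": [
    {"alternatives": [{"transcript": "members "}], "stability": 0.9},
    {"alternatives": [{"transcript": "the faculty parents"}],
     "stability": 0.1},
]}

print(repr(stable_text(resp_217)))  # ''
print(repr(stable_text(resp_218)))  # 'members'
```

This hides the unstable tail rather than fixing it, so it trades display latency for consistency; it does not address the lag before words reach stability 0.9.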
