Environment:
- NeMo 1.12.0 (nemo:22.08)
- Riva 2.7.0
- T4
- English Conformer-CTC model + LM + neural VAD, all downloaded from the links in the Riva documentation.
The model was deployed with the following pipeline configuration:
riva-build speech_recognition \
<RMIR> <RIVA> <VAD> \
--name=<name> \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=40 \
--nn.use_onnx_runtime \
--nn.fp16_needs_obey_precision_pass \
--chunk_size=0.16 \
--left_padding_size=1.92 \
--right_padding_size=1.92 \
--decoder_type=flashlight \
--decoding_vocab=<vocab> \
--decoding_language_model_arpa=<arpa> \
--rescoring_language_model_carpa=<carpa> \
--flashlight_decoder.lm_weight=0.8 \
--flashlight_decoder.word_insertion_score=1 \
--flashlight_decoder.beam_threshold=20.0 \
--flashlight_decoder.beam_size=32 \
--flashlight_decoder.beam_size_token=10 \
--flashlight_decoder.num_tokenization=1 \
--flashlight_decoder.asr_model_delay=-1 \
--endpointing.residue_blanks_at_start=-2 \
--vad_type=neural \
--neural_vad_nn.optimization_graph_level=-1
The test file was a section of Bill Gates’ Harvard Commencement Speech.
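In the trace below, there is a noticeable discontinuity between the end of the segment that has reached stability 0.9 and the start of the segment with stability 0.1 (the boundary is marked with a pipe symbol in the following trace). For example, at idx 218 the word "of" is missing, and from idx 221 to idx 226 the word "faculty" is missing: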
...
idx: 214
response: {'results': [{'alternatives': [{'transcript': 'members of the faculty', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32920, 'endTime': 32960, 'word': 'of'}, {'startTime': 33080, 'endTime': 33120, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.239975}]}
members of the faculty
----
idx: 215
response: {'results': [{'alternatives': [{'transcript': 'members of the faculty', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32920, 'endTime': 32960, 'word': 'of'}, {'startTime': 33080, 'endTime': 33120, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.399975}]}
members of the faculty
----
idx: 216
response: {'results': [{'alternatives': [{'transcript': 'of the faculty parents', 'words': [{'startTime': 32920, 'endTime': 32960, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}, {'startTime': 34240, 'endTime': 34560, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.559975}]}
of the faculty parents
----
idx: 217
response: {'results': [{'alternatives': [{'transcript': 'of the faculty parents', 'words': [{'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}, {'startTime': 34280, 'endTime': 34640, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.719975}]}
of the faculty parents
----
idx: 218
response: {'results': [{'alternatives': [{'transcript': 'members ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 34.879974}, {'alternatives': [{'transcript': 'the faculty parents', 'words': [{'startTime': 33040, 'endTime': 33080, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}, {'startTime': 34280, 'endTime': 34680, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 34.879974}]}
members |the faculty parents
----
idx: 219
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.039974}, {'alternatives': [{'transcript': 'faculty parents', 'words': [{'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}, {'startTime': 34280, 'endTime': 34680, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.039974}]}
members of the |faculty parents
----
idx: 220
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.199974}, {'alternatives': [{'transcript': 'cult parents', 'words': [{'startTime': 33360, 'endTime': 33520, 'word': 'cult'}, {'startTime': 34280, 'endTime': 34680, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.199974}]}
members of the |cult parents
----
idx: 221
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.359974}, {'alternatives': [{'transcript': 'parents', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.359974}]}
members of the |parents
----
idx: 222
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.519974}, {'alternatives': [{'transcript': 'parents and', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35280, 'endTime': 35320, 'word': 'and'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.519974}]}
members of the |parents and
----
idx: 223
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.679974}, {'alternatives': [{'transcript': 'parents and especially', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35320, 'endTime': 35360, 'word': 'and'}, {'startTime': 35400, 'endTime': 35760, 'word': 'especially'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.679974}]}
members of the |parents and especially
----
idx: 224
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.839973}, {'alternatives': [{'transcript': 'parents and especially', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35320, 'endTime': 35360, 'word': 'and'}, {'startTime': 35440, 'endTime': 35880, 'word': 'especially'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.839973}]}
members of the |parents and especially
----
idx: 225
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 35.999973}, {'alternatives': [{'transcript': 'parents and especially the', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35320, 'endTime': 35360, 'word': 'and'}, {'startTime': 35480, 'endTime': 35920, 'word': 'especially'}, {'startTime': 36000, 'endTime': 36040, 'word': 'the'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 35.999973}]}
members of the |parents and especially the
----
idx: 226
response: {'results': [{'alternatives': [{'transcript': 'members of the ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 36.159973}, {'alternatives': [{'transcript': 'parents and especially the great', 'words': [{'startTime': 34280, 'endTime': 34680, 'word': 'parents'}, {'startTime': 35360, 'endTime': 35400, 'word': 'and'}, {'startTime': 35480, 'endTime': 35920, 'word': 'especially'}, {'startTime': 35960, 'endTime': 36000, 'word': 'the'}, {'startTime': 36000, 'endTime': 36160, 'word': 'great'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 36.159973}]}
members of the |parents and especially the great
----
idx: 227
response: {'results': [{'alternatives': [{'transcript': 'members of the faculty ', 'words': [{'startTime': 32560, 'endTime': 32880, 'word': 'members'}, {'startTime': 32960, 'endTime': 33000, 'word': 'of'}, {'startTime': 33040, 'endTime': 33080, 'word': 'the'}, {'startTime': 33200, 'endTime': 33600, 'word': 'faculty'}]}], 'stability': 0.9, 'channelTag': 1, 'audioProcessed': 36.319973}, {'alternatives': [{'transcript': 'and especially the gradu', 'words': [{'startTime': 35360, 'endTime': 35400, 'word': 'and'}, {'startTime': 35480, 'endTime': 35960, 'word': 'especially'}, {'startTime': 36000, 'endTime': 36040, 'word': 'the'}, {'startTime': 36080, 'endTime': 36320, 'word': 'gradu'}]}], 'stability': 0.1, 'channelTag': 1, 'audioProcessed': 36.319973}]}
members of the faculty |and especially the gradu
----
...
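The dropped words in the trace above can also be found programmatically. This is a minimal sketch, not part of Riva: the helper names (`split_words`, `dropped_words`) and the 0.5 stability threshold are my own assumptions, and the two responses are abbreviated copies of idx 217 and idx 218. It reports the words from the previous response that fall into the time gap between the last stable word and the first unstable word of the current response.

```python
def split_words(response, threshold=0.5):
    """Split the words of a response into stable and unstable lists.

    The threshold is an assumption; the trace only shows 0.1 and 0.9.
    """
    stable, unstable = [], []
    for result in response["results"]:
        words = result["alternatives"][0].get("words", [])
        target = stable if result.get("stability", 0.0) >= threshold else unstable
        target.extend(words)
    return stable, unstable


def dropped_words(previous, current, threshold=0.5):
    """Return words of `previous` lying in the gap between the stable
    and unstable parts of `current` (times as reported in the trace)."""
    stable, unstable = split_words(current, threshold)
    if not stable or not unstable:
        return []
    gap_start = stable[-1]["endTime"]
    gap_end = unstable[0]["startTime"]
    prev_words = [
        w for r in previous["results"] for w in r["alternatives"][0].get("words", [])
    ]
    return [
        w["word"]
        for w in prev_words
        if w["startTime"] >= gap_start and w["endTime"] <= gap_end
    ]


# Abbreviated copies of idx 217 and idx 218 from the trace.
idx_217 = {"results": [{"alternatives": [{"words": [
    {"startTime": 32960, "endTime": 33000, "word": "of"},
    {"startTime": 33040, "endTime": 33080, "word": "the"},
    {"startTime": 33200, "endTime": 33600, "word": "faculty"},
    {"startTime": 34280, "endTime": 34640, "word": "parents"}]}],
    "stability": 0.1}]}
idx_218 = {"results": [
    {"alternatives": [{"words": [
        {"startTime": 32560, "endTime": 32880, "word": "members"}]}],
     "stability": 0.9},
    {"alternatives": [{"words": [
        {"startTime": 33040, "endTime": 33080, "word": "the"},
        {"startTime": 33200, "endTime": 33600, "word": "faculty"},
        {"startTime": 34280, "endTime": 34680, "word": "parents"}]}],
     "stability": 0.1}]}

print(dropped_words(idx_217, idx_218))  # ['of']
```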
The discontinuity appears only with the flashlight decoder; with a greedy decoder there is no such problem. In fact, with a greedy decoder no stability change occurs at all: stability stays at 0.1 for the entire duration of the interim results.
How should I change the pipeline configuration parameters to (1) reduce or remove this discontinuity, so that no words are lost, and (2) decrease the latency of the 0.1/0.9 split? In the trace above, approximately the most recent 1.5 s of the transcript has stability 0.1, while older parts have stability 0.9. I would like to reduce this number as much as possible.
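For reference, this is roughly how the 1.5 s figure can be measured. It is a minimal sketch under my own assumptions: the helper name `unstable_span_seconds` and the 0.5 threshold are hypothetical, `startTime` is taken to be in milliseconds and `audioProcessed` in seconds (which matches the values in the trace), and the response is an abbreviated copy of idx 224.

```python
def unstable_span_seconds(response, threshold=0.5):
    """Length of audio (in seconds) covered by the low-stability transcript.

    Computed as audioProcessed minus the start of the first unstable word.
    Returns None if the response has no unstable words.
    """
    for result in response["results"]:
        if result.get("stability", 0.0) >= threshold:
            continue
        words = result["alternatives"][0].get("words", [])
        if words:
            # startTime assumed to be milliseconds, audioProcessed seconds.
            return result["audioProcessed"] - words[0]["startTime"] / 1000.0
    return None


# Abbreviated copy of idx 224 from the trace.
idx_224 = {"results": [
    {"alternatives": [{"words": [
        {"startTime": 32560, "endTime": 32880, "word": "members"},
        {"startTime": 32960, "endTime": 33000, "word": "of"},
        {"startTime": 33040, "endTime": 33080, "word": "the"}]}],
     "stability": 0.9, "audioProcessed": 35.839973},
    {"alternatives": [{"words": [
        {"startTime": 34280, "endTime": 34680, "word": "parents"},
        {"startTime": 35320, "endTime": 35360, "word": "and"},
        {"startTime": 35440, "endTime": 35880, "word": "especially"}]}],
     "stability": 0.1, "audioProcessed": 35.839973}]}

print(round(unstable_span_seconds(idx_224), 2))  # 1.56
```

Applied to idx 222 through idx 227, the same calculation gives values between roughly 1.0 s and 1.9 s.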