Does Canary not support live transcription/streaming?

Hardware - GPU (A10)
Hardware - CPU
Operating System: Ubuntu 22.04
Riva Version: 2.18.0
TLT Version (if relevant) - null
How to reproduce the issue? (This is for errors. Please share the command and the detailed log here)
After deploying the Canary 0.6b-turbo model in Riva and running the command below, I get:
python scripts/asr/riva_streaming_asr_client.py --model-name 'canary-0.6b-turbo-multi-asr-offline-asr-bls-ensemble' --input-file 'wz0YOwXoBMKKYtXE.wav' --language-code ar

Number of clients: 1
Number of iteration: 1
Input file: wz0YOwXoBMKKYtXE.wav
Exception in thread Thread-1 (streaming_transcription_worker):
Traceback (most recent call last):
  File "/home/mw/anaconda3/envs/riva_clients/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/home/mw/anaconda3/envs/riva_clients/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mw/RIVA-STT/riva_quickstart_v2.18.0/python-clients/python-clients/scripts/asr/riva_streaming_asr_client.py", line 87, in streaming_transcription_worker
    riva.client.print_streaming(
  File "/home/mw/anaconda3/envs/riva_clients/lib/python3.11/site-packages/riva/client/asr.py", line 246, in print_streaming
    for response in responses:
  File "/home/mw/anaconda3/envs/riva_clients/lib/python3.11/site-packages/riva/client/asr.py", line 393, in streaming_response_generator
    for response in self.stub.StreamingRecognize(generator, metadata=self.auth.get_auth_metadata()):
  File "/home/mw/anaconda3/envs/riva_clients/lib/python3.11/site-packages/grpc/_channel.py", line 543, in __next__
    return self._next()
           ^^^^^^^^^^^^
  File "/home/mw/anaconda3/envs/riva_clients/lib/python3.11/site-packages/grpc/_channel.py", line 969, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.INVALID_ARGUMENT
    details = "OfflineAsrEnsemble expects both start and end flags to be 1"
    debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:50051 {grpc_message:"OfflineAsrEnsemble expects both start and end flags to be 1", grpc_status:3, created_time:"2025-01-22T09:17:41.796966261+03:00"}"
>

Traceback (most recent call last):
  File "/home/mw/RIVA-STT/riva_quickstart_v2.18.0/python-clients/python-clients/scripts/asr/riva_streaming_asr_client.py", line 133, in <module>
    main()
  File "/home/mw/RIVA-STT/riva_quickstart_v2.18.0/python-clients/python-clients/scripts/asr/riva_streaming_asr_client.py", line 119, in main
    raise RuntimeError(f"A thread with index {thread_i} failed with error:\n{exc}")
RuntimeError: A thread with index 0 failed with error:
<_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.INVALID_ARGUMENT
    details = "OfflineAsrEnsemble expects both start and end flags to be 1"
    debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:50051 {grpc_message:"OfflineAsrEnsemble expects both start and end flags to be 1", grpc_status:3, created_time:"2025-01-22T09:17:41.796966261+03:00"}"
>

For clarification, this is what the loading logs say:
[TensorRT-LLM] TensorRT-LLM version: 0.10.0
2025-01-22 11:40:39,168 [INFO] Writing Riva model repository to '/data/models'...
2025-01-22 11:40:39,168 [INFO] The riva model repo target directory is /data/models
model.config=AsrNetConfig(model_family='riva', language_code='en-US,en-GB,es-ES,ar-AR,es-US,pt-BR,fr-FR,de-DE,it-IT,ja-JP,ko-KR,ru-RU,hi-IN', instance_group_count=1, kind='CPU', max_batch_size=8, max_queue_delay_microseconds=1000, batching_type='none', pipeline_name='speech_recognition', acoustic_model_name=None, featurizer_name=None, name='canary-0.6b-turbo-multi-asr-offline', streaming=True, offline=True, vad_type='none', unified_acoustic_model=False, endpointing_type='greedy_ctc', chunk_size=30.0, type='offline', padding_factor=None, left_padding_size=0.0, right_padding_size=0.0, padding_size=None, max_supported_transcripts='1', ms_per_timestep=20, force_decoder_reset_after_ms=-1, lattice_beam='5', decoding_language_model_arpa='', decoding_language_model_binary='', decoding_language_model_fst='', decoding_language_model_words='', rescoring_language_model_arpa='', decoding_language_model_carpa='', rescoring_language_model_carpa='', decoding_lexicon='', decoding_vocab='', tokenizer_model='', decoder_type='nemo', stddev_floor=1e-05, wfst_tokenizer_model='', wfst_verbalizer_model='', wfst_pre_process_model='', wfst_post_process_model='', speech_hints_model='', wfst_model_dir='', buffer_look_ahead=0, buffer_context_history=0, buffer_threshold=0, buffer_max_timeout_frames=0, profane_words_file='', append_space_to_transcripts='True', return_separate_utterances='True', cased=False, endpointing_num_worker_threads=16, audio_processing_num_worker_threads=8, enable_punctuation=True, mel_basis_file_path='/tmp/tmpib4sj8s9/16khz_mel_basis.npy', feature_extractor_type='torch', torch_feature_type='nemo', torch_feature_device='cuda', execution_environment_path=None, share_flags=False, n_fft=512, num_features=128, window_stride=0.01, window_size=0.025, vad_num_features=None, vad_window_stride=None, vad_window_size=None, vad_sample_rate=None, use_subword='False', vocab_file='/tmp/tmpmzqumpgf/riva_decoder_vocabulary.txt', config_yaml='', acoustic_model_class='nemo.collections.asr.models.aed_multitask_models.EncDecMultiTaskModel', lm_class='', vad_nn_class=None, lm_binary=False, num_encoded_features=None, num_char_classes=1, num_input_samples=3001, sample_rate=16000)
2025-01-22 11:41:33,211 [INFO] Extract_binaries for nn -> /data/models/riva-nemo-canary-0.6b-turbo-multi-asr-offline-am-streaming-offline/1
2025-01-22 11:41:33,212 [INFO] extracting {'nemo': ('nemo.collections.asr.models.aed_multitask_models.EncDecMultiTaskModel', 'model_graph.nemo')} -> /data/models/riva-nemo-canary-0.6b-turbo-multi-asr-offline-am-streaming-offline/1
2025-01-22 11:41:34,748 [INFO] Extract_binaries for asr_ensemble_backend -> /data/models/canary-0.6b-turbo-multi-asr-offline-asr-bls-ensemble/1
2025-01-22 11:41:34,748 [INFO] extracting {'mel_basis_file_path': '/tmp/tmpib4sj8s9/16khz_mel_basis.npy', 'vocab_file': '/tmp/tmpmzqumpgf/riva_decoder_vocabulary.txt'} -> /data/models/canary-0.6b-turbo-multi-asr-offline-asr-bls-ensemble/1
2025-01-22 11:41:34,749 [INFO] {'mel_basis_file_path': '/data/models/canary-0.6b-turbo-multi-asr-offline-asr-bls-ensemble/1/16khz_mel_basis.npy', 'vocab_file': '/data/models/canary-0.6b-turbo-multi-asr-offline-asr-bls-ensemble/1/riva_decoder_vocabulary.txt'}

+ '[' 0 -ne 0 ']'
+ [[ non-tegra == \t\e\g\r\a ]]
+ echo
+ echo 'Riva initialization complete. Run ./riva_start.sh to launch services.'
Riva initialization complete. Run ./riva_start.sh to launch services.

Note the streaming=True flag in the config dump above. Are streaming and live transcription two different things, or am I doing something wrong?

Canary is an encoder-decoder model designed for offline processing. The transformer decoder typically requires the full input sequence, so the model processes the entire audio input, similar to Whisper; this is why the offline ensemble rejects streaming requests with "OfflineAsrEnsemble expects both start and end flags to be 1", i.e. it requires the whole utterance in a single request. It supports inference on audio up to 40 seconds long. Long-form audio can be transcribed in chunks of 40 seconds. See script:
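Since the referenced script is not reproduced in this thread, here is a minimal sketch of 40-second chunked inference through Riva's offline API, assuming the standard nvidia-riva-client Python package. The model name, input file, and language code are taken from the command earlier in the thread; the audio is assumed to be 16-bit PCM WAV.

# Minimal sketch: chunked offline inference with nvidia-riva-client.
# Assumption: 16-bit PCM WAV input; model name and language code come
# from the deployment shown above.
import wave

import riva.client

auth = riva.client.Auth(uri="localhost:50051")
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    language_code="ar-AR",
    max_alternatives=1,
    model="canary-0.6b-turbo-multi-asr-offline-asr-bls-ensemble",
)

CHUNK_SECONDS = 40  # Canary's maximum supported utterance length

with wave.open("wz0YOwXoBMKKYtXE.wav", "rb") as wav:
    config.sample_rate_hertz = wav.getframerate()
    config.audio_channel_count = wav.getnchannels()
    while True:
        frames = wav.readframes(CHUNK_SECONDS * wav.getframerate())
        if not frames:
            break
        # Each chunk is a complete request, so the offline ensemble
        # sees both start and end flags set, as it requires.
        response = asr_service.offline_recognize(frames, config)
        for result in response.results:
            print(result.alternatives[0].transcript)

Because every offline_recognize call carries a complete utterance, this avoids the INVALID_ARGUMENT error that the streaming client triggers against the offline ensemble.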

Parakeet models are designed with streaming capabilities in mind. They use a FastConformer encoder with various decoder options (CTC, RNN-T, or TDT). Parakeet supports both offline and streaming inference; techniques like cache-aware streaming and a limited right context enable the streaming use cases. A streaming request against such a model looks like the sketch below.
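For comparison, this is a minimal streaming sketch with nvidia-riva-client against a streaming-capable deployment (for example, a Parakeet-based pipeline). The server address and "audio.wav" input file are placeholders; the model is whatever streaming pipeline the server has deployed.

# Minimal sketch: streaming recognition with nvidia-riva-client.
# Assumption: a streaming-capable ASR pipeline is deployed on the server.
import riva.client

auth = riva.client.Auth(uri="localhost:50051")
asr_service = riva.client.ASRService(auth)

streaming_config = riva.client.StreamingRecognitionConfig(
    config=riva.client.RecognitionConfig(
        encoding=riva.client.AudioEncoding.LINEAR_PCM,
        language_code="en-US",
        sample_rate_hertz=16000,
        max_alternatives=1,
    ),
    interim_results=True,  # emit partial hypotheses while audio streams in
)

# Feed the file in small chunks to mimic a live audio source.
with riva.client.AudioChunkFileIterator("audio.wav", 1600) as audio_chunks:
    riva.client.print_streaming(
        responses=asr_service.streaming_response_generator(
            audio_chunks=audio_chunks,
            streaming_config=streaming_config,
        ),
        show_intermediate=True,
    )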

Future research and development may allow Canary to support streaming mode as well.
