Does Canary not support live transcription/streaming?

Hardware - GPU (A10)
Hardware - CPU
Operating System: Ubuntu 22.04
Riva Version: 2.18.0
TLT Version (if relevant) - null
How to reproduce the issue? (This is for errors. Please share the command and the detailed log here)
After deploying the Canary 0.6b-turbo model in Riva and running the command below, I get:
python scripts/asr/riva_streaming_asr_client.py --model-name 'canary-0.6b-turbo-multi-asr-offline-asr-bls-ensemble' --input-file 'wz0YOwXoBMKKYtXE.wav' --language-code ar

Number of clients: 1
Number of iteration: 1
Input file: wz0YOwXoBMKKYtXE.wav
Exception in thread Thread-1 (streaming_transcription_worker):
Traceback (most recent call last):
  File "/home/mw/anaconda3/envs/riva_clients/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/home/mw/anaconda3/envs/riva_clients/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mw/RIVA-STT/riva_quickstart_v2.18.0/python-clients/python-clients/scripts/asr/riva_streaming_asr_client.py", line 87, in streaming_transcription_worker
    riva.client.print_streaming(
  File "/home/mw/anaconda3/envs/riva_clients/lib/python3.11/site-packages/riva/client/asr.py", line 246, in print_streaming
    for response in responses:
  File "/home/mw/anaconda3/envs/riva_clients/lib/python3.11/site-packages/riva/client/asr.py", line 393, in streaming_response_generator
    for response in self.stub.StreamingRecognize(generator, metadata=self.auth.get_auth_metadata()):
  File "/home/mw/anaconda3/envs/riva_clients/lib/python3.11/site-packages/grpc/_channel.py", line 543, in __next__
    return self._next()
           ^^^^^^^^^^^^
  File "/home/mw/anaconda3/envs/riva_clients/lib/python3.11/site-packages/grpc/_channel.py", line 969, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.INVALID_ARGUMENT
    details = "OfflineAsrEnsemble expects both start and end flags to be 1"
    debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:50051 {grpc_message:"OfflineAsrEnsemble expects both start and end flags to be 1", grpc_status:3, created_time:"2025-01-22T09:17:41.796966261+03:00"}"
>

Traceback (most recent call last):
  File "/home/mw/RIVA-STT/riva_quickstart_v2.18.0/python-clients/python-clients/scripts/asr/riva_streaming_asr_client.py", line 133, in <module>
    main()
  File "/home/mw/RIVA-STT/riva_quickstart_v2.18.0/python-clients/python-clients/scripts/asr/riva_streaming_asr_client.py", line 119, in main
    raise RuntimeError(f"A thread with index {thread_i} failed with error:\n{exc}")
RuntimeError: A thread with index 0 failed with error:
<_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.INVALID_ARGUMENT
    details = "OfflineAsrEnsemble expects both start and end flags to be 1"
    debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:50051 {grpc_message:"OfflineAsrEnsemble expects both start and end flags to be 1", grpc_status:3, created_time:"2025-01-22T09:17:41.796966261+03:00"}"
>

For clarification, this is what the loading logs say:
[TensorRT-LLM] TensorRT-LLM version: 0.10.0
2025-01-22 11:40:39,168 [INFO] Writing Riva model repository to '/data/models'...
2025-01-22 11:40:39,168 [INFO] The riva model repo target directory is /data/models
model.config=AsrNetConfig(model_family='riva', language_code='en-US,en-GB,es-ES,ar-AR,es-US,pt-BR,fr-FR,de-DE,it-IT,ja-JP,ko-KR,ru-RU,hi-IN', instance_group_count=1, kind='CPU', max_batch_size=8, max_queue_delay_microseconds=1000, batching_type='none', pipeline_name='speech_recognition', acoustic_model_name=None, featurizer_name=None, name='canary-0.6b-turbo-multi-asr-offline', streaming=True, offline=True, vad_type='none', unified_acoustic_model=False, endpointing_type='greedy_ctc', chunk_size=30.0, type='offline', padding_factor=None, left_padding_size=0.0, right_padding_size=0.0, padding_size=None, max_supported_transcripts='1', ms_per_timestep=20, force_decoder_reset_after_ms=-1, lattice_beam='5', decoding_language_model_arpa='', decoding_language_model_binary='', decoding_language_model_fst='', decoding_language_model_words='', rescoring_language_model_arpa='', decoding_language_model_carpa='', rescoring_language_model_carpa='', decoding_lexicon='', decoding_vocab='', tokenizer_model='', decoder_type='nemo', stddev_floor=1e-05, wfst_tokenizer_model='', wfst_verbalizer_model='', wfst_pre_process_model='', wfst_post_process_model='', speech_hints_model='', wfst_model_dir='', buffer_look_ahead=0, buffer_context_history=0, buffer_threshold=0, buffer_max_timeout_frames=0, profane_words_file='', append_space_to_transcripts='True', return_separate_utterances='True', cased=False, endpointing_num_worker_threads=16, audio_processing_num_worker_threads=8, enable_punctuation=True, mel_basis_file_path='/tmp/tmpib4sj8s9/16khz_mel_basis.npy', feature_extractor_type='torch', torch_feature_type='nemo', torch_feature_device='cuda', execution_environment_path=None, share_flags=False, n_fft=512, num_features=128, window_stride=0.01, window_size=0.025, vad_num_features=None, vad_window_stride=None, vad_window_size=None, vad_sample_rate=None, use_subword='False', vocab_file='/tmp/tmpmzqumpgf/riva_decoder_vocabulary.txt', config_yaml='', acoustic_model_class='nemo.collections.asr.models.aed_multitask_models.EncDecMultiTaskModel', lm_class='', vad_nn_class=None, lm_binary=False, num_encoded_features=None, num_char_classes=1, num_input_samples=3001, sample_rate=16000)
2025-01-22 11:41:33,211 [INFO] Extract_binaries for nn -> /data/models/riva-nemo-canary-0.6b-turbo-multi-asr-offline-am-streaming-offline/1
2025-01-22 11:41:33,212 [INFO] extracting {'nemo': ('nemo.collections.asr.models.aed_multitask_models.EncDecMultiTaskModel', 'model_graph.nemo')} -> /data/models/riva-nemo-canary-0.6b-turbo-multi-asr-offline-am-streaming-offline/1
2025-01-22 11:41:34,748 [INFO] Extract_binaries for asr_ensemble_backend -> /data/models/canary-0.6b-turbo-multi-asr-offline-asr-bls-ensemble/1
2025-01-22 11:41:34,748 [INFO] extracting {'mel_basis_file_path': '/tmp/tmpib4sj8s9/16khz_mel_basis.npy', 'vocab_file': '/tmp/tmpmzqumpgf/riva_decoder_vocabulary.txt'} -> /data/models/canary-0.6b-turbo-multi-asr-offline-asr-bls-ensemble/1
2025-01-22 11:41:34,749 [INFO] {'mel_basis_file_path': '/data/models/canary-0.6b-turbo-multi-asr-offline-asr-bls-ensemble/1/16khz_mel_basis.npy', 'vocab_file': '/data/models/canary-0.6b-turbo-multi-asr-offline-asr-bls-ensemble/1/riva_decoder_vocabulary.txt'}

+ '[' 0 -ne 0 ']'
+ [[ non-tegra == \t\e\g\r\a ]]
+ echo
+ echo 'Riva initialization complete. Run ./riva_start.sh to launch services.'
Riva initialization complete. Run ./riva_start.sh to launch services.

Note the streaming=True flag in the config dump above. Are streaming and live transcription two different things, or am I doing something wrong?

Canary is an encoder-decoder model designed for offline processing. The transformer decoder typically requires the full input sequence, so the model processes the entire audio input, similar to Whisper; this is why the offline ensemble rejects streaming requests with "OfflineAsrEnsemble expects both start and end flags to be 1", i.e. it requires the whole utterance in a single request. It supports inference on audio up to 40 seconds long. Long-form audio can be transcribed in chunks of 40 seconds. See script:
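Since the referenced script is not reproduced in this thread, here is a minimal sketch of 40-second chunked inference through Riva's offline API, assuming the standard nvidia-riva-client Python package. The model name, input file, and language code are taken from the command earlier in the thread; the audio is assumed to be 16-bit PCM WAV.

# Minimal sketch: chunked offline inference with nvidia-riva-client.
# Assumption: 16-bit PCM WAV input; model name and language code come
# from the deployment shown above.
import wave

import riva.client

auth = riva.client.Auth(uri="localhost:50051")
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    language_code="ar-AR",
    max_alternatives=1,
    model="canary-0.6b-turbo-multi-asr-offline-asr-bls-ensemble",
)

CHUNK_SECONDS = 40  # Canary's maximum supported utterance length

with wave.open("wz0YOwXoBMKKYtXE.wav", "rb") as wav:
    config.sample_rate_hertz = wav.getframerate()
    config.audio_channel_count = wav.getnchannels()
    while True:
        frames = wav.readframes(CHUNK_SECONDS * wav.getframerate())
        if not frames:
            break
        # Each chunk is a complete request, so the offline ensemble
        # sees both start and end flags set, as it requires.
        response = asr_service.offline_recognize(frames, config)
        for result in response.results:
            print(result.alternatives[0].transcript)

Because every offline_recognize call carries a complete utterance, this avoids the INVALID_ARGUMENT error that the streaming client triggers against the offline ensemble.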

Parakeet models are designed with streaming capabilities in mind. They use a FastConformer encoder with various decoder options (CTC, RNN-T, or TDT). Parakeet supports both offline and streaming inference; techniques like cache-aware streaming and a limited right context enable the streaming use cases. A streaming request against such a model looks like the sketch below.
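For comparison, this is a minimal streaming sketch with nvidia-riva-client against a streaming-capable deployment (for example, a Parakeet-based pipeline). The server address and "audio.wav" input file are placeholders; the model is whatever streaming pipeline the server has deployed.

# Minimal sketch: streaming recognition with nvidia-riva-client.
# Assumption: a streaming-capable ASR pipeline is deployed on the server.
import riva.client

auth = riva.client.Auth(uri="localhost:50051")
asr_service = riva.client.ASRService(auth)

streaming_config = riva.client.StreamingRecognitionConfig(
    config=riva.client.RecognitionConfig(
        encoding=riva.client.AudioEncoding.LINEAR_PCM,
        language_code="en-US",
        sample_rate_hertz=16000,
        max_alternatives=1,
    ),
    interim_results=True,  # emit partial hypotheses while audio streams in
)

# Feed the file in small chunks to mimic a live audio source.
with riva.client.AudioChunkFileIterator("audio.wav", 1600) as audio_chunks:
    riva.client.print_streaming(
        responses=asr_service.streaming_response_generator(
            audio_chunks=audio_chunks,
            streaming_config=streaming_config,
        ),
        show_intermediate=True,
    )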

Future research and development may allow Canary to support streaming mode as well.
