Unless this is currently a known limitation of the punctuation model (I went through the 1.10 release notes), I believe this is unexpected behavior.
Hardware - GPU (T4 AWS EC2)
Riva Version v1.10b
How to reproduce the issue?
- In my case, I start by building a Riva Citrinet offline pipeline like so:
riva-build speech_recognition \
"citrinet-1024-true-offline.rmir:tlt_encode" "citrinet-1024-Jarvis-asrset-3_0-encrypted.riva:tlt_encode" \
--offline --nn.trt_max_workspace_size=14000000000 \
--name=citrinet-1024-english-asr-offline \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--chunk_size=2700 \
--left_padding_size=0. \
--right_padding_size=0. \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=riva_asr_train_datasets_3gram.binary \
--decoding_vocab=flashlight_decoder_vocab.txt \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=en-US
- Run
riva_asr_client --audio_file=wav/10minutes.wav -output_filename=out.txt
in the riva-client image
- See out.txt
Expected output: Complete transcript
Observed output:
Run time: 5.4236 sec.
Total audio processed: 2358.6 sec.
Throughput: 434.88 RTFX
Final transcripts written to out.txt
root@ip-172-31-7-237:/work/examples# cat out.txt
{"audio_filepath": "/work/examples/wav/craig-full-16k.wav","text": "Believe it's recording now. Okay? sorry, back to sharing the desktop. Okay, so I'd like to just get to know a little bit about yourself. Like what is it that you do or what do you focus on? What are you passionate about? It doesn't have to be a long answer. Anything that you're comfortable sharing? Yeah, sure thing, and I appreciate you asking. So Mylo? we're about a year old. I was working with one co founder who had the idea What Mylo does is we allow you to create and share our processes seamlessly across the Internet, right, So what we're seeking to do is replace cases and "}
Clearly all the audio is processed (Total audio processed: 2358.6 sec.), but after punctuation the pipeline outputs a transcript that is cut very short.
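To put a number on the truncation, here is a minimal sketch that counts the words in each transcript and relates them to the audio length. It assumes out.txt holds one JSON object per line with "audio_filepath" and "text" fields, as in the output above. The function name transcript_stats is just a hypothetical helper, and the audio duration is passed in by hand from the "Total audio processed" line.

```python
import json


def transcript_stats(jsonl_text, audio_seconds):
    """Return a list of (audio_filepath, word_count, words_per_minute).

    Assumes one JSON object per non-empty line, each with "audio_filepath"
    and "text" keys (the format seen in out.txt above). audio_seconds is
    supplied by the caller; it is not read from the file.
    """
    stats = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        word_count = len(record["text"].split())
        words_per_minute = word_count / (audio_seconds / 60.0)
        stats.append((record["audio_filepath"], word_count, words_per_minute))
    return stats


if __name__ == "__main__":
    # 2358.6 is the "Total audio processed" value reported by riva_asr_client.
    with open("out.txt", encoding="utf-8") as f:
        for path, words, wpm in transcript_stats(f.read(), 2358.6):
            print(f"{path}: {words} words, {wpm:.1f} words/min")
```

A words-per-minute figure far below normal conversational speech for the full duration would confirm that most of the transcript is missing rather than the audio being mostly silence.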
This practically breaks our use case for Riva/offline ASR. Any workarounds to get a usable offline recognition pipeline for longer audio would be appreciated!
Do let me know if any more details are needed.