Rebuilding the asrset3 citrinet offline pipeline but with a larger chunk size

g4dn.xlarge T4 16GB
Riva v1.8b

Follow up to: Offline/Batch broken on 1.8b due to 900s limit - #3 by rleary

Hi @rleary, thank you so much, and I really appreciate the discussion.

Regarding reproducing the model: I initially tried to reproduce the rmir_asr_citrinet_1024_asrset3p0_offline model included with Riva 1.7b so that I could use the previous streaming-offline mode. However, as I understand it, the offline recognition batching process in Riva has been updated, and simply using the older streaming-offline ensembles won't restore the previous offline functionality. When I try to use the previous pipeline, it returns an error stating that the maximum input audio length is 15 seconds, and it does not behave like the previous offline pipeline, with which I could transcribe large files with a single Recognize call.

Now my aim is to rebuild the v1.8 citrinet-offline pipeline but with a chunk size of 7200.

I followed the steps from the docs to reproduce, but have a few queries regarding some artifacts.

The build command I used:

riva-build speech_recognition \
   Citrinet-1024-true-offline.rmir:tlt_encode Citrinet-1024-Jarvis-ASRSet-3_0-encrypted.riva:tlt_encode \
   --offline \
   --name=citrinet-1024-english-asr-true-offline \
   --ms_per_timestep=80 \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --chunk_size=7200 \
   --left_padding_size=0. \
   --right_padding_size=0. \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=-1 \
   --decoding_language_model_binary=jarvis_asr_train_datasets_noSpgi_noLS_gt_3gram.binary \
   --decoding_vocab=lexicon.txt \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20.
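
As a quick sanity check on what these values imply, here is a small sketch of the arithmetic. It assumes that --chunk_size is given in seconds (which the 900s limit mentioned above suggests) and that --ms_per_timestep is the duration of one acoustic model output step; both are my reading of the parameters, not something I have confirmed.

```shell
# Assumption: chunk_size is in seconds, ms_per_timestep in milliseconds.
chunk_size_s=7200
ms_per_timestep=80

# Number of model output timesteps covered by one chunk.
timesteps=$(( chunk_size_s * 1000 / ms_per_timestep ))

echo "chunk of ${chunk_size_s}s = $(( chunk_size_s / 60 )) min = ${timesteps} timesteps"
```

Under those assumptions, a single chunk covers two hours of audio, so GPU memory on the T4 is likely to be the practical limit on whether a chunk this large is usable.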

My queries are about the decoding_vocab and decoding_language_model_binary params. What should they be set to in order to recreate the prebuilt rmirs?
Here, I ended up pulling lexicon.txt and jarvis_asr_train_datasets_noSpgi_noLS_gt_3gram.binary from /data/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-streaming-offline/1/. Are these available on NGC so that the prebuilt rmirs can be recreated easily?
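
For reference, this is roughly how I copied the two files out of the deployed model directory so that riva-build could find them. It is only a sketch: the helper name copy_decoder_artifacts is my own, and the source path is simply where the files sit in my deployment.

```shell
# Hypothetical helper: copy the decoder vocabulary and language model
# from a deployed model version directory into a working directory.
copy_decoder_artifacts() {
    src_dir="$1"    # e.g. the .../ctc-decoder-cpu-streaming-offline/1 directory
    dest_dir="$2"   # directory where riva-build will be run
    mkdir -p "$dest_dir" || return 1
    for f in lexicon.txt jarvis_asr_train_datasets_noSpgi_noLS_gt_3gram.binary; do
        if [ ! -f "$src_dir/$f" ]; then
            echo "missing: $src_dir/$f" >&2
            return 1
        fi
        cp "$src_dir/$f" "$dest_dir/" || return 1
    done
}

# Example call, using the path from my deployment:
# copy_decoder_artifacts /data/models/citrinet-1024-en-US-asr-offline-ctc-decoder-cpu-streaming-offline/1 .
```

The function stops at the first missing file, so a partial copy is not mistaken for a complete set of artifacts.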

For the .riva file, I pulled it from the TAO model card: Speech to Text English Citrinet | NVIDIA NGC

I chose the deployable_v3.0 version of the .riva file: Citrinet-1024-Jarvis-ASRSet-3_0-encrypted.riva. Is there a NeMo pretrained model I can export to get this .riva?

Thanks for your help with this build. I simply want to be able to reproduce the provided rmirs and build on the pretrained models/ensembles.

Hi @shantanu1 ,

Thanks for reaching out. There may be a delay in response time with many folks on holiday this week.

@rleary to help follow up