Final transcript is empty on streaming mode

GPU - RTX 2080 Ti
Operating System: Ubuntu 18.04
Riva Version: 2.7.0

How to reproduce the issue?

Hi! I have a NeMo model (Citrinet 512) and I followed the Riva Overview to deploy it to Riva. The steps I took:

  1. start servicemaker
  2. nemo2riva
  3. riva-build
riva-build speech_recognition \
    /servicemaker-dev/asr_online_beam_model_experiment.rmir:tlt_encode \
    /servicemaker-dev/asr_0_1.riva:tlt_encode \
    --streaming=True \
    --name=citrinet-en-US-asr-streaming \
    --decoder_type=flashlight \
    --decoding_language_model_binary=/servicemaker-dev/lm.binary \
    --decoding_vocab=/servicemaker-dev/vocab.txt \
    --language_code=en-US
  4. deploy with
Waiting for Riva server to load all models...retrying in 10 seconds
Riva server is ready...
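(The deploy command itself isn't shown above; for completeness, a riva-deploy invocation along these lines is what I used — the RMIR path and key come from the riva-build step above, while the /data/models target is the default model repository path and is an assumption here:)

```shell
# Hypothetical deploy command (step 4): uses the RMIR/key from the riva-build
# step above and the default /data/models model repository path.
RMIR=/servicemaker-dev/asr_online_beam_model_experiment.rmir
KEY=tlt_encode
echo "riva-deploy ${RMIR}:${KEY} /data/models"
```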

For offline inference I get a good transcript, but when I try online (streaming) inference I only get an empty transcript.

riva_streaming_asr_client --audio_file httpswwwyoutubecomwatchvarOnNqlWGE8_chunk001.wav --language_code=en-US
I1202 01:40:22.248736   129] Using Insecure Server Credentials
Loading eval dataset...
filename: /opt/riva/httpswwwyoutubecomwatchvarOnNqlWGE8_chunk001.wav
Done loading 1 files
File: /opt/riva/httpswwwyoutubecomwatchvarOnNqlWGE8_chunk001.wav

Final transcripts: 

Audio processed: -7.37696e+37 sec.

Not printing latency statistics because the client is run without the --simulate_realtime option and/or the number of requests sent is not equal to number of requests received. To get latency statistics, run with --simulate_realtime and set the --chunk_duration_ms to be the same as the server chunk duration
Run time: 0.156079 sec.
Total audio processed: 6.421 sec.
Throughput: 41.1393 RTFX
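As a side note, the run statistics in that output are internally consistent even though the transcript is empty: RTFX is just the seconds of audio processed divided by the wall-clock run time, which you can check quickly:

```shell
# RTFX = audio duration / wall-clock run time,
# using the numbers from the client output above.
AUDIO_SEC=6.421
RUN_SEC=0.156079
RTFX=$(awk -v a="$AUDIO_SEC" -v r="$RUN_SEC" 'BEGIN { printf "%.3f", a / r }')
echo "$RTFX"   # ~41.139, matching the reported 41.1393 RTFX
```

So the client did stream and time the audio correctly; the pipeline simply produced no final text.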

And this is from the server:

I1202 01:40:22.251299   189] ASRService.StreamingRecognize called.
I1202 01:40:22.251327   189] ASRService.StreamingRecognize performing streaming recognition with sequence id: 1375652540
I1202 01:40:22.251346   189] Using model citrinet-en-US-asr-streaming for inference
I1202 01:40:22.251386   189] Model sample rate= 16000 for inference
I1202 01:40:22.251515   189] Detected format: encoding = 1 numchannels = 1 samplerate = 16000 bitspersample = 16
I1202 01:40:22.406829   189] ASRService.StreamingRecognize returning OK

I already tried using --nn.use_trt_fp32, but I got an error because a layer uses fp16:

Error Code 4: Internal Error (fp16 precision has been set for a layer or layer output, but fp16 is not configured in the builder)
[12/02/2022-03:37:22] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )

So I removed --nn.use_trt_fp32; my model deployed and the server started successfully, but I still get an empty transcript. How can I solve this empty transcript problem?

Hi @junko_ran

Thanks for your interest in Riva, and thanks for sharing the details.

In addition, could you kindly provide the following:

  1. NGC link to the NeMo model used


Hi @rvinobha !
Sorry for the late reply; my problem is solved!
What I did was change the riva-build arguments to the ones for Citrinet-256 streaming on this pipeline page.
Thanks for the response!

Hi, can you share the command you eventually used?

Hi @emailmeemilyong,
here you go!

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=citrinet-512-en-US-asr-streaming \
   --ms_per_timestep=80 \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --endpointing.residue_blanks_at_start=-2 \
   --chunk_size=0.16 \
   --left_padding_size=1.92 \
   --right_padding_size=1.92 \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=-1 \
   --decoding_language_model_binary=<lm_binary> \
   --decoding_vocab=<decoder_vocab_file> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --flashlight_decoder.num_tokenization=1 \
   --max_batch_size=1 \
   --featurizer.max_execution_batch_size=1 \
   --endpointing.max_batch_size=1 \
   --nn.opt_batch_size=1 \
   --flashlight_decoder.max_execution_batch_size=1

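For anyone mapping these flags to the streaming behaviour, my reading (an interpretation, not taken verbatim from the Riva docs) is that each streaming step decodes a 0.16 s chunk with 1.92 s of acoustic context on each side, and --ms_per_timestep=80 matches Citrinet's 8x time subsampling of 10 ms frames:

```shell
# Rough streaming-geometry check for the riva-build flags above
# (my interpretation of the flags, not authoritative).
CHUNK_MS=160            # --chunk_size=0.16 s
PAD_MS=1920             # --left_padding_size / --right_padding_size = 1.92 s each
MS_PER_TIMESTEP=80      # --ms_per_timestep=80 (8x subsampling of 10 ms frames)
WINDOW_MS=$((CHUNK_MS + 2 * PAD_MS))          # total acoustic window per request
STEPS_PER_CHUNK=$((CHUNK_MS / MS_PER_TIMESTEP))  # decoder timesteps per chunk
echo "window=${WINDOW_MS}ms steps_per_chunk=${STEPS_PER_CHUNK}"
```

With these values the acoustic window is 4.0 s per request and each chunk contributes 2 decoder timesteps, which is why getting chunk_size and ms_per_timestep consistent with the model matters for streaming output.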