Final transcript is empty on streaming mode

GPU - RTX 2080 Ti
Operating System: Ubuntu 18.04
Riva Version: 2.7.0

How to reproduce the issue?

Hi! I have a NeMo model (Citrinet-512) and followed the Riva overview documentation to deploy it. The steps I took:

  1. Start the ServiceMaker container
  2. Run nemo2riva
  3. Run riva-build:
riva-build speech_recognition \
/servicemaker-dev/asr_online_beam_model_experiment.rmir:tlt_encode \
/servicemaker-dev/asr_0_1.riva:tlt_encode \
--streaming=True \
--name=citrinet-en-US-asr-streaming \
--decoder_type=flashlight \
--decoding_language_model_binary=/servicemaker-dev/lm.binary \
--decoding_vocab=/servicemaker-dev/vocab.txt \
--language_code=en-US \
--nn.fp16_needs_obey_precision_pass
  4. Deploy with riva_init.sh
  5. Run riva_start.sh
Waiting for Riva server to load all models...retrying in 10 seconds
Riva server is ready...
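(As a generic sanity check, independent of Riva's own readiness message, it can help to confirm something is actually listening on the gRPC endpoint. This is a plain TCP probe sketch; the default port 50051 is an assumption, check your config.sh.)

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Riva's gRPC endpoint commonly defaults to 50051 (assumption)
print(port_open("localhost", 50051))
```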

For offline inference I get a good transcript, but online (streaming) inference returns only an empty transcript.

riva_streaming_asr_client --audio_file httpswwwyoutubecomwatchvarOnNqlWGE8_chunk001.wav --language_code=en-US
I1202 01:40:22.248736   129 riva_streaming_asr_client.cc:154] Using Insecure Server Credentials
Loading eval dataset...
filename: /opt/riva/httpswwwyoutubecomwatchvarOnNqlWGE8_chunk001.wav
Done loading 1 files
-----------------------------------------------------------
File: /opt/riva/httpswwwyoutubecomwatchvarOnNqlWGE8_chunk001.wav

Final transcripts: 

Audio processed: -7.37696e+37 sec.
-----------------------------------------------------------

Not printing latency statistics because the client is run without the --simulate_realtime option and/or the number of requests sent is not equal to number of requests received. To get latency statistics, run with --simulate_realtime and set the --chunk_duration_ms to be the same as the server chunk duration
Run time: 0.156079 sec.
Total audio processed: 6.421 sec.
Throughput: 41.1393 RTFX
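Empty streaming results can sometimes trace back to an audio format mismatch, so it is worth confirming the file really matches what the server log reports (16 kHz, mono, 16-bit PCM). A stdlib-only sketch for inspecting a WAV header (the path below is just a placeholder):

```python
import wave

def wav_format(path: str) -> dict:
    """Read basic PCM parameters from a WAV file header."""
    with wave.open(path, "rb") as w:
        return {
            "channels": w.getnchannels(),
            "sample_rate": w.getframerate(),
            "bits_per_sample": w.getsampwidth() * 8,
            "duration_sec": w.getnframes() / w.getframerate(),
        }

# Expected for this streaming setup: 1 channel, 16000 Hz, 16-bit
# print(wav_format("chunk001.wav"))
```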

and this is from the server

I1202 01:40:22.251299   189 grpc_riva_asr.cc:935] ASRService.StreamingRecognize called.
I1202 01:40:22.251327   189 grpc_riva_asr.cc:962] ASRService.StreamingRecognize performing streaming recognition with sequence id: 1375652540
I1202 01:40:22.251346   189 grpc_riva_asr.cc:1019] Using model citrinet-en-US-asr-streaming for inference
I1202 01:40:22.251386   189 grpc_riva_asr.cc:1035] Model sample rate= 16000 for inference
I1202 01:40:22.251515   189 riva_asr_stream.cc:214] Detected format: encoding = 1 numchannels = 1 samplerate = 16000 bitspersample = 16
I1202 01:40:22.406829   189 grpc_riva_asr.cc:1136] ASRService.StreamingRecognize returning OK

I already tried --nn.use_trt_fp32, but I got an error because a layer uses fp16:

Error Code 4: Internal Error (fp16 precision has been set for a layer or layer output, but fp16 is not configured in the builder)
[12/02/2022-03:37:22] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )

So I removed --nn.use_trt_fp32; the model deploys and the server starts successfully, but I still get an empty transcript. How can I solve this?

Hi @junko_ran

Thanks for your interest in Riva

Thanks for sharing the details. Could you also provide the following:

  1. NGC Link to the nemo model used
  2. config.sh

Thanks

Hi @rvinobha!
Sorry for the late reply; my problem is solved!
What I did was change the riva-build arguments to the ones for Citrinet-256 streaming on this pipeline page.
Thanks for the response!

Hi, can you share the command you eventually used?

Hi @emailmeemilyong,
here you go!

riva-build speech_recognition \
   <rmir_filename>:<key> <riva_filename>:<key> \
   --name=citrinet-512-en-US-asr-streaming \
   --ms_per_timestep=80 \
   --featurizer.use_utterance_norm_params=False \
   --featurizer.precalc_norm_time_steps=0 \
   --featurizer.precalc_norm_params=False \
   --endpointing.residue_blanks_at_start=-2 \
   --chunk_size=0.16 \
   --left_padding_size=1.92 \
   --right_padding_size=1.92 \
   --decoder_type=flashlight \
   --flashlight_decoder.asr_model_delay=-1 \
   --decoding_language_model_binary=<lm_binary> \
   --decoding_vocab=<decoder_vocab_file> \
   --flashlight_decoder.lm_weight=0.2 \
   --flashlight_decoder.word_insertion_score=0.2 \
   --flashlight_decoder.beam_threshold=20. \
   --flashlight_decoder.num_tokenization=1 \
   --max_batch_size=1 \
   --featurizer.max_execution_batch_size=1 \
   --endpointing.max_batch_size=1 \
   --nn.opt_batch_size=1 \
   --nn.max_batch_size=1 \
   --flashlight_decoder.max_execution_batch_size=1 \
   --language_code=en-US
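For reference, the chunk and padding values in the command above imply the following per-request sizes at the 16 kHz sample rate reported by the server; this is just arithmetic on the flags, not anything Riva-specific:

```python
SAMPLE_RATE = 16000          # Hz, per the server log
CHUNK_SIZE = 0.16            # s, --chunk_size
LEFT_PAD = RIGHT_PAD = 1.92  # s, --left/right_padding_size
MS_PER_TIMESTEP = 80         # --ms_per_timestep

chunk_samples = int(CHUNK_SIZE * SAMPLE_RATE)              # samples per streamed chunk
timesteps_per_chunk = CHUNK_SIZE * 1000 / MS_PER_TIMESTEP  # model timesteps per chunk
window_sec = LEFT_PAD + CHUNK_SIZE + RIGHT_PAD             # total context per inference

print(chunk_samples, timesteps_per_chunk, window_sec)
```

So each 160 ms chunk is 2560 samples and covers two 80 ms timesteps, with about 4 seconds of total context once padding is included.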

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.