Why is the CER much higher when serving a NeMo model in Riva?

Hardware - GPU L4
Riva Version - 1.18.0
NeMo - 1.23.0
Nemo2Riva - 1.18.0

I fine-tuned the parakeet-tdt_ctc-0.6b-ja model on my custom dataset and got a test/validation character error rate (CER) of ~17% with the CTC decoder.
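For reference, this is roughly how I measure the NeMo-side CER (a sketch with placeholder paths; I'm assuming NeMo's speech_to_text_eval.py here, which as far as I know accepts decoder_type and use_cer for hybrid models):

# evaluate the fine-tuned hybrid model using its CTC branch and report CER
python NeMo/examples/asr/speech_to_text_eval.py \
    model_path=/models/parakeet_tdt_ctc_0.6b_ja_finetuned.nemo \
    dataset_manifest=/data/test_manifest.json \
    decoder_type=ctc \
    use_cer=true \
    batch_size=16 \
    output_filename=/tmp/nemo_ctc_predictions.json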

After getting this CER, I:

  1. Extract the CTC head from this NeMo model
  2. Convert it to .riva (see the sketch after this list)
  3. Build an offline Riva model with a greedy decoder
  4. Serve the Riva model
  5. Transcribe the same test/validation dataset with this offline Riva-deployed model
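Step 2 looks roughly like this (a sketch; the file names are placeholders, and tlt_encode is the same key I pass to riva-build below):

# convert the extracted CTC-only .nemo checkpoint to a .riva archive
nemo2riva --out /servicemaker-dev/parakeet_ctc_ja.riva \
    --key tlt_encode \
    /models/parakeet_ctc_ja.nemo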

This time I get a CER of ~27%. I have no clue why there is a ~10% (absolute) CER jump for the Riva model. Is this expected behavior?
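To get the Riva-side numbers, I transcribe every test file through the deployed server and score the results against the same references (a sketch, assuming the offline transcription example from NVIDIA's riva python-clients repo; the server address and paths are placeholders):

# one file at a time; I collect the transcripts and compute CER with the same metric as above
python python-clients/scripts/asr/transcribe_file_offline.py \
    --server localhost:50051 \
    --language-code ja-JP \
    --input-file /data/test_wavs/utt_0001.wav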

For the streaming low-latency configuration, the CER jumped to ~33%. For the Riva build, I use the default configuration from the Riva pipeline configs.

I also tried fine-tuning a Conformer-CTC model and see similar behavior: a 10 to 15% CER jump in the Riva model.

If you need any other information regarding this, please let me know.

Thanks for your help.

Can you share the build command you used?

@mayjain Thanks for your reply.

I use this riva-build command for offline STT:

riva-build speech_recognition -f \
    "/servicemaker-dev/$RMIR_MODEL:tlt_encode"\
    "/servicemaker-dev/$RIVA_MODEL:tlt_encode"\
    --offline \
    --name=parakeet-0.6b-unified-ml-cs-es-ja-JP-asr-offline \
    --return_separate_utterances=True \
    --featurizer.use_utterance_norm_params=False \
    --featurizer.precalc_norm_time_steps=0 \
    --featurizer.precalc_norm_params=False \
    --ms_per_timestep=80 \
    --endpointing.residue_blanks_at_start=-16 \
    --nn.fp16_needs_obey_precision_pass \
    --unified_acoustic_model \
    --chunk_size=4.8 \
    --left_padding_size=1.6 \
    --right_padding_size=1.6 \
    --featurizer.max_batch_size=256 \
    --featurizer.max_execution_batch_size=256 \
    --decoder_type=greedy \
    --greedy_decoder.asr_model_delay=-1 \
    --language_code=ja-JP \
    --force
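After the build, I deploy and serve it with the standard steps (a sketch; the model repository path is a placeholder):

# write the Triton model repository from the RMIR built above
riva-deploy -f "/servicemaker-dev/$RMIR_MODEL:tlt_encode" /data/models
# the server is then started from the quickstart scripts (config.sh points at /data/models)
bash riva_start.sh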

@mayjain I also tried the streaming config, using the same riva-build invocation with these options:

--name=parakeet-0.6b-unified-ml-cs-es-ja-JP-asr-streaming \
    --return_separate_utterances=False \
    --featurizer.use_utterance_norm_params=False \
    --featurizer.precalc_norm_time_steps=0 \
    --featurizer.precalc_norm_params=False \
    --ms_per_timestep=80 \
    --endpointing.residue_blanks_at_start=-16 \
    --nn.fp16_needs_obey_precision_pass \
    --unified_acoustic_model \
    --chunk_size=0.32 \
    --left_padding_size=3.92 \
    --right_padding_size=3.92 \
    --decoder_type=greedy \
    --greedy_decoder.asr_model_delay=-1 \
    --append_space_to_transcripts=False \
    --language_code=ja-JP \
    --force

For this configuration I get a CER of ~28%.

To extract the CTC head, I use this script from the NeMo repo: NeMo/examples/asr/asr_hybrid_transducer_ctc/helpers/convert_nemo_asr_hybrid_to_ctc.py

Can you try our latest Parakeet CTC NIM containers?
I am not seeing such a spike in CER with the latest containers.

Could you please share the Parakeet NIM model for the Japanese language?

For the Riva build, I use this image:

nvcr.io/nvidia/riva/riva-speech:2.18.0

You can check out the NIM docs on how to deploy a custom NIM.
CONTAINER_ID = parakeet-1-1b-ctc-en-us
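A deployment sketch following the ASR NIM quick-start pattern (please verify the registry path, ports, and flags against the NIM docs):

# NGC_API_KEY must be exported in the shell beforehand
export CONTAINER_ID=parakeet-1-1b-ctc-en-us
docker run -it --rm --name=$CONTAINER_ID \
    --runtime=nvidia \
    --gpus '"device=0"' \
    --shm-size=8GB \
    -e NGC_API_KEY \
    -e NIM_HTTP_API_PORT=9000 \
    -e NIM_GRPC_API_PORT=50051 \
    -p 9000:9000 \
    -p 50051:50051 \
    nvcr.io/nim/nvidia/$CONTAINER_ID:latest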

Thanks. I will try the NIM deployment and let you know how it goes.