How chunk size, padding size, and other build configs affect the behavior of streaming ASR

Hardware - GPU T4
Hardware - CPU (Unknown, AWS g4dn.xlarge instance)
Operating System - Amazon Linux 2 ECS GPU optimized AMI
Riva Version - 2.0

Our team has been working on optimizing the performance of Riva for our use case: Arabic streaming ASR.

Specifically, we’re trying to improve the latency of intermediate responses and the accuracy of partial transcripts.
We’re currently using a CitriNet model trained with NeMo.
riva-build provides many build parameters, which is great, but there is limited documentation on how each parameter affects the performance of the final model.

We’ve found the parameters that affect the behavior the most are:

  • chunk_size
  • left/right_padding_size
  • vad.vad_stop_history
  • vad.vad_stop_th

The only resources we’ve found are:

  • This Kaldi article on context and chunk size, which I believe covers concepts similar to Riva’s padding and chunk size parameters.
  • This paper from FAIR, though we’re not sure whether its results translate directly to what Riva provides. A similar analysis for Riva would be extremely helpful!

From our experimentation, we’ve observed the following behaviors:

  • Smaller chunk size: Faster response time, more partial transcripts, less accurate final transcripts.
  • Larger chunk size: Slower response time, fewer partial transcripts, more accurate final transcripts (see the rough sketch after this list).
  • Padding size: We’ve had varying results, but longer padding sizes usually result in the model “correcting” itself by modifying previous words.
    Too long, and the model corrects itself “incorrectly” (e.g., a result of “distribution” is later changed to “disruption”).
    Too short, and the final transcript is not as accurate.
  • VAD stop history/threshold: Increasing these allows us to transcribe words with elongation, for example “Sta…aging”, where the “a” is held for 1-2 seconds. While not typical of conventional speech, this is normal for our use case.
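
For what it’s worth, the mental model we’ve been working with (our assumption, not something stated in the Riva docs) is that each streaming step runs inference over left_padding + chunk + right_padding seconds of audio and emits at most one partial result per chunk, so chunk size sets both the update rate and how much new acoustic context arrives per step. A rough sketch of that arithmetic (the chunk/padding values are just examples; ms_per_timestep=80 is what we use for our CitriNet model):

# Assumption (ours): each streaming step sees left_padding + chunk + right_padding
# seconds of audio and produces one partial result per chunk of new audio.
def streaming_numbers(chunk_s, left_pad_s, right_pad_s, ms_per_timestep=80):
    window_s = left_pad_s + chunk_s + right_pad_s        # audio fed to the model per step
    new_timesteps = chunk_s * 1000 / ms_per_timestep     # acoustic frames of *new* audio per step
    min_first_partial_s = chunk_s                        # a full chunk must be buffered before inference
    return window_s, new_timesteps, min_first_partial_s

for chunk, pad in [(0.16, 1.92), (0.8, 1.6), (2.4, 1.6)]:   # example values only
    window, steps, latency = streaming_numbers(chunk, pad, pad)
    print(f"chunk={chunk}s, pad={pad}s -> window={window:.2f}s, "
          f"new timesteps/chunk={steps:.0f}, first partial >= {latency:.2f}s + inference time")

Under that assumption, shrinking the chunk increases how often partials are emitted but gives the acoustic model fewer new frames per update, which matches the accuracy/latency trade-off we’re seeing.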

This seems to be in line with what the documentation describes as “low-latency” vs. “high-throughput” models, but it raises some questions:

  1. What exactly does the padding do in a streaming context, and why doesn’t the right padding cause a delay in transcription?
  2. When we increase the padding size, the transcripts become more stable, but we see words at chunk boundaries being re-transcribed incorrectly. Why does that happen, and what can we do to mitigate or fix it?

We’re hoping someone on the Riva team can provide more context on how these build configurations (and others) affect the accuracy and latency of the deployed model.
Third-party resources would be helpful here as well.

We’d also like to know which model configurations will fail at runtime, since we’ve found some configurations that result in models that don’t work!
An example of such a model is one with the following configuration:

  • chunk_size: 0.16
  • left/right_padding_size: 1.6
  • vad_start_history: 300
  • vad_start_th: 0.2
  • vad_stop_history: 2000
  • vad_stop_th: 0.98
All other values are the default.

This model fails when deployed with Triton with the following error:
UNAVAILABLE: Invalid argument: riva/cbe/asr/feature-extractor/feature-extractor.cc:70]
padding_factor*chunk_size must be greater than left_padding_size+right_padding_size in feature extractor

Configurations like this should be rejected at build time rather than failing at run time.
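
In the meantime, the inequality from the error message can at least be checked before kicking off a build. A minimal sketch (padding_factor is internal to Riva and we don’t know its actual value, so it is an explicit input here; the value 2 below is purely a guess for illustration):

def feature_extractor_ok(chunk_size, left_padding, right_padding, padding_factor):
    # Mirrors the check quoted in feature-extractor.cc:
    # padding_factor * chunk_size must be greater than left_padding + right_padding.
    return padding_factor * chunk_size > left_padding + right_padding

# Our failing config: with chunk_size=0.16 the padding sum of 3.2 s would need
# padding_factor > 20 to pass, which is presumably why the model fails to load.
print(feature_extractor_ok(0.16, 1.6, 1.6, padding_factor=2))  # False
print(feature_extractor_ok(4.8, 1.6, 1.6, padding_factor=2))   # True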


Hi @pineapple9011,

Thanks for your interest in Riva, and apologies for the delay.

I will check on your questions with the team and get back to you.