Utterance boundary detection in NIM Riva ASR (RNNT streaming)

Hi,

I am starting to evaluate NVIDIA NIM Riva ASR and have a basic question
about utterance boundary detection.

I found the section of the Riva ASR documentation on
"Beginning / End of Utterance Detection" (stop_history, silence_duration_ms, etc.)
and was wondering whether similar controls are available when using NIM,
especially with Parakeet RNNT Multilingual in streaming mode.

From a quick test, it seems that FINAL results are not always emitted,
so I am not sure whether these endpointing parameters are supported in NIM.

Could you clarify:

  • Are utterance boundary / endpointing parameters configurable in NIM?
  • Or is it expected that applications handle end-of-utterance detection
    outside of NIM when using RNNT models?

Thanks in advance.

Client-side configuration of stop_history is already supported. We are working on improving it, and the improvements will be available in the next Parakeet-RNNT NIM release.
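If an application does need to handle end-of-utterance detection on its own side (the second option asked about above), one simple fallback is to treat the interim transcript as final once it has stopped changing for a while. The sketch below is purely illustrative: `StallEndpointer`, `update()` and the 800 ms default are hypothetical names and values, not part of Riva or NIM, and this "stalled partial" heuristic is a swapped-in technique, not how Riva's stop_history endpointing works internally. It assumes the client calls `update()` on every interim result and also periodically (for example, on every audio chunk sent), so that the timer can fire even when the server sends nothing during silence.

```python
from typing import Optional


class StallEndpointer:
    """Hypothetical client-side end-of-utterance fallback (not a Riva/NIM API).

    Treats the interim transcript as final once it has not changed for
    `silence_duration_ms`. The text already emitted is remembered, so a
    server-side partial that keeps growing is not finalized twice.
    """

    def __init__(self, silence_duration_ms: int = 800) -> None:
        self.silence_duration_ms = silence_duration_ms
        self._committed = ""  # full partial text at the time of the last emit
        self._pending = ""  # text added since the last emitted utterance
        self._last_change_ms: Optional[int] = None

    def update(self, now_ms: int, partial: str) -> Optional[str]:
        """Feed the latest interim transcript (or re-feed it as a timer tick).

        Returns the finalized utterance when the stall threshold is crossed,
        otherwise None.
        """
        partial = partial.strip()
        c = self._committed
        # Strip the prefix that was already finalized on an earlier call.
        if c and partial == c:
            new_text = ""
        elif c and partial.startswith(c + " "):
            new_text = partial[len(c) + 1:].strip()
        else:
            new_text = partial

        if new_text != self._pending:
            # Transcript changed: speech is still ongoing, restart the timer.
            self._pending = new_text
            self._last_change_ms = now_ms
            return None

        if (self._pending and self._last_change_ms is not None
                and now_ms - self._last_change_ms >= self.silence_duration_ms):
            final = self._pending
            self._committed = partial
            self._pending = ""
            self._last_change_ms = None
            return final
        return None
```

For example, with `StallEndpointer(silence_duration_ms=500)`, feeding "hello world" at 200 ms and again at 700 ms returns "hello world". A later partial "hello world how are you" that then stays unchanged for 500 ms returns only "how are you". The design trades accuracy for simplicity: it cannot tell a pause in speech from a slow server, and partials that revise earlier words will be treated as a new utterance, so server-side endpointing (stop_history) remains preferable once it behaves as needed.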