VAD / Endpointing configurations

Hardware - GPU: T4
Hardware - CPU: x86_64
Operating System: Debian GNU/Linux 11 (bullseye)
Riva Version: 2.12.1

Hi! I’m building a speech-recognition pipeline using the en-US Conformer model in the high-throughput configuration.
I’m also using the MarbleNet VAD, and I have set the endpointing stop_history to 2000 ms.

I have some questions/issues I would love to resolve:

  1. The ASR/VAD is extremely sensitive to speech, even when the speaker is very far from the microphone. Is there a configuration option to reduce the sensitivity so that the speaker has to be louder to be picked up (i.e. to filter out background noise / babble)?
  2. Is there a way to force a final result after X seconds? For example, after 20 seconds, finalize and return a final result for the current recognition stream?
  3. I saw that there is a NeMo MarbleNet telephony VAD, which I converted, but its feature dimensions do not seem to match those of the Conformer. Is there something specific I need to do to make them work together?
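
To make items 1 and 2 concrete, here is a rough client-side sketch of the behaviour I am after, in case there is no server-side setting for it. It is not Riva API: the function names, the 16 kHz 16-bit mono PCM format, and the -35 dBFS threshold are all assumptions for illustration. The first part replaces quiet chunks with silence before they are sent, and the second part groups chunks into segments of at most X seconds, so that each segment could be sent as its own streaming request and should get its own final result when that stream is closed.

```python
import math
import struct
from typing import Iterable, Iterator, List

SAMPLE_RATE_HZ = 16000   # assumption: 16 kHz mono audio
BYTES_PER_SAMPLE = 2     # assumption: 16-bit little-endian PCM


def rms_dbfs(chunk: bytes) -> float:
    """Return the RMS level of a PCM chunk in dBFS (0 dBFS = full scale)."""
    n = len(chunk) // BYTES_PER_SAMPLE
    if n == 0:
        return -math.inf
    samples = struct.unpack(f"<{n}h", chunk[: n * BYTES_PER_SAMPLE])
    rms = math.sqrt(sum(s * s for s in samples) / n)
    return 20 * math.log10(rms / 32768.0) if rms > 0 else -math.inf


def gate_quiet_audio(
    chunks: Iterable[bytes], threshold_dbfs: float = -35.0
) -> Iterator[bytes]:
    """Pass loud chunks through; replace quiet ones with equal-length silence."""
    for chunk in chunks:
        if rms_dbfs(chunk) >= threshold_dbfs:
            yield chunk
        else:
            yield bytes(len(chunk))  # zeros keep the stream timing intact


def split_by_max_duration(
    chunks: Iterable[bytes], max_seconds: float = 20.0
) -> Iterator[List[bytes]]:
    """Group chunks into segments holding at most max_seconds of audio."""
    max_bytes = int(max_seconds * SAMPLE_RATE_HZ) * BYTES_PER_SAMPLE
    segment: List[bytes] = []
    size = 0
    for chunk in chunks:
        # Start a new segment when this chunk would exceed the limit.
        if segment and size + len(chunk) > max_bytes:
            yield segment
            segment, size = [], 0
        segment.append(chunk)
        size += len(chunk)
    if segment:
        yield segment
```

A plain energy gate like this cannot tell distant speech from nearby noise at the same level, and cutting at a fixed duration can split a word in half, so I would much prefer a proper configuration option if one exists.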

Many thanks.

Additionally, is the correct approach to use both endpointing and VAD together, or only one of them? The documentation is quite sparse on this.

Do I need to specify anything in the configuration at inference time, or will the VAD be used automatically if it was passed to the riva-build speech_recognition parameters?