I"m using the CTC conformer models in Spanish (es-US) to do streaming recognition through a telephone line. However, when there is background noise, spurious words appear in ASR transcriptions. In the releases it is mentioned that there is an option to use the neural-based voice activity detector to avoid this problem, how can I use it? Is there any other way to suppress the noise without doing fine-tuning?
Yes, that is the model I am using; however, noise still affects the transcriptions. Sometimes it transcribes background noise when no one is speaking, and other times it transcribes both the noise and the user's audio. Is there a filter or some kind of confidence score for streaming recognition?
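If such a score exists, this is the kind of client-side filter I had in mind (a rough sketch over the riva-client streaming responses; the 0.5 threshold is arbitrary, and I don't know whether the confidence field is actually populated for this pipeline):

```python
MIN_CONFIDENCE = 0.5  # arbitrary cut-off, would need tuning on real calls

def keep_confident_finals(responses):
    """Yield only final transcripts whose confidence clears the threshold.

    `responses` is the generator returned by
    ASRService.streaming_response_generator(...); the confidence field
    may be zero/unpopulated depending on how the pipeline was built.
    """
    for response in responses:
        for result in response.results:
            if not result.is_final:
                continue
            best = result.alternatives[0]
            if best.confidence >= MIN_CONFIDENCE:
                yield best.transcript
            # otherwise treat it as a likely noise-only segment and drop it
```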
I have the same problem. Any news on this? Could using a model like vad_telephony_marblenet solve it, or is vad_telephony_marblenet too outdated and not trained for Spanish?
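If vad_telephony_marblenet can be used as a front-end filter before sending audio to the ASR service, I imagine something along these lines (a rough sketch with NeMo; I'm assuming the model expects 8 kHz audio and has a "speech" label, and the file name, window size and threshold are placeholders):

```python
import torch
import librosa
import nemo.collections.asr as nemo_asr

# Pretrained telephony VAD (frame-level speech vs. background classifier).
vad = nemo_asr.models.EncDecClassificationModel.from_pretrained("vad_telephony_marblenet")
vad.eval()

# Placeholder file, resampled to 8 kHz telephony rate.
audio, sr = librosa.load("call_chunk.wav", sr=8000)

win = int(0.63 * sr)            # short analysis window, as in the MarbleNet examples
labels = list(vad.cfg.labels)   # expected to contain 'speech' and 'background'
speech_idx = labels.index("speech")

speech_mask = []
with torch.no_grad():
    for start in range(0, len(audio) - win + 1, win):
        chunk = torch.tensor(audio[start:start + win]).unsqueeze(0)
        length = torch.tensor([chunk.shape[1]])
        logits = vad(input_signal=chunk, input_signal_length=length)
        prob_speech = torch.softmax(logits, dim=-1)[0, speech_idx].item()
        speech_mask.append(prob_speech > 0.5)  # arbitrary threshold

# Only the windows flagged as speech would be forwarded to the ASR service.
print(f"{sum(speech_mask)}/{len(speech_mask)} windows classified as speech")
```

My understanding is that a VAD like this only classifies speech vs. non-speech, so it shouldn't depend much on the language, but I'd like to confirm whether it is still the recommended option.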