MarbleNet VAD for real-time streaming applications


I have been trying to use NVIDIA’s MarbleNet for voice activity detection on real-time audio and have run into some trouble.

I am following the notebook from NeMo’s GitHub repository, specifically the part about online microphone inference. When testing with some of my own data, I get inconsistent results. The speech and non-speech probabilities are very close to each other, so each verdict is reached by a very thin margin (around 0.01). Increasing the threshold to anything above 0.5 results in constant non-speech labels.
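
To make the symptom concrete, here is a minimal sketch of the decision step as I understand it. It is not the notebook's code: the `softmax` and `is_speech` helpers, the `[non_speech, speech]` logit order, and the example logit values are all assumptions chosen only to mimic the thin margin I am seeing.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def is_speech(logits, threshold=0.5):
    """Return (label, p_speech) for one frame.

    Assumes logits are ordered [non_speech, speech]; this order is
    an assumption for illustration, not taken from the notebook.
    """
    p_speech = softmax(logits)[1]
    return p_speech >= threshold, p_speech

# Made-up logits that are nearly equal, so p_speech stays close to 0.5.
frames = [[0.02, 0.04], [0.05, 0.01], [0.00, 0.03]]

labels_at_05 = [is_speech(f, 0.5)[0] for f in frames]
labels_at_06 = [is_speech(f, 0.6)[0] for f in frames]

print("threshold 0.5:", labels_at_05)
print("threshold 0.6:", labels_at_06)
```

With logits this close together, every speech probability sits within about 0.01 of 0.5, so any threshold noticeably above 0.5 labels every frame as non-speech, which matches what I observe.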

Any insights are welcome!