Develop Smaller Speech Recognition Models with NVIDIA’s NeMo Framework

Originally published at:

As computers and other personal devices have become increasingly prevalent, interest in conversational AI has grown due to its multitude of potential applications in a variety of situations. Each conversational AI framework is comprised of several more basic modules such as automatic speech recognition (ASR), and the models for these need to be lightweight in…

For all the talk about the edge, the article fails to describe inference times or edge hardware requirements. Does it run on a Jetson Nano? A Raspberry Pi? How much RAM does it need? Etc.

Yes, QuartzNet inference in NeMo does run on Jetson Nano. We have never tried it on a Raspberry Pi, though. Note that QuartzNet is an architecture: for example, QuartzNet15x5 has B=15 blocks with R=5 sub-blocks within each block. See . To reduce the memory footprint, you can choose to have fewer blocks and/or sub-blocks, but then you will have to retrain the model yourself. Another (very effective) way to reduce the memory footprint is to feed it audio in shorter segments.
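To illustrate the "shorter segments" suggestion, here is a minimal sketch in plain Python. It is not part of NeMo: the function name `split_into_segments` and its parameters are hypothetical, and it assumes the audio is already loaded as a flat sequence of samples at a known sample rate. It cuts at fixed positions, so a word may be split across two segments. Cutting on silence would avoid that but is not shown here.

```python
def split_into_segments(samples, sample_rate, max_seconds):
    """Split a sequence of audio samples into consecutive chunks.

    Each chunk holds at most `max_seconds` of audio; the last chunk
    may be shorter. Hypothetical helper, not a NeMo API.
    """
    if sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")

    # Number of samples that fit into one segment.
    chunk_size = int(sample_rate * max_seconds)
    if chunk_size < 1:
        raise ValueError("max_seconds is too short for this sample rate")

    # Slice the input into back-to-back, non-overlapping chunks.
    return [
        samples[start:start + chunk_size]
        for start in range(0, len(samples), chunk_size)
    ]
```

Each returned segment can then be passed to the model separately, so peak memory depends on the segment length rather than on the length of the whole recording.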

Can I run inference with wav2letter using NeMo? Does the library have methods to do it?

We don't have a wav2letter model in NeMo, but the Jasper model is similar to it.

Sorry, I did not explain myself correctly. I meant to ask whether the NeMo library has methods for running inference with a model.

Yes. We also provide high-quality pre-trained checkpoints for QuartzNet.

Take a look at

I have a question about retraining the NeMo QuartzNet 15x5. I would like to retrain it on my own dataset, but I am missing some basic information. For example, what are the minimum and maximum audio lengths that QuartzNet 15x5 supports? Which formats does it support apart from WAV?

There’s no set minimum or maximum audio length for training (other than what’s limited by your GPU memory), but we tend to use a rule of thumb of 0.1s to around 17s.

As for other formats, this PR introduced support for additional audio formats (including MP3, Ogg, etc.) via Pydub. Wav is the most well-tested audio format, so please let us know if you run into bugs/problems with any other formats by opening an issue.
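As a rough sketch of applying the 0.1s to 17s rule of thumb to your own data before training, the snippet below measures WAV durations with Python's standard `wave` module and drops clips outside that range. The helper names and the `audio_filepath` and `duration` keys are assumptions made for this example, not something confirmed in this thread, so adapt them to however your dataset is described.

```python
import wave

# Rule-of-thumb bounds mentioned above, in seconds.
MIN_SECONDS = 0.1
MAX_SECONDS = 17.0


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds (standard library only)."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def filter_by_duration(entries, min_seconds=MIN_SECONDS, max_seconds=MAX_SECONDS):
    """Keep only entries whose 'duration' lies within [min_seconds, max_seconds].

    `entries` is a list of dicts with a 'duration' key in seconds
    (key name assumed for illustration).
    """
    return [
        entry for entry in entries
        if min_seconds <= entry["duration"] <= max_seconds
    ]
```

For example, you could build one dict per clip with `{"audio_filepath": path, "duration": wav_duration_seconds(path)}` and pass the list through `filter_by_duration` before training.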
