Develop Smaller Speech Recognition Models with NVIDIA’s NeMo Framework

Originally published at: Develop Smaller Speech Recognition Models with NVIDIA’s NeMo Framework | NVIDIA Technical Blog

As computers and other personal devices have become increasingly prevalent, interest in conversational AI has grown due to its multitude of potential applications in a variety of situations. Each conversational AI framework is comprised of several more basic modules such as automatic speech recognition (ASR), and the models for these need to be lightweight in…

For all the talk about the edge, the article fails to describe inference times or edge hardware requirements. Does it run on a Jetson Nano? A Raspberry Pi? How much RAM does it need? Etc.

Yes, QuartzNet inference in NeMo does run on the Jetson Nano. We never tried a Raspberry Pi, though. Note that QuartzNet is an architecture - e.g. QuartzNet15x5 has B=15 blocks with R=5 sub-blocks within each block. See https://nvidia.github.io/Ne... . To lessen the memory footprint you can choose to have fewer blocks and/or sub-blocks, but then you will have to re-train the model yourself. Another (very effective) way to reduce the memory footprint is to give it audio in shorter segments, as in the sketch below.
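For the segmenting approach, here's a minimal sketch (not NeMo code, just plain Python with the soundfile package, and a hypothetical 15-second segment length) of how you might pre-split long recordings before handing them to the model:

```python
# Minimal sketch: split a long recording into shorter clips so that each
# inference call has a smaller memory footprint. The 15 s segment length
# is an arbitrary example value.
import soundfile as sf

def split_audio(path, segment_seconds=15.0):
    """Yield fixed-length chunks of the recording as (samples, sample_rate)."""
    audio, sr = sf.read(path)
    chunk = int(segment_seconds * sr)
    for start in range(0, len(audio), chunk):
        yield audio[start:start + chunk], sr

# Write the chunks out as separate WAV files and transcribe them one by one.
for i, (samples, sr) in enumerate(split_audio("long_recording.wav")):
    sf.write(f"segment_{i:03d}.wav", samples, sr)
```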

Can I run inference using NeMo with wav2letter? Does the library have methods to do it?

We don't have a wav2letter model in NeMo, but the Jasper model is similar to it.

Sorry, I did not explain myself correctly. I was asking whether the NeMo library has methods to run inference with a model.

Yes. We also provide high-quality pre-trained checkpoints for QuartzNet.

Take a look at https://github.com/NVIDIA/N...
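As a rough example (exact class names and arguments can differ between NeMo versions, so treat this as a sketch and check the docs for the release you're on), inference with a pretrained QuartzNet checkpoint looks roughly like this:

```python
# Sketch of inference with a pretrained QuartzNet checkpoint in NeMo 1.x.
# Model name and audio path are illustrative; the transcribe() signature
# may differ in other NeMo releases.
import nemo.collections.asr as nemo_asr

# Download and restore the pretrained English QuartzNet 15x5 checkpoint.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En"
)

# Greedy CTC transcription of one or more audio files.
transcripts = asr_model.transcribe(paths2audio_files=["sample.wav"])
print(transcripts)
```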

I have a question about retraining the NeMo QuartzNet 15x5. I would like to re-train it on my own dataset, but I'm missing some information. For example, what are the minimum and maximum audio lengths that QuartzNet 15x5 supports? What formats does QuartzNet 15x5 support apart from WAV?

There’s no set minimum or maximum audio length for training (other than what’s limited by your GPU memory), but we tend to use a rule of thumb of 0.1s to around 17s.
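To enforce that rule of thumb, you can filter your data when you build the training manifest. Here's a sketch using the standard NeMo ASR manifest fields (audio_filepath, duration, text); the file paths and transcripts are placeholders:

```python
# Sketch: write a NeMo-style JSON-lines manifest, dropping clips outside
# the ~0.1 s to ~17 s rule of thumb. Uses the soundfile package to read
# durations; paths and transcripts are placeholders.
import json
import soundfile as sf

MIN_DUR, MAX_DUR = 0.1, 17.0

def write_manifest(utterances, manifest_path="train_manifest.json"):
    """utterances: iterable of (audio_filepath, transcript) pairs."""
    with open(manifest_path, "w") as fout:
        for audio_path, text in utterances:
            duration = sf.info(audio_path).duration
            if MIN_DUR <= duration <= MAX_DUR:
                entry = {"audio_filepath": audio_path,
                         "duration": duration,
                         "text": text}
                fout.write(json.dumps(entry) + "\n")

write_manifest([("clips/utt001.wav", "hello world")])
```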

As for other formats, this PR introduced support for additional audio formats (including MP3, Ogg, etc.) via Pydub. WAV is the best-tested audio format, so please let us know if you run into bugs/problems with any other formats by opening an issue.
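If you'd rather stick with the best-tested path, you can also just convert everything to WAV up front. Here's a small sketch with pydub (it assumes ffmpeg is installed; 16 kHz mono is what the pretrained English QuartzNet checkpoints are trained on):

```python
# Sketch: normalize arbitrary input formats (MP3, Ogg, ...) to 16 kHz mono WAV
# with pydub before training or inference. Requires ffmpeg on the system.
from pydub import AudioSegment

def to_wav(src_path, dst_path, sample_rate=16000):
    """Convert any format pydub/ffmpeg can decode into a mono WAV file."""
    audio = AudioSegment.from_file(src_path)
    audio = audio.set_frame_rate(sample_rate).set_channels(1)
    audio.export(dst_path, format="wav")

to_wav("utterance.mp3", "utterance.wav")
```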


Hello everyone,

Per “QuartzNet replaces Jasper’s 1D convolutions with 1D time-channel separable convolutions, which use many fewer parameters.”, what is the exact difference between the two? I might’ve gotten this wrong, but a 1D time-channel separable convolution is basically two 1D convolutions, one over time and one over channels, right? What about regular 1D convolutions, though? What do they consist of?

Also,

…with five blocks that repeat fifteen times plus four additional convolutional layers

should be “…with five blocks that repeat three times with five sub-blocks each, plus four additional convolutional layers”, right?

Thanks in advance!

Hi! 1D convolutions and 1D time-channel separable convolutions perform a roughly comparable operation (across time and channel of the input), but the latter splits it into two steps to save on parameters.

In a normal 1D convolution, you’ll have K*c_in*c_out params, since each of the c_out output channels needs a kernel with K params for every one of the c_in input channels.

In a 1D time-channel separable convolution, we do the 1D convolution across each channel separately (c_in*K params), then a pointwise convolution for each time frame across all channels (c_in*c_out params), for a total of c_in*K + c_in*c_out parameters. It’s a little messy to try to explain without any images, so here’s a diagram I made a while ago that might help visualize this:
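Alongside the diagram, here's also a quick PyTorch sketch that reproduces those parameter counts (kernel size and channel counts are arbitrary illustrative values, and bias is disabled so the totals match the formulas exactly):

```python
# Quick parameter-count check in PyTorch. K, c_in, c_out are arbitrary
# example values; bias=False so the totals match the formulas above.
import torch.nn as nn

K, c_in, c_out = 33, 256, 256

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# Regular 1D convolution: K * c_in * c_out parameters.
regular = nn.Conv1d(c_in, c_out, kernel_size=K, bias=False)

# Time-channel separable: depthwise (c_in * K) + pointwise (c_in * c_out).
depthwise = nn.Conv1d(c_in, c_in, kernel_size=K, groups=c_in, bias=False)
pointwise = nn.Conv1d(c_in, c_out, kernel_size=1, bias=False)

print(n_params(regular))                          # 33 * 256 * 256 = 2,162,688
print(n_params(depthwise) + n_params(pointwise))  # 256*33 + 256*256 = 73,984
```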

Re the 15, yep, that’s a typo… That should be five blocks that repeat three times each.

It’s been a few years and things are fuzzier than I’d like, but hopefully this helps!