ASR - Conformer-CTC: Audio file length and sampling rate

Is there a duration limit for the audio file? What is the maximum audio length that the model can handle?
→ Conformer-CTC uses self-attention, which needs significant memory for long sequences. We trained the model on sequences of up to 20 s. It also works on longer sequences, but available memory limits how long they can be. For such long sequences, two options are available:
1- Segment the audio into smaller parts, run inference on each part, and merge the results.
2- Use Citrinet, whose memory consumption is linear in the input length, so it can handle long audio in one shot.
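Option 1 above can be sketched roughly as follows. This is only an illustration, not NeMo code: `chunk_bounds`, `transcribe_long`, and the `transcribe_fn` callback are hypothetical names, and the 20 s default simply mirrors the training length mentioned above. It cuts at fixed positions and joins the texts with spaces, so a word that straddles a cut may be split; cutting at silences or using overlapping chunks would be more robust.

```python
from typing import Callable, List, Sequence, Tuple


def chunk_bounds(num_samples: int, sample_rate: int = 16000,
                 chunk_s: float = 20.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of consecutive fixed-length chunks."""
    chunk_len = int(chunk_s * sample_rate)
    if chunk_len <= 0:
        raise ValueError("chunk_s * sample_rate must be at least one sample")
    return [(start, min(start + chunk_len, num_samples))
            for start in range(0, num_samples, chunk_len)]


def transcribe_long(samples: Sequence[float],
                    transcribe_fn: Callable[[Sequence[float]], str],
                    sample_rate: int = 16000,
                    chunk_s: float = 20.0) -> str:
    """Segment the audio, transcribe each chunk, and merge the texts.

    transcribe_fn is a placeholder for whatever runs the ASR model on one
    chunk of samples and returns its transcript (an assumption, not a real API).
    """
    pieces = []
    for start, end in chunk_bounds(len(samples), sample_rate, chunk_s):
        text = transcribe_fn(samples[start:end]).strip()
        if text:  # skip chunks that produced no text (e.g. silence)
            pieces.append(text)
    return " ".join(pieces)
```

With a 50 s recording at 16 kHz, this yields two 20 s chunks and one 10 s chunk, each small enough to stay within the length range the model was trained on.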

Regarding the sample rate: the Conformer requires 16 kHz. Should 8 kHz data be resampled to 16 kHz in order to meet this requirement? In this notebook, it is mentioned that loading the weights of a 16 kHz model as initialization helps the model converge faster and with better accuracy. Does the 16 kHz model perform better on 8 kHz data (resampled to 16 kHz) than the 8 kHz model?
→ The best accuracy is obtained with a model that was initialized with weights from one of our pretrained models. This is true for both upsampled 16 kHz data and 8 kHz data. Upsampling the data and keeping the model at 16 kHz also helps the fine-tuning converge faster.
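To make the upsampling step concrete, here is a minimal pure-Python sketch of 2x upsampling (8 kHz to 16 kHz) with a Hann-windowed sinc interpolator. The function name `upsample_2x` and the `half_width` parameter are made up for this example. In practice a library resampler such as `scipy.signal.resample_poly` would be the usual choice. Note that upsampling does not add any content above 4 kHz; it only changes the sample rate to the one the model expects.

```python
import math
from typing import List, Sequence


def upsample_2x(samples: Sequence[float], half_width: int = 8) -> List[float]:
    """Double the sample rate (e.g. 8 kHz -> 16 kHz) by band-limited interpolation.

    Even output samples are the original samples. Odd output samples are
    interpolated from half_width neighbours on each side; samples beyond the
    ends of the signal are treated as zero.
    """
    n = len(samples)
    span = 2 * half_width
    # Kernel weights at odd offsets k = 1, 3, 5, ... (the kernel is symmetric).
    weights = []
    for j in range(half_width):
        k = 2 * j + 1
        sinc = math.sin(math.pi * k / 2) / (math.pi * k / 2)
        hann = 0.5 * (1.0 + math.cos(math.pi * k / span))
        weights.append(sinc * hann)

    out = [0.0] * (2 * n)
    for i in range(n):
        out[2 * i] = float(samples[i])  # original sample is kept as-is
        acc = 0.0
        for j, w in enumerate(weights):
            left = samples[i - j] if i - j >= 0 else 0.0
            right = samples[i + j + 1] if i + j + 1 < n else 0.0
            acc += w * (left + right)
        out[2 * i + 1] = acc  # new sample halfway between i and i + 1
    return out
```

The output has exactly twice as many samples as the input, so a clip keeps its duration when it is interpreted at 16 kHz.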
For CTC models, to transcribe very long audio sequences, you can follow the streaming tutorial here.
To use it on a large dataset, you can use: link
For Transducer models, we currently do not support streaming inference in NeMo, but we are actively working on it. Also note that Transducer models are not currently supported in Riva.

Thank you for sharing the links.

I was working on training the Conformer-CTC XLarge model. It is a huge model, and I could only fit a batch size of 2, even after extensive data segmentation to cut phrase segments down to under about 8 seconds. Even then, a batch size of 2 is all I could do with my 3090s. Luckily, with Lightning I could use both of my 3090s, so it could go kind of fast. Even then, the predictions during training were pretty terrible.

Also, it seems like Conformer hallucinates a lot. Citrinet is much more reliable, at least on noisy audio.

Is this a result of my low batch size? Does it need a larger batch size to predict well during the training process?