Is there a duration limit for the audio files? What is the maximum length a model can handle?
→ Conformer-CTC uses self-attention, which needs significant memory for long sequences. We trained the model with sequences up to 20 s; it works on longer sequences, but memory may not allow going much larger. For such long sequences, two options are available:
1- Segment the audio into smaller parts, run inference on each, and merge the results.
2- Use Citrinet: its memory consumption is linear in the input length, so it can handle long audio in one shot.
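Option 1 above can be sketched as follows. This is a minimal illustration, not a NeMo API: `transcribe_chunk` is a hypothetical placeholder for a real ASR call (e.g. a loaded CTC model's transcription method), and naive concatenation can split or duplicate words at chunk boundaries, so a real pipeline would use a small overlap and a smarter merge.

```python
import numpy as np

def chunk_audio(samples, sample_rate, chunk_seconds=20.0):
    """Split a 1-D array of samples into fixed-length chunks."""
    size = int(chunk_seconds * sample_rate)
    return [samples[start:start + size] for start in range(0, len(samples), size)]

def transcribe_long(samples, sample_rate, transcribe_chunk):
    """Transcribe each chunk independently and merge the text.

    `transcribe_chunk` is a placeholder for a real ASR call on one
    chunk of audio; this sketch just joins the per-chunk transcripts.
    """
    chunks = chunk_audio(samples, sample_rate)
    return " ".join(transcribe_chunk(c) for c in chunks).strip()

# 65 s of audio at 16 kHz splits into three 20 s chunks plus a 5 s tail.
audio = np.zeros(65 * 16000, dtype=np.float32)
print(len(chunk_audio(audio, 16000)))  # 4 chunks
```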
Regarding the sample rate: Conformer requires 16 kHz. For 8 kHz data, should it be resampled to 16 kHz to meet the requirement? In this notebook, it is mentioned that loading the weights of a 16 kHz model as initialization helps the model converge faster with better accuracy. Does the 16 kHz model perform better on 8 kHz data (upsampled to 16 kHz) than an 8 kHz model?
→ The best accuracy is obtained with a model initialized from the weights of one of our pretrained models; this holds for both upsampled 16 kHz data and 8 kHz data. Upsampling the data and keeping the model at 16 kHz also helps the fine-tuning converge faster.
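The 8 kHz → 16 kHz upsampling step can be sketched with plain NumPy linear interpolation. This is a minimal illustration only; production pipelines typically use a proper polyphase resampler (e.g. `scipy.signal.resample_poly`, sox, or torchaudio) to avoid interpolation artifacts.

```python
import numpy as np

def upsample_8k_to_16k(samples):
    """Upsample 8 kHz audio to 16 kHz by linear interpolation.

    A minimal sketch: every original sample is kept, and one new
    sample is interpolated between each pair of neighbors.
    """
    n = len(samples)
    old_t = np.arange(n) / 8000.0          # original sample times (s)
    new_t = np.arange(2 * n) / 16000.0     # target sample times (s)
    return np.interp(new_t, old_t, samples).astype(np.float32)

# A 100 ms tone at 8 kHz becomes 100 ms of 16 kHz audio (twice the samples).
tone = np.sin(2 * np.pi * 100 * np.arange(800) / 8000.0)
print(upsample_8k_to_16k(tone).shape)  # (1600,)
```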
For CTC models, you can transcribe very long audio by following the streaming tutorial here.
To run inference on a large dataset, you can use: link
For Transducer models, we do not currently support streaming inference in NeMo, but we are actively working on it. Also note that Transducer models are not currently supported in Riva.