ASR - Conformer -CTC: Audio File length and sampling rate

ybenkhoui1 · October 26, 2021, 4:59pm

Is there a duration limit for the audio file? What is the maximum length that a model can handle?
→ Conformer-CTC uses self-attention, which needs significant memory for long sequences. We trained the model on sequences of up to 20s; it works on longer sequences, but memory may not allow going much longer. For such long sequences, two options are available:
1- Segment the sequence into smaller parts, perform inference on each part, and merge the results.
2- Use Citrinet, as its memory consumption is linear in the input length and it can handle long audio in one shot.
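
A minimal sketch of option 1, assuming the audio is already loaded as a flat sequence of samples. The names split_into_chunks, transcribe_long, and transcribe_chunk are hypothetical and not part of NeMo or Riva; transcribe_chunk stands in for whatever call runs your model on one segment. The 20s default simply mirrors the training length mentioned above.

```python
from typing import Callable, List, Sequence, Tuple


def split_into_chunks(num_samples: int, sample_rate: int = 16000,
                      chunk_s: float = 20.0,
                      overlap_s: float = 0.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices that cover [0, num_samples)."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk")
    spans = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        spans.append((start, end))
        if end == num_samples:
            break  # the last chunk reached the end of the audio
        start += step
    return spans


def transcribe_long(samples: Sequence[float],
                    transcribe_chunk: Callable[[Sequence[float]], str],
                    sample_rate: int = 16000,
                    chunk_s: float = 20.0) -> str:
    """Transcribe each chunk on its own and join the non-empty texts."""
    spans = split_into_chunks(len(samples), sample_rate, chunk_s)
    texts = [transcribe_chunk(samples[s:e]) for s, e in spans]
    return " ".join(t for t in texts if t)
```

Cutting at fixed boundaries can split a word in two, so in practice you would probably cut at silences or use overlapping chunks and reconcile the overlapping text, which this sketch does not attempt.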

Regarding the sample rate: Conformer requires 16 kHz. For data that is 8 kHz, should it be resampled to 16 kHz in order to meet the requirements? In this notebook, it is mentioned that loading the weights of a 16 kHz model as initialization helps the model converge faster with better accuracy. Does the 16 kHz model perform better on 8 kHz data (that was resampled to 16 kHz) than the 8 kHz model?
→ The best accuracy is obtained with a model that was initialized with weights from one of our pretrained models; this is true for both upsampled 16 kHz data and 8 kHz data. Upsampling the data and keeping the model at 16 kHz also helps the fine-tuning converge faster.
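
To illustrate what upsampling 8 kHz data to 16 kHz means at the sample level, here is a rough sketch using plain linear interpolation. The function name upsample_2x_linear is hypothetical, and linear interpolation is a crude stand-in chosen only to keep the example dependency-free; for real training data you would want a proper band-limited resampler from an audio library.

```python
from typing import List, Sequence


def upsample_2x_linear(samples: Sequence[float]) -> List[float]:
    """Double the sample rate (e.g. 8 kHz -> 16 kHz) by inserting the
    midpoint between each pair of neighbouring samples."""
    out: List[float] = []
    for i, x in enumerate(samples):
        out.append(float(x))
        # The last sample has no right-hand neighbour, so it is repeated.
        nxt = samples[i + 1] if i + 1 < len(samples) else x
        out.append((x + nxt) / 2.0)
    return out
```

The output has exactly twice as many samples as the input, so one second of 8 kHz audio (8000 samples) becomes 16000 samples.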
For CTC models, in order to transcribe very long audio sequences, you can follow the streaming tutorial here.
To run this on a large dataset, you can use: link
For Transducer models, we currently do not support streaming inference in NeMo, but we are actively working on it. Also note that Transducer models are not currently supported in Riva.

spolisetty · October 27, 2021, 4:29am

Thank you for sharing the links.

ryein · April 24, 2023, 7:36pm

I was working on training with the Conformer-CTC XLarge model. It is a huge model, and I could only fit a batch size of two, even after extensive data segmentation to cut phrase segments down to under about 8 seconds. That batch size of 2 is all I could do with my 3090s. Luckily, with Lightning I could use both of my 3090s, so it could go kinda fast. Even then, the predictions during training were pretty terrible.

Also, it seems like Conformer hallucinates a lot. Citrinet is much more reliable, at least on noisy audio.

Is this a result of my low batch size? Does it need a larger batch size to predict well during the training process?
