Hi team, I’m facing a very common issue: my input audio for transcription using stt_en_conformer_ctc_small is too long.
Searching the forum, the usual answer is something like this:
(…) workaround would be chopping your audio file into several shorter clips and running the program.
What’s the state-of-the-art technique you are using for splitting the audio into segments without losing any spoken words? Do you have a working example?
I can see there is an example like this: NeMo/tutorials/asr/Streaming_ASR.ipynb at stable · NVIDIA/NeMo · GitHub, but I’ll need to keep the timestamps because I’ll merge the transcription with the speaker diarization output.
Hi, to answer your question: in general, for splitting audio into segments you can use the NeMo Forced Aligner (NFA). There is some information on it here (NeMo/tools/nemo_forced_aligner at main · NVIDIA/NeMo · GitHub) and a tutorial here (Google Colab).
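For reference, NFA is run as a script over a NeMo-style manifest. The snippet below is only a rough sketch, assuming a local NeMo checkout; the file paths and the example transcript are placeholders, and forced alignment needs a "text" field to align against (check the NFA README for the exact options your NeMo version supports).

```python
# Rough sketch (not verified against your NeMo version): build a one-line
# NeMo manifest and point the NFA script at it. Paths are placeholders.
import json

entry = {
    "audio_filepath": "long_recording.wav",          # placeholder audio path
    "text": "reference transcript for this audio",   # NFA aligns the audio to this text
}
with open("manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")

# Then, from a shell (NeMo/ is a local checkout of the repo):
#   python NeMo/tools/nemo_forced_aligner/align.py \
#       pretrained_name="stt_en_fastconformer_ctc_large" \
#       manifest_filepath="manifest.json" \
#       output_dir="nfa_output/"
# The output directory will contain word- and segment-level timestamp files.
```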
There is a caveat: NFA requires use of an ASR model, so if you are running out of memory when trying transcription, you will also run out of memory when generating timestamps for the segments. However, there are more memory-efficient models you can use for both transcription and alignment. stt_en_conformer_ctc_small is a Conformer model, which is not memory-efficient; we suggest using a FastConformer model instead, e.g. stt_en_fastconformer_ctc_large or nvidia/parakeet-ctc-0.6b. You may also need to make sure you use local attention (e.g. after loading the model, run model.change_attention_model(self_attention_model="rel_pos_local_attn", att_context_size=[64, 64])), which will reduce memory consumption further.

If you use a FastConformer model as suggested, you will likely find that transcription works out of the box without needing any audio segment splitting.
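If it helps, here is a rough end-to-end sketch of that suggestion (model name as above; the audio path is a placeholder, and the exact return type of transcribe() varies slightly between NeMo versions):

```python
# Rough sketch: memory-efficient transcription of a long recording with a
# FastConformer CTC model. The audio path is a placeholder.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_ctc_large")

# Limited-context (local) attention bounds memory growth on long audio.
model.change_attention_model(
    self_attention_model="rel_pos_local_attn", att_context_size=[64, 64]
)

# Depending on the NeMo version, entries are plain strings or Hypothesis objects.
transcripts = model.transcribe(["long_recording.wav"])
print(transcripts[0])
```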