Hi team, I’m facing a very common issue: my input audio for transcription using stt_en_conformer_ctc_small is too long.
Searching the forum, the usual answer is something like this:
(…) workaround would be chopping your audio file into several shorter clips and running the program.
What’s the state-of-the-art technique you are using for splitting the audio into segments without losing any spoken words? Do you have a working example?
I can see there is an example like this: NeMo/tutorials/asr/Streaming_ASR.ipynb at stable · NVIDIA/NeMo · GitHub, but I’ll need to keep the timestamps because I’ll merge the transcription with the speaker diarization output.
Hi, to answer your question: in general, for splitting audio into segments you can use the NeMo Forced Aligner (NFA). There is some information on it here (NeMo/tools/nemo_forced_aligner at main · NVIDIA/NeMo · GitHub) and a tutorial here (Google Colab).
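For reference, NFA is run as a script over a NeMo-style manifest. The snippet below is only a rough sketch, assuming a local NeMo checkout; the file paths and the example transcript are placeholders, and forced alignment needs a "text" field to align against (check the NFA README for the exact options your NeMo version supports).

```python
# Rough sketch (not verified against your NeMo version): build a one-line
# NeMo manifest and point the NFA script at it. Paths are placeholders.
import json

entry = {
    "audio_filepath": "long_recording.wav",          # placeholder audio path
    "text": "reference transcript for this audio",   # NFA aligns the audio to this text
}
with open("manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")

# Then, from a shell (NeMo/ is a local checkout of the repo):
#   python NeMo/tools/nemo_forced_aligner/align.py \
#       pretrained_name="stt_en_fastconformer_ctc_large" \
#       manifest_filepath="manifest.json" \
#       output_dir="nfa_output/"
# The output directory will contain word- and segment-level timestamp files.
```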
There is a caveat: NFA requires use of an ASR model, so if you are running out of memory when trying transcription, you will also run out of memory when generating timestamps for the segments. However, there are more memory-efficient models you can use for both transcription and alignment. stt_en_conformer_ctc_small is a Conformer model, which is not memory-efficient; we suggest using a FastConformer model instead, e.g. stt_en_fastconformer_ctc_large or nvidia/parakeet-ctc-0.6b. You may also need to make sure you use local attention (e.g. after loading the model, run model.change_attention_model(self_attention_model="rel_pos_local_attn", att_context_size=[64, 64])), which will reduce memory consumption further.

If you use a FastConformer model as suggested, you will likely find that transcription works out of the box without needing any audio segment splitting.
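If it helps, here is a rough end-to-end sketch of that suggestion (model name as above; the audio path is a placeholder, and the exact return type of transcribe() varies slightly between NeMo versions):

```python
# Rough sketch: memory-efficient transcription of a long recording with a
# FastConformer CTC model. The audio path is a placeholder.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_ctc_large")

# Limited-context (local) attention bounds memory growth on long audio.
model.change_attention_model(
    self_attention_model="rel_pos_local_attn", att_context_size=[64, 64]
)

# Depending on the NeMo version, entries are plain strings or Hypothesis objects.
transcripts = model.transcribe(["long_recording.wav"])
print(transcripts[0])
```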