Please provide the following information when requesting support.
Hardware - GPU (L4)
Hardware - CPU - 8 vCPUs
Operating System - Linux/Ubuntu
Riva Version - 2.19
Hello. I noticed lower accuracy from the whisper-large-v3-turbo model when processing long audio files (2-3 minutes on average) compared to processing the same files cut into smaller chunks (under 30 seconds). In my case the difference in accuracy is about 10%. From the logs and the transcription I can see that Whisper processes the input audio in 30-second chunks. In some cases the cut falls in the middle of a long German word, and Whisper then transcribes it incorrectly because it only receives part of the word; the following 30-second segment also misses 3-4 words at its beginning. Is this expected behaviour? If so, how can it be avoided, other than submitting shorter segments? If not, what could be causing this behaviour in my case? I used this configuration for the build:
riva-build speech_recognition <rmir_filename>:<key> \
<riva_file>:<key> \
--offline \
--name=whisper-large-v3-turbo-multi-asr-offline \
--return_separate_utterances=True \
--unified_acoustic_model \
--chunk_size 30 \
--left_padding_size 0 \
--right_padding_size 0 \
--decoder_type trtllm \
--feature_extractor_type torch \
--torch_feature_type whisper \
--featurizer.norm_per_feature false \
--max_batch_size 8 \
--featurizer.precalc_norm_params False \
--featurizer.max_batch_size=8 \
--featurizer.max_execution_batch_size=8 \
--language_code=en,zh,de,es,ru,ko,fr,ja,pt,tr,pl,ca,nl,ar,sv,it,id,hi,fi,vi,he,uk,el,ms,cs,ro,da,hu,ta,no,th,ur,hr,bg,lt,la,mi,ml,cy,sk,te,fa,lv,bn,sr,az,sl,kn,et,mk,br,eu,is,hy,ne,mn,bs,kk,sq,sw,gl,mr,pa,si,km,sn,yo,so,af,oc,ka,be,tg,sd,gu,am,yi,lo,uz,fo,ht,ps,tk,nn,mt,sa,lb,my,bo,tl,mg,as,tt,haw,ln,ha,ba,jw,su,yue,multi
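For now I work around this by pre-splitting the audio at low-energy (silent) points before sending it, so no chunk boundary lands inside a word. Below is a minimal sketch of that splitting step in plain Python; it is not Riva-specific, it assumes 16-bit mono PCM WAV input, and the function names plus the 28 s chunk limit and 5 s search window are my own choices, not anything from the Riva API.

```python
# Workaround sketch: split long audio at the quietest point near each
# 30 s boundary, so Whisper never sees a word cut in half.
# Assumes 16-bit mono PCM; frame/window sizes below are guesses.
import array
import wave

def rms(frame):
    """Root-mean-square energy of a frame of 16-bit samples."""
    if not frame:
        return 0.0
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def find_cut_points(samples, sample_rate, max_chunk_s=28.0, frame_ms=20):
    """Return sample indices where the audio should be split.

    Within each max_chunk_s window, the cut is placed at the start of
    the quietest frame in the last few seconds of the window, so the
    split falls in a pause rather than inside a word.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    max_chunk = int(sample_rate * max_chunk_s)
    search_back = int(sample_rate * 5)  # look for silence in the last 5 s
    cuts = []
    start = 0
    while len(samples) - start > max_chunk:
        hi = start + max_chunk
        lo = max(start + frame_len, hi - search_back)
        quietest, best_energy = hi, float("inf")
        for pos in range(lo, hi, frame_len):
            e = rms(samples[pos:pos + frame_len])
            if e < best_energy:
                best_energy, quietest = e, pos
        cuts.append(quietest)
        start = quietest
    return cuts

def split_wav(path, **kwargs):
    """Yield (sample_rate, samples) chunks of a 16-bit mono WAV file."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        rate = wf.getframerate()
        samples = array.array("h", wf.readframes(wf.getnframes()))
    cuts = find_cut_points(samples, rate, **kwargs)
    bounds = [0] + cuts + [len(samples)]
    for a, b in zip(bounds, bounds[1:]):
        yield rate, samples[a:b]
```

Each yielded chunk is then written back to a WAV buffer and submitted as its own offline recognition request. This recovers most of the accuracy gap for me, but I would prefer the server to handle the chunk boundaries itself (e.g. via the padding options above) if that is possible.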