Please provide the following information when requesting support.
Hardware - GPU (L4)
Hardware - CPU - 8 vCPUs
Operating System - Linux/Ubuntu
Riva Version - 2.19
Hello. I noticed lower accuracy from the whisper-large-v3-turbo model when processing long audio files (2-3 minutes on average) compared to processing the same files cut into smaller chunks (under 30 seconds). In my case the difference in accuracy is about 10%. From the logs and the transcription I can see that Whisper processes the input audio in 30-second chunks. In some cases the cut falls in the middle of a long German word, and Whisper then transcribes it incorrectly because it only receives part of the word; the following 30-second segment also misses 3-4 words at its beginning. Is this expected behaviour? If so, how can it be avoided, other than submitting shorter segments? If not, what could be causing this behaviour in my case? I used this configuration for the build:
riva-build speech_recognition <rmir_filename>:<key> \
<riva_file>:<key> \
--offline \
--name=whisper-large-v3-turbo-multi-asr-offline \
--return_separate_utterances=True \
--unified_acoustic_model \
--chunk_size 30 \
--left_padding_size 0 \
--right_padding_size 0 \
--decoder_type trtllm \
--feature_extractor_type torch \
--torch_feature_type whisper \
--featurizer.norm_per_feature false \
--max_batch_size 8 \
--featurizer.precalc_norm_params False \
--featurizer.max_batch_size=8 \
--featurizer.max_execution_batch_size=8 \
--language_code=en,zh,de,es,ru,ko,fr,ja,pt,tr,pl,ca,nl,ar,sv,it,id,hi,fi,vi,he,uk,el,ms,cs,ro,da,hu,ta,no,th,ur,hr,bg,lt,la,mi,ml,cy,sk,te,fa,lv,bn,sr,az,sl,kn,et,mk,br,eu,is,hy,ne,mn,bs,kk,sq,sw,gl,mr,pa,si,km,sn,yo,so,af,oc,ka,be,tg,sd,gu,am,yi,lo,uz,fo,ht,ps,tk,nn,mt,sa,lb,my,bo,tl,mg,as,tt,haw,ln,ha,ba,jw,su,yue,multi
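For now I work around this by pre-splitting the audio at low-energy (silent) points before sending it, so no chunk boundary lands inside a word. Below is a minimal sketch of that splitting step in plain Python; it is not Riva-specific, it assumes 16-bit mono PCM WAV input, and the function names plus the 28 s chunk limit and 5 s search window are my own choices, not anything from the Riva API.

```python
# Workaround sketch: split long audio at the quietest point near each
# 30 s boundary, so Whisper never sees a word cut in half.
# Assumes 16-bit mono PCM; frame/window sizes below are guesses.
import array
import wave

def rms(frame):
    """Root-mean-square energy of a frame of 16-bit samples."""
    if not frame:
        return 0.0
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def find_cut_points(samples, sample_rate, max_chunk_s=28.0, frame_ms=20):
    """Return sample indices where the audio should be split.

    Within each max_chunk_s window, the cut is placed at the start of
    the quietest frame in the last few seconds of the window, so the
    split falls in a pause rather than inside a word.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    max_chunk = int(sample_rate * max_chunk_s)
    search_back = int(sample_rate * 5)  # look for silence in the last 5 s
    cuts = []
    start = 0
    while len(samples) - start > max_chunk:
        hi = start + max_chunk
        lo = max(start + frame_len, hi - search_back)
        quietest, best_energy = hi, float("inf")
        for pos in range(lo, hi, frame_len):
            e = rms(samples[pos:pos + frame_len])
            if e < best_energy:
                best_energy, quietest = e, pos
        cuts.append(quietest)
        start = quietest
    return cuts

def split_wav(path, **kwargs):
    """Yield (sample_rate, samples) chunks of a 16-bit mono WAV file."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        rate = wf.getframerate()
        samples = array.array("h", wf.readframes(wf.getnframes()))
    cuts = find_cut_points(samples, rate, **kwargs)
    bounds = [0] + cuts + [len(samples)]
    for a, b in zip(bounds, bounds[1:]):
        yield rate, samples[a:b]
```

Each yielded chunk is then written back to a WAV buffer and submitted as its own offline recognition request. This recovers most of the accuracy gap for me, but I would prefer the server to handle the chunk boundaries itself (e.g. via the padding options above) if that is possible.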