Issues with Speaker Diarization in Riva ASR when using the Whisper and Conformer-CTC models

Hardware - GPU (A2)
Riva Version: 2.19.0

I am deploying the Whisper, Conformer-CTC, and Speaker Diarization models using the Riva SDK. However, I am not receiving any output from the diarizer. I am using riva-quickstart-2.19.0 with the following deployment configurations:

  • Deploy Whisper model:
riva-build speech_recognition /data/rmir/whisper_large_v3.rmir:tlt_encode  \
  /data/riva_model/whisper_large_v3.riva:tlt_encode \
  --offline  \
  --name=whisper-large-v3-turbo-multi-asr-offline \
  --return_separate_utterances=True \
  --unified_acoustic_model  \
  --chunk_size 30 \
  --left_padding_size 0 \
  --right_padding_size 0  \
  --decoder_type trtllm  \
  --feature_extractor_type torch \
  --torch_feature_type whisper \
  --featurizer.norm_per_feature false \
  --max_batch_size 8 \
  --featurizer.precalc_norm_params False  \
  --featurizer.max_batch_size=8 \
  --featurizer.max_execution_batch_size=8 \
  --language_code=en,zh,de,es,ru,ko,fr,ja,pt,tr,pl,ca,nl,ar,sv,it,id,hi,fi,vi,he,uk,el,ms,cs,ro,da,hu,ta,no,th,ur,hr,bg,lt,la,mi,ml,cy,sk,te,fa,lv,bn,sr,az,sl,kn,et,mk,br,eu,is,hy,ne,mn,bs,kk,sq,sw,gl,mr,pa,si,km,sn,yo,so,af,oc,ka,be,tg,sd,gu,am,yi,lo,uz,fo,ht,ps,tk,nn,mt,sa,lb,my,bo,tl,mg,as,tt,haw,ln,ha,ba,jw,su,yue,multi
  • Deploy Diarizer model:
riva-build diarizer \
  /data/rmir/diarizer.rmir:tlt_encode \
  /data/riva_model/vad_multilingual_marblenet_v1.10.0.riva:tlt_encode \
  /data/riva_model/titanet_small_v1.0.0.riva:tlt_encode \
  --vad_type=neural \
  --diarizer_backend.offline \
  --diarizer_backend.optimization_graph_level=-1 \
  --embedding_extractor_nn.max_batch_size=32 \
  --embedding_extractor_nn.use_onnx_runtime \
  --embedding_extractor_nn.optimization_graph_level=-1 \
  --clustering_backend.max_batch_size=0 \
  --chunk_size=300 \
  --audio_sec_limit=4001 \
  --diarizer_backend.language_code=generic
  • Inference code:
import wave
import grpc
import riva.client

def load_audio(path):
    """Read a WAV file; return raw PCM frames, sample rate, and channel count."""
    with wave.open(path, 'rb') as wf:
        sample_rate = wf.getframerate()
        channels = wf.getnchannels()
        audio = wf.readframes(wf.getnframes())
    return audio, sample_rate, channels

def main():
    auth = riva.client.Auth(uri='localhost:8005')
    riva_asr = riva.client.ASRService(auth)

    path = "audio.wav"
    content, sr, channels = load_audio(path)
    with open(path, 'rb') as f:
        content = f.read()

    config = riva.client.RecognitionConfig(
        encoding=riva.client.AudioEncoding.LINEAR_PCM,
        sample_rate_hertz=sr,
        audio_channel_count=channels,
        language_code="multi",
        max_alternatives=1,
        enable_automatic_punctuation=False,
        enable_word_time_offsets=True,
    )
    riva.client.add_custom_configuration_to_config(config, 'enable_vad_endpointing:true')
    riva.client.add_speaker_diarization_to_config(
        config,
        diarization_enable=True,
        diarization_max_speakers=2
    )

    try:
        response = riva_asr.offline_recognize(content, config)
    except grpc.RpcError as e:
        print("RPC Error:", e.details())
        return

    print("\nASR Transcript with Speaker Diarization: ", response)
    for result in response.results:
        for word in result.alternatives[0].words:
            print(f"[SPK{word.speaker_tag}] {word.word}", end=' ')
    print()

if __name__ == "__main__":
    main()
  • Despite enabling speaker diarization, I do not receive any speaker tags in the output. Docker logs show the following:
I0704 07:15:03.842489   556 grpc_riva_asr.cc:685] ASRService.Recognize called.
I0704 07:15:03.843019  2081 grpc_riva_asr.cc:905] ASRService.Recognize diarization called.
I0704 07:15:03.843521  2081 riva_asr_stream.cc:226] Detected format: encoding = 1 numchannels = 1 samplerate = 22050 bitspersample = 16
W0704 07:15:03.843725  2081 grpc_riva_asr.cc:1055] Could not get parameter append_space_to_transcripts from model riva-diarizer. A space will be added after utterances by default.
I0704 07:15:03.843734  2081 grpc_riva_asr.cc:1060] Using model riva-diarizer from Triton localhost:8001 for diarization inference
I0704 07:15:03.843801  2080 grpc_riva_asr.cc:854] Using model whisper-large-v3-turbo-multi-asr-offline-asr-bls-ensemble from Triton localhost:8001 for inference
I0704 07:15:03.892876  2083 grpc_riva_asr.cc:1143] Creating resampler, audio file sample rate=22050 model sample_rate=16000
I0704 07:15:06.496675  2081 grpc_riva_asr.cc:1102] ASRService.Recognize diarization returning OK
I0704 07:15:06.497761   556 stats_builder.h:100] {"specversion":"1.0","type":"riva.asr.recognize.v1","source":"","subject":"","id":"58009794-6bea-480b-ae21-b21ad8213fc4","datacontenttype":"application/json","time":"2025-07-04T07:15:03.842467682+00:00","data":{"release_version":"2.19.0","customer_uuid":"","ngc_org":"","ngc_team":"","ngc_org_team":"","container_uuid":"","language_code":"multi","request_count":1,"audio_duration":30.0,"speech_duration":0.0,"status":0,"err_msg":""}}
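
A quick way to confirm that the response genuinely carries no speaker tags (rather than the print loop hiding them) is to collapse the word-level speaker_tag values into speaker turns, since the log above shows the diarization request itself returning OK. This is a minimal sketch reusing the response object from the script above; summarize_speakers is a hypothetical helper, and treating an all-zero tag set as "no diarizer output" is an assumption based on the proto3 default value of 0:

def summarize_speakers(response):
    # Collapse word-level speaker tags into contiguous speaker turns.
    turns = []  # list of (speaker_tag, [words]) pairs
    for result in response.results:
        for word in result.alternatives[0].words:
            if turns and turns[-1][0] == word.speaker_tag:
                turns[-1][1].append(word.word)
            else:
                turns.append((word.speaker_tag, [word.word]))
    return turns

turns = summarize_speakers(response)
if not turns or all(tag == 0 for tag, _ in turns):
    # All tags at the proto3 default suggests the diarizer attached nothing.
    print("No speaker tags in the response - the diarizer produced no output.")
else:
    for tag, words in turns:
        print(f"SPK{tag}: {' '.join(words)}")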

“Currently, Sortformer speaker diarization is supported only with the Parakeet-CTC and Conformer-CTC ASR models in streaming mode.” Check out the documentation: How do I Use Speaker Diarization with Riva ASR? — NVIDIA Riva
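
The practical takeaway for the deployment above is that Whisper plus the diarizer is not a supported combination; as far as the docs indicate, diarization pairs with the Parakeet-CTC and Conformer-CTC models. Below is a minimal sketch of the same offline request pointed at a Conformer-CTC deployment. The model name conformer-en-US-asr-offline and the 16 kHz mono input file are assumptions; substitute whatever your own riva-build actually produced:

import riva.client

auth = riva.client.Auth(uri='localhost:8005')
asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=16000,
    language_code="en-US",
    max_alternatives=1,
    enable_word_time_offsets=True,
    model="conformer-en-US-asr-offline",  # assumed model name; check your deployment
)
riva.client.add_speaker_diarization_to_config(
    config,
    diarization_enable=True,
    diarization_max_speakers=2,
)

# Send the raw WAV bytes, as in the original script.
with open("audio_16k_mono.wav", "rb") as f:
    response = asr.offline_recognize(f.read(), config)

for result in response.results:
    for word in result.alternatives[0].words:
        print(f"[SPK{word.speaker_tag}] {word.word}", end=" ")
print()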

In the upcoming release, streaming mode will be supported with speaker diarization (SD).
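
For reference, once streaming speaker diarization is available, the request would presumably be shaped like the sketch below, based on the current riva.client API. Whether your build accepts it depends on the release noted above, and the model name parakeet-en-US-asr-streaming and the chunk size are assumptions:

import riva.client

auth = riva.client.Auth(uri='localhost:8005')
asr = riva.client.ASRService(auth)

inner = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_word_time_offsets=True,
    model="parakeet-en-US-asr-streaming",  # assumed model name
)
riva.client.add_speaker_diarization_to_config(
    inner, diarization_enable=True, diarization_max_speakers=2
)
streaming_config = riva.client.StreamingRecognitionConfig(
    config=inner, interim_results=False
)

# Stream the file in ~0.3 s chunks (4800 frames at 16 kHz).
audio_chunks = riva.client.AudioChunkFileIterator("audio_16k_mono.wav", 4800)
for response in asr.streaming_response_generator(
    audio_chunks=audio_chunks, streaming_config=streaming_config
):
    for result in response.results:
        if result.is_final:
            for word in result.alternatives[0].words:
                print(f"[SPK{word.speaker_tag}] {word.word}", end=" ")
print()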
