Riva v2.19 speaker diarization issue

Please provide the following information when requesting support.

Hardware - GPU (H100 HGX & L40S)
Hardware - CPU (AMD Genoa)
Operating System - Ubuntu 22.04 LTS
Riva Version - both v2.16 and v2.19
TLT Version (if relevant) N/A
How to reproduce the issue ? (This is for errors. Please share the command and the detailed log here)

In short, my code (based on the examples from the nvidia-riva/python-clients repo) works well with ASR/NMT/TTS via the Riva v2.16 container from NGC. But speaker diarization, modeled after the code in "How do I Use Speaker Diarization with Riva ASR?" in the NVIDIA Riva docs, runs into the following issues.

The Riva v2.16 container was pulled from NGC. I then launched the v2.19 container via the riva_quickstart package, which worked fine, but the v2.19 container has the identical issue.

File "/home/user/miniconda3/envs/riva/lib/python3.9/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "Error: Unavailable diarizer model requested given these parameters: pipeline_type=diarizer; type=offline; "
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Error: Unavailable diarizer model requested given these parameters: pipeline_type=diarizer; type=offline; ", grpc_status:3, created_time:"2025-04-01T10:06:44.808750416-07:00"}"

On the Riva v2.19 docker container:

I0401 17:06:44.767925 518 grpc_riva_asr.cc:685] ASRService.Recognize called.
I0401 17:06:44.768723 130334 grpc_riva_asr.cc:854] Using model parakeet-1.1b-en-US-asr-offline-asr-bls-ensemble from Triton localhost:8001 for inference
I0401 17:06:44.768980 130335 grpc_riva_asr.cc:905] ASRService.Recognize diarization called.
I0401 17:06:44.769042 130335 riva_asr_stream.cc:226] Detected format: encoding = 1 RAW numchannels = 1 samplerate = 16000 bitspersample = 16
E0401 17:06:44.769136 130335 grpc_riva_asr.cc:957] Error: Unavailable diarizer model requested given these parameters: pipeline_type=diarizer; type=offline;
I0401 17:06:44.800679 518 stats_builder.h:100] {"specversion":"1.0","type":"riva.asr.recognize.v1","source":"","subject":"","id":"52b126d6-0e8f-4efc-8048-c40c01ae8d09","datacontenttype":"application/json","time":"2025-04-01T17:06:44.767895393+00:00","data":{"release_version":"2.19.0","customer_uuid":"","ngc_org":"","ngc_team":"","ngc_org_team":"","container_uuid":"","language_code":"en-US","request_count":1,"audio_duration":0.0,"speech_duration":0.0,"status":3,"err_msg":"Error: Unavailable diarizer model requested given these parameters: pipeline_type=diarizer; type=offline; "}}

My code (which works well before adding the speaker diarization line):

In my Flask web app, I use a webRTC JS library to capture voice, convert it to WAV format, and send it to the Riva gRPC API in chunks. The chunks are valid WAV files.
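As an aside, the chunking step itself is just byte slicing; here is a minimal sketch (not Riva-specific, and the function name is my own; it assumes 16-bit mono PCM, where 1600 samples at 16 kHz is 100 ms of audio):

```python
def pcm_chunks(pcm_bytes: bytes, chunk_size_samples: int = 1600):
    """Yield fixed-size chunks of raw 16-bit mono PCM.

    Each chunk is chunk_size_samples * 2 bytes (int16 samples);
    the final chunk may be shorter.
    """
    step = chunk_size_samples * 2  # 2 bytes per 16-bit sample
    for offset in range(0, len(pcm_bytes), step):
        yield pcm_bytes[offset:offset + step]
```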

class ASRManager:
    # def __init__(self, riva_host="172.30.1.76:51051", sample_rate=16000, chunk_size=1600):
    def __init__(self, riva_host="172.30.1.79:50051", sample_rate=16000, chunk_size=1600):
        self.sample_rate = sample_rate
        self.chunk_size = chunk_size
        self.auth = Auth()
        self.auth.channel = grpc.insecure_channel(riva_host)
        self.asr_service = ASRService(self.auth)

        self.recognition_config = RecognitionConfig(
            language_code="en-US",
            max_alternatives=1,
            profanity_filter=False,
            enable_automatic_punctuation=True,
            encoding=AudioEncoding.LINEAR_PCM,
            sample_rate_hertz=sample_rate
        )
        add_speaker_diarization_to_config(self.recognition_config, diarization_enable=True, diarization_max_speakers=3)

    def offline_recognize_chunk(self, audio: AudioSegment) -> str:
        samples = audio.set_channels(1).set_frame_rate(16000).get_array_of_samples()
        byte_content = samples.tobytes()
        response = self.asr_service.offline_recognize(byte_content, self.recognition_config)
        if response.results and response.results[0].alternatives:
            return response.results[0].alternatives[0].transcript
        return ""

Update: the NVIDIA team pointed me to a link, which I have now gone through:

First, I modified config.sh to enable diarization support, then ran config.sh and started the Riva v2.19 container.

asr_acoustic_model=("parakeet_1.1b")
asr_accessory_model=("diarizer")

Then I needed to use streaming_response_generator() to get speaker diarization working.

def transcribe_stream(self, audio_generator, callback):
    config = StreamingRecognitionConfig(
        config=RecognitionConfig(
            language_code="en-US",
            encoding=AudioEncoding.LINEAR_PCM,
            sample_rate_hertz=44100,
            max_alternatives=1,
            enable_automatic_punctuation=True,
            enable_word_time_offsets=True,
        ),
        interim_results=False,
        # single_utterance=False
    )

    # ✅ Add speaker diarization support
    add_speaker_diarization_to_config(
        config.config, 
        diarization_enable=True,
        diarization_max_speakers=10)

    responses = self.asr_service.streaming_response_generator(
        audio_chunks=audio_generator,
        streaming_config=config
    )
    print("#####responses")
    print(responses)
    for response in responses:
        print("#####response")
        print(response)
        if response.results:
            words = response.results[0].alternatives[0].words
            transcript_with_speakers = " ".join(
                [f"[S{w.speaker_tag}] {w.word}" for w in words]
            )
            callback(transcript_with_speakers, is_final=True)
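For anyone wiring this up, the audio_generator argument can be built from a WAV file with the stdlib wave module. This is a sketch under my own assumptions (function name and chunk size are placeholders); the file's sample rate must match sample_rate_hertz in the StreamingRecognitionConfig:

```python
import wave

def wav_chunks(path: str, frames_per_chunk: int = 1600):
    """Yield raw PCM chunks from a WAV file for streaming recognition.

    Each chunk holds frames_per_chunk frames (100 ms at 16 kHz);
    the final chunk may be shorter.
    """
    with wave.open(path, "rb") as wf:
        while True:
            data = wf.readframes(frames_per_chunk)
            if not data:
                break
            yield data
```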

Now I need to figure out how to improve speaker identification. The out-of-the-box performance does not identify speakers correctly:

[S0] I [S0] could [S0] just [S0] get [S0] it.

[S0] Poor

[S0] Thank [S0] you.

[S0] Okay.

[S0] You [S0] need [S0] to [S0] go

[S0] Do [S0] it

[S0] Never

[S0] One

[S0] Okay.

[S0] You [S0] know.

[S0] You [S0] know.

[S0] Yeah.

[S0] I [S0] think [S0] the

[S0] Can [S0] you [S0] call [S0] in?
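Separately from the accuracy problem, the per-word tags above can be collapsed into per-speaker segments with a small post-processing step. A sketch (the Word namedtuple here is only a stand-in for the word objects in response.results[0].alternatives[0].words, which expose .word and .speaker_tag):

```python
from collections import namedtuple
from itertools import groupby

# Stand-in for the word objects returned by Riva responses.
Word = namedtuple("Word", ["word", "speaker_tag"])

def group_by_speaker(words):
    """Collapse consecutive same-speaker words into (tag, text) segments."""
    return [
        (tag, " ".join(w.word for w in run))
        for tag, run in groupby(words, key=lambda w: w.speaker_tag)
    ]
```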

Hi @playwithai, were you able to get improved speaker identification? If not, are you able to share the input audio so we can triage and check the results on our side? Thanks