Greetings,
Problem Description
I’m experiencing an issue with Nvidia Riva ASR where the server receives audio data via gRPC and detects the audio duration, but it fails to recognize any speech and always returns an empty transcript to the client.
Below is a log snippet from the Riva server:
grpc_riva_asr.cc:685] ASRService.Recognize called.
grpc_riva_asr.cc:854] Using model conformer-en-US-asr-offline-asr-bls-esemble from Triton localhost:8001 for inference
status_builder.h:100] {"specversion":"1.0", "type": "riva.asr.recognize.v1", "source":"","subject":"", "id": "df8df5d5-72bb-4ff11-b918-f19b0p477712", "datacontenttype":"application/son","time":"2025-02-21T11:30:40.061173945+00:00","data":{"release_version":"2.18.0","customer_uuid":"","ngc_org":"","ngc_team":"","ngc_org_team":"","container_uuid":"","language_code":"en-US","request count":1,"audio_duration":4.000805421311011, "speech_duration":"0.0, "status":0,"err msg":""}}
The key issue here is that speech_duration remains 0.0, indicating that the server does not detect any actual speech in the provided audio.
Troubleshooting Steps Taken
To identify the cause, I performed the following diagnostics:
1. Verified Audio Data Integrity
- The recorded voice was converted into signed 16-bit PCM (RAW buffer), as required by Riva’s Linear PCM audio encoding.
- The byte array was converted back to audio and played successfully, confirming that the data was not corrupted.
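Successful playback does not by itself rule out a byte-order, sample-rate, or channel-count mismatch, so a numeric check of the raw buffer may be worth adding here. The following is a minimal sketch using only the Python standard library. The function name pcm16_stats is hypothetical, and the 16000 Hz mono defaults are assumptions, since the sample rate and channel count used by the client are not stated above.

```python
import array
import math
import sys


def pcm16_stats(raw: bytes, sample_rate_hz: int = 16000, channels: int = 1) -> dict:
    """Summarise a headerless signed 16-bit PCM buffer under both byte orders.

    sample_rate_hz and channels are assumptions; pass the values the client
    actually declares in its recognition config.
    """
    if len(raw) % 2:
        raise ValueError("buffer length is not a multiple of 2 bytes")

    # Interpret the bytes in the machine's native order, then make a swapped copy.
    native = array.array("h")
    native.frombytes(raw)
    swapped = array.array("h", native)
    swapped.byteswap()

    # Label the two interpretations by their real byte order.
    if sys.byteorder == "little":
        little, big = native, swapped
    else:
        little, big = swapped, native

    def rms(samples) -> float:
        if not samples:
            return 0.0
        return math.sqrt(sum(s * s for s in samples) / len(samples))

    frames = len(native) // channels
    return {
        "duration_s": frames / sample_rate_hz,
        "peak_le": max((abs(s) for s in little), default=0),
        "rms_le": rms(little),
        "rms_be": rms(big),
    }
```

If duration_s, computed with the sample rate the client intends, differs noticeably from the audio_duration of about 4.0 s in the server log, the declared sample rate or channel count may not match the data. An rms_le value that is near zero, or far larger than rms_be, would similarly suggest silence or a byte-order problem.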
2. Adjusted VAD (Voice Activity Detection) Settings
- Increased stop_history and stop_history_eou to 2000 ms.
- Lowered stop_threshold and stop_threshold_eou to 0.1, ensuring that Riva does not prematurely terminate recognition.
3. Confirmed Riva’s General Functionality
- Tested Riva ASR locally by transcribing a sample WAV file; the speech was correctly recognized.
- Tested the remote client audio data with both Recognize() and StreamingRecognize() APIs—neither produced a valid transcript.
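A further check that could separate the two cases above is to wrap the remote client's raw buffer in a WAV container and run it through the same local test that recognized the sample WAV file. The sketch below uses Python's standard wave module. The function name write_pcm16_as_wav is hypothetical, and the 16000 Hz mono defaults are assumptions rather than values confirmed for this setup. WAV stores 16-bit PCM as little-endian, so the buffer is assumed to be little-endian as well.

```python
import wave


def write_pcm16_as_wav(raw: bytes, path: str,
                       sample_rate_hz: int = 16000, channels: int = 1) -> float:
    """Wrap headerless little-endian signed 16-bit PCM in a WAV file.

    Returns the clip duration in seconds. sample_rate_hz and channels are
    assumptions and should match what the client captured.
    """
    bytes_per_frame = 2 * channels  # 2 bytes per 16-bit sample
    if len(raw) % bytes_per_frame:
        raise ValueError("buffer does not contain a whole number of frames")

    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(raw)  # header sizes are finalised when the file closes

    return len(raw) / bytes_per_frame / sample_rate_hz
```

If the resulting file is also transcribed as empty by the local test, the cause is more likely the audio content or its declared parameters than the remote gRPC path.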
4. Verified gRPC Data Transmission
- Successfully tested gRPC communication by using Riva TTS.
- The synthesized voice was generated, transmitted via gRPC, and played correctly on the client side, confirming that the gRPC pathway is functional.
System Information
- GPU: Nvidia A10
- Operating System: Ubuntu
- Riva Version: riva_quickstart_v2.18.0
Request for Assistance
Given that:
- The audio data is properly formatted as 16-bit signed PCM.
- The VAD settings are adjusted to avoid premature endpointing.
- Riva ASR works fine with local samples, but fails with audio sent from the remote client, through both Recognize() and StreamingRecognize().
- gRPC communication is functioning as confirmed via Riva TTS.
What could be causing Riva ASR to fail in recognizing speech from the remote client? Are there additional debugging steps or configurations I should check?
Any insights would be greatly appreciated!