Riva ASR Not Recognizing Speech (Empty Transcript)

Greetings,

Problem Description
I’m experiencing an issue with NVIDIA Riva ASR: the server receives audio data via gRPC and reports the audio duration, but it never recognizes any speech and always returns an empty transcript to the client.

Below is a log snippet from the Riva server:

grpc_riva_asr.cc:685] ASRService.Recognize called.
grpc_riva_asr.cc:854] Using model conformer-en-US-asr-offline-asr-bls-ensemble from Triton localhost:8001 for inference
status_builder.h:100] {"specversion":"1.0", "type":"riva.asr.recognize.v1", "source":"", "subject":"", "id":"df8df5d5-72bb-4ff11-b918-f19b0p477712", "datacontenttype":"application/json", "time":"2025-02-21T11:30:40.061173945+00:00", "data":{"release_version":"2.18.0", "customer_uuid":"", "ngc_org":"", "ngc_team":"", "ngc_org_team":"", "container_uuid":"", "language_code":"en-US", "request_count":1, "audio_duration":4.000805421311011, "speech_duration":"0.0", "status":0, "err_msg":""}}

The key issue here is that speech_duration remains 0.0, indicating that the server does not detect any actual speech in the provided audio.

Troubleshooting Steps Taken

To identify the cause, I performed the following diagnostics:

1. Verified Audio Data Integrity

  • The recorded voice was converted into signed 16-bit PCM (RAW buffer), as required by Riva’s Linear PCM audio encoding.
  • The byte array was converted back to audio and played successfully, confirming that the data was not corrupted.
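The round-trip check above can be reproduced with plain Python. This is a minimal sketch using only the standard library; the function names are my own, and it assumes a little-endian host (which matches the byte order Riva's LINEAR_PCM encoding expects):

```python
import array

def floats_to_pcm16(samples):
    """Convert float samples in [-1.0, 1.0] to signed 16-bit PCM bytes.
    array('h') uses native byte order, i.e. little-endian on x86."""
    ints = array.array("h", (max(-32768, min(32767, int(s * 32767))) for s in samples))
    return ints.tobytes()

def pcm16_to_floats(raw):
    """Convert a signed 16-bit PCM byte buffer back to float samples."""
    ints = array.array("h")
    ints.frombytes(raw)
    return [i / 32767 for i in ints]

# Round-trip check: decoding the raw buffer should recover the samples
# (to within 16-bit quantization error), confirming the data is intact.
original = [0.0, 0.25, -0.5, 1.0, -1.0]
raw = floats_to_pcm16(original)
decoded = pcm16_to_floats(raw)
assert all(abs(a - b) < 1e-3 for a, b in zip(original, decoded))
assert len(raw) == 2 * len(original)  # two bytes per sample
```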

2. Adjusted VAD (Voice Activity Detection) Settings

  • Increased stop_history and stop_history_eou to 2000 ms.
  • Lowered stop_threshold and stop_threshold_eou to 0.1, ensuring that Riva does not prematurely terminate recognition.
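To illustrate what these two parameters control, here is a toy endpointer. This is not Riva's actual VAD implementation (which runs server-side); it is only a sketch of the stop_history/stop_threshold semantics: end-of-utterance fires once the per-frame speech probability has stayed below stop_threshold for stop_history milliseconds.

```python
def end_of_utterance_frame(frame_probs, stop_threshold=0.1, stop_history_ms=2000,
                           frame_ms=40):
    """Toy endpointer (illustrative only, not Riva's implementation).

    Returns the index of the frame at which end-of-utterance would be
    declared, i.e. once the speech probability has stayed below
    stop_threshold for stop_history_ms worth of consecutive frames,
    or None if the utterance never ends.
    """
    frames_needed = stop_history_ms // frame_ms
    silent_run = 0
    for i, prob in enumerate(frame_probs):
        silent_run = silent_run + 1 if prob < stop_threshold else 0
        if silent_run >= frames_needed:
            return i
    return None

# With a long stop_history and a low stop_threshold, quiet-but-not-silent
# audio no longer ends the utterance prematurely.
probs = [0.9] * 10 + [0.3] * 100   # clear speech, then low-confidence frames
assert end_of_utterance_frame(probs, stop_threshold=0.5) == 59   # endpoints early
assert end_of_utterance_frame(probs, stop_threshold=0.1) is None # keeps listening
```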

3. Confirmed Riva’s General Functionality

  • Tested local Riva ASR by transcribing a sample WAV file—the speech was correctly recognized.
  • Tested the remote client audio data with both Recognize() and StreamingRecognize() APIs—neither produced a valid transcript.

4. Verified gRPC Data Transmission

  • Successfully tested gRPC communication by using Riva TTS.
  • The synthesized voice was generated, transmitted via gRPC, and played correctly on the client side, confirming that the gRPC pathway is functional.

System Information

  • GPU: Nvidia A10
  • Operating System: Ubuntu
  • Riva Version: riva_quickstart_v2.18.0

Request for Assistance

Given that:

  1. The audio data is properly formatted as 16-bit signed PCM.
  2. The VAD settings are adjusted to avoid premature endpointing.
  3. Riva ASR works fine with local samples, but fails with streamed data.
  4. gRPC communication is functioning as confirmed via Riva TTS.

What could be causing Riva ASR to fail in recognizing speech from the remote client? Are there additional debugging steps or configurations I should check?

Any insights would be greatly appreciated!


I’m having the same problem with an identical configuration.

Hi @Benutzer1925, could you please share the server logs?
Also the config, the model used, a sample audio file, and the repro steps?
This will help us with the debugging process.

Thanks

Greetings,

I am pleased to share that the issue has been resolved. After comparing my client against a simplified version based on the official GitHub examples, I discovered that the problem stemmed from my client failing to properly extract the transcript from the response.

It turns out that the server had been sending valid transcripts all along. I had either misinterpreted the speech_duration field (assuming it represented the duration of detected speech in my audio), or the logging of this value was not functioning as expected.

This conclusion is supported by comparing the server logs and the data exchanged between the server and both clients. When the defective client and the working client sent the exact same audio data, both triggered identical server log entries, such as:

"audio_duration":4.000805421311011, "speech_duration":"0.0", "status":0, "err_msg":""

While the defective client failed to display the transcript from the valid response, the working client successfully printed the full transcript without any issues.

After correcting the defective client, it now transmits audio data and receives transcripts as expected.
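For anyone hitting the same symptom: in the official client examples, the transcript of an offline Recognize() response is read from results[].alternatives[].transcript. The sketch below mimics that response shape with plain Python dataclasses (the real objects are protobuf messages from the Riva client; only the field layout used here is assumed):

```python
from dataclasses import dataclass, field
from typing import List

# Stand-ins for the protobuf messages returned by Riva's Recognize() call.
# Only the field layout used below is assumed from the official examples.
@dataclass
class SpeechRecognitionAlternative:
    transcript: str
    confidence: float = 0.0

@dataclass
class SpeechRecognitionResult:
    alternatives: List[SpeechRecognitionAlternative] = field(default_factory=list)

@dataclass
class RecognizeResponse:
    results: List[SpeechRecognitionResult] = field(default_factory=list)

def extract_transcript(response: RecognizeResponse) -> str:
    """Concatenate the top alternative of every result; an empty string here
    means the client-side extraction failed, not necessarily the server."""
    return "".join(
        result.alternatives[0].transcript
        for result in response.results
        if result.alternatives
    )

response = RecognizeResponse(results=[
    SpeechRecognitionResult(alternatives=[
        SpeechRecognitionAlternative(transcript="hello world", confidence=0.95),
    ]),
])
assert extract_transcript(response) == "hello world"
```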

Thank you for your time and support.