RIVA Conformer ASR Arabic does not provide diacritics

Description

I used NVIDIA RIVA to run offline ASR on Arabic conversations, with the goal of getting transcriptions that include diacritics. Although RIVA Conformer ASR Arabic claims to provide “diacritics along with spaces”, this is not the case in practice: the resulting transcriptions contain Arabic text without diacritics.

RIVA Conformer ASR Arabic model version: nvidia/riva/speechtotext_ar_ar_conformer:deployable_v3.0_export_v2
RIVA image: nvcr.io/nvidia/riva/riva-speech:2.18.0

How to reproduce

I tried two approaches to get transcriptions with Arabic diacritics, both without success.

First approach (Pretrained ASR Model)

The first approach follows the steps of the RIVA guide and uses the Pretrained ASR Models for Arabic (see the Arabic (ar-AR) entry). The only essential change needed was in config.sh:

service_enabled_asr=true
service_enabled_nlp=false
service_enabled_tts=false
service_enabled_nmt=true
<...>
asr_language_code=("ar-AR")

Although this approach went smoothly, the resulting transcriptions contain Arabic text without diacritics. Some usage examples:

riva_asr_client  --list_models
'ar-AR': 'conformer-ar-AR-asr-offline-asr-bls-ensemble'
riva_asr_client --audio_file=/opt/riva/wav/ar-AR_sample.wav --language_code=ar-AR  --print_transcripts=true --automatic_punctuation=true --verbatim_transcripts=true

Loading eval dataset...
filename: /opt/riva/wav/ar-AR_sample.wav
Done loading 1 files
-----------------------------------------------------------
File: /opt/riva/wav/ar-AR_sample.wav

Final transcripts: 
0 : هل بإمكانك أن تعطيني المزيد من القهوة من فضلك؟ 

Word                                    Start (ms)      End (ms)        Confidence      
هل                                    640             800             4.9614e-01      
بإمكانك                          880             1480            4.0296e-01      
أن                                    1640            1680            8.0560e-01      
تعطيني                            1720            2160            9.4022e-01      
<...>
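To check a transcript for diacritics programmatically rather than by eye, a minimal Python sketch follows. It is not part of RIVA: the helper names `has_diacritics` and `strip_diacritics` are hypothetical, and the set of code points (U+064B to U+0652 plus the superscript alef U+0670) is an assumption about which marks count as diacritics for this purpose.

```python
# Assumption: "diacritics" means the Arabic combining marks U+064B..U+0652
# (tanween, short vowels, shadda, sukun) plus the superscript alef U+0670.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def has_diacritics(text: str) -> bool:
    """Return True if the text contains at least one Arabic diacritic mark."""
    return any(ch in ARABIC_DIACRITICS for ch in text)


def strip_diacritics(text: str) -> str:
    """Return the text with all Arabic diacritic marks removed."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


if __name__ == "__main__":
    # Final transcript copied from the riva_asr_client output above.
    transcript = "هل بإمكانك أن تعطيني المزيد من القهوة من فضلك؟"
    print("contains diacritics:", has_diacritics(transcript))
```

Note that letters such as أ and إ carry a hamza as part of the base letter and are not counted as diacritics by this check.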

Second approach (Build and Deploy ASR Model)

Given that RIVA Conformer ASR Arabic provides only the .riva file, the second approach builds and deploys the model in order to obtain the .rmir file that is used later for inference. By following the instructions for assembling the command line in Pipeline Configuration and downloading the respective additional data (wfst_tokenizer_model, wfst_verbalizer_model, decoding_language_model_binary, decoding_vocab), I could successfully generate the .rmir file (the additional data can be retrieved manually from Pretrained ASR Models for Arabic language).

Next, I repeated the procedure from the quick start guide (riva_clean.sh, riva_init.sh, riva_start.sh, riva_start_client.sh), again with no success (I excluded the model download part from riva_init.sh since I had generated the .rmir manually). The output was identical to the one shown earlier.

riva-build command line:

riva-build speech_recognition \
  <rmir_filename>:<key> \
  <riva_file>:<key> \
  --offline \
  --name=conformer-ar-AR-asr-offline \
  --return_separate_utterances=True \
  --featurizer.use_utterance_norm_params=False \
  --featurizer.precalc_norm_time_steps=0 \
  --featurizer.precalc_norm_params=False \
  --ms_per_timestep=40 \
  --endpointing.start_history=200 \
  --nn.fp16_needs_obey_precision_pass \
  --endpointing.residue_blanks_at_start=-2 \
  --chunk_size=4.8 \
  --left_padding_size=1.6 \
  --right_padding_size=1.6 \
  --max_batch_size=16 \
  --featurizer.max_batch_size=512 \
  --featurizer.max_execution_batch_size=512 \
  --decoder_type=flashlight \
  --decoding_language_model_binary=<bin_file> \
  --decoding_vocab=<txt_file> \
  --flashlight_decoder.lm_weight=0.7 \
  --flashlight_decoder.word_insertion_score=0.75 \
  --flashlight_decoder.beam_threshold=20. \
  --language_code=ar-AR \
  --wfst_tokenizer_model=<far_tokenizer_file> \
  --wfst_verbalizer_model=<far_verbalizer_file>

Hardware specs

Tesla V100-PCIE-32GB
Driver Version: 560.35.03      CUDA Version: 12.6
RIVA version: 2.18.0

Intel(R) Xeon(R) Gold 6278C CPU @ 2.60GHz

Did anyone observe the same issue?

@jkh As for diacritics, the Riva Arabic ASR model supports them, but not every utterance is transcribed with full diacritics; it depends on context. If the audio is in Modern Standard Arabic or dialectal speech, the model provides only partial diacritics, where the context is ambiguous or where diacritics aid meaning, such as shaddah or tanween. If the speech is Quran, the model produces fully diacritized transcripts with harakat. So the answer is yes, diacritics are supported as a feature by the model, but the output is context dependent.
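To see whether a given transcript falls on the "partial" or the "fully diacritized" side, one option is to count the marks by type. The following is only a sketch under stated assumptions: the function name `diacritic_profile`, the grouping of code points into tanween, harakat, shadda and sukun, and the marks-per-letter ratio are all illustrative choices, not something the model or RIVA reports.

```python
from collections import Counter

# Assumed grouping of Arabic combining marks (Unicode code points).
TANWEEN = {"\u064B", "\u064C", "\u064D"}  # fathatan, dammatan, kasratan
HARAKAT = {"\u064E", "\u064F", "\u0650"}  # fatha, damma, kasra
SHADDA = {"\u0651"}
SUKUN = {"\u0652"}


def diacritic_profile(text: str) -> dict:
    """Count Arabic letters and diacritic marks, grouped by mark type."""
    counts = Counter()
    letters = 0
    for ch in text:
        if ch in TANWEEN:
            counts["tanween"] += 1
        elif ch in HARAKAT:
            counts["harakat"] += 1
        elif ch in SHADDA:
            counts["shadda"] += 1
        elif ch in SUKUN:
            counts["sukun"] += 1
        elif "\u0621" <= ch <= "\u064A" and ch != "\u0640":
            # Base letters of the Arabic block, skipping tatweel (U+0640).
            letters += 1
    marks = sum(counts.values())
    return {
        "letters": letters,
        "marks": marks,
        "marks_per_letter": marks / letters if letters else 0.0,
        "by_type": dict(counts),
    }


if __name__ == "__main__":
    # The word "kitabun" written with kasra, fatha and dammatan.
    sample = "\u0643\u0650\u062a\u064e\u0627\u0628\u064c"
    print(diacritic_profile(sample))
```

A ratio near zero with only shadda or tanween entries would match the partial behaviour described above, while a high ratio dominated by harakat would suggest full diacritization; where to draw the line is left open here.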


Hi @ealbasiri, and thank you for your answer. I found some Quran archives that I can try soon. Is there another useful audio source that I can test? What other Arabic audio categories do you have in mind?

@jkh you can try any of the available Arabic benchmark datasets. There is a recent publication on the Open Universal Arabic ASR Leaderboard, a continuous benchmark project for open-source general Arabic ASR models across various multi-dialect datasets. Their results show that Riva Arabic ASR ranks first on the leaderboard.

