Description
I am using NVIDIA RIVA to run offline ASR on Arabic conversations, with the goal of getting transcriptions that include diacritics. Although RIVA Conformer ASR Arabic claims to provide “diacritics along with spaces”, this is not the case in practice: the resulting transcriptions contain Arabic text without diacritics.
RIVA Conformer ASR Arabic model version: nvidia/riva/speechtotext_ar_ar_conformer:deployable_v3.0_export_v2
RIVA image: nvcr.io/nvidia/riva/riva-speech:2.18.0
How to reproduce
I tried two approaches to get transcriptions with Arabic diacritics, both without success.
First approach (Pretrained ASR Model)
The first approach follows the steps of the RIVA guide and uses the Pretrained ASR Models for the Arabic language (see the Arabic (ar-AR) entry). The only essential change needed was in config.sh:
service_enabled_asr=true
service_enabled_nlp=false
service_enabled_tts=false
service_enabled_nmt=true
<...>
asr_language_code=("ar-AR")
Although this approach went smoothly, the resulting transcriptions contain Arabic text without diacritics. Some usage examples:
riva_asr_client --list_models
'ar-AR': 'conformer-ar-AR-asr-offline-asr-bls-ensemble'
riva_asr_client --audio_file=/opt/riva/wav/ar-AR_sample.wav --language_code=ar-AR --print_transcripts=true --automatic_punctuation=true --verbatim_transcripts=true
Loading eval dataset...
filename: /opt/riva/wav/ar-AR_sample.wav
Done loading 1 files
-----------------------------------------------------------
File: /opt/riva/wav/ar-AR_sample.wav
Final transcripts:
0 : هل بإمكانك أن تعطيني المزيد من القهوة من فضلك؟
Word Start (ms) End (ms) Confidence
هل 640 800 4.9614e-01
بإمكانك 880 1480 4.0296e-01
أن 1640 1680 8.0560e-01
تعطيني 1720 2160 9.4022e-01
<...>
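One way to confirm the absence of diacritics independently of how a terminal renders the text is to count the Arabic combining marks in the transcript. The following is a minimal Python sketch and not part of RIVA; the helper name `count_tashkeel` and the chosen code point set (U+064B to U+0652 plus U+0670) are assumptions about what counts as a diacritic here.

```python
# Arabic diacritics (tashkeel) assumed here: fathatan..sukun (U+064B-U+0652)
# plus the superscript alef (U+0670). This set is an assumption, not a RIVA definition.
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def count_tashkeel(text: str) -> int:
    """Return the number of diacritic code points found in text."""
    return sum(1 for ch in text if ch in TASHKEEL)


# Transcript returned by riva_asr_client for ar-AR_sample.wav (copied from the output above).
riva_transcript = "هل بإمكانك أن تعطيني المزيد من القهوة من فضلك؟"

# Hand-written diacritized form of the first word, for contrast only.
diacritized_word = "\u0647\u064E\u0644\u0652"  # heh + fatha, lam + sukun

print("RIVA transcript:", count_tashkeel(riva_transcript))
print("Diacritized word:", count_tashkeel(diacritized_word))
```

A count of zero for the RIVA transcript, against a non-zero count for the hand-diacritized word, would indicate that the marks are missing from the returned string itself rather than lost in display.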
Second approach (Build and Deploy ASR Model)
Given that RIVA Conformer ASR Arabic provides only the .riva file, the second attempt involves building and deploying the model to obtain the .rmir file that can later be used for inference. By following the instructions for assembling the command line in Pipeline Configuration and downloading the respective additional data (wfst_tokenizer_model, wfst_verbalizer_model, decoding_language_model_binary, decoding_vocab), I could successfully generate the .rmir file (the additional data can be manually retrieved from Pretrained ASR Models for Arabic language).
Next, I repeated the procedure from the quick start guide (riva_clean.sh, riva_init.sh, riva_start.sh, riva_start_client.sh), excluding the model download part from riva_init.sh since I had generated the .rmir manually. This did not help: the output was identical to what I described earlier.
riva-build command line:
riva-build speech_recognition \
<rmir_filename>:<key> \
<riva_file>:<key> \
--offline \
--name=conformer-ar-AR-asr-offline \
--return_separate_utterances=True \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=40 \
--endpointing.start_history=200 \
--nn.fp16_needs_obey_precision_pass \
--endpointing.residue_blanks_at_start=-2 \
--chunk_size=4.8 \
--left_padding_size=1.6 \
--right_padding_size=1.6 \
--max_batch_size=16 \
--featurizer.max_batch_size=512 \
--featurizer.max_execution_batch_size=512 \
--decoder_type=flashlight \
--decoding_language_model_binary=<bin_file> \
--decoding_vocab=<txt_file> \
--flashlight_decoder.lm_weight=0.7 \
--flashlight_decoder.word_insertion_score=0.75 \
--flashlight_decoder.beam_threshold=20. \
--language_code=ar-AR \
--wfst_tokenizer_model=<far_tokenizer_file> \
--wfst_verbalizer_model=<far_verbalizer_file>
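To check the rebuilt model without going through riva_asr_client, the same request can be sent from Python. This is only a sketch under assumptions: it presumes the nvidia-riva-client package is installed and that the server listens on localhost:50051 (neither is stated in this post), and the function names `has_tashkeel` and `transcribe` are hypothetical.

```python
def has_tashkeel(text: str) -> bool:
    """True if text contains a mark in U+064B..U+0652 (assumed diacritic range)."""
    return any("\u064b" <= ch <= "\u0652" for ch in text)


def transcribe(wav_path: str, uri: str = "localhost:50051") -> str:
    """Send one file for offline recognition and return the top transcript.

    The URI is an assumed default; adjust it to the actual deployment.
    """
    # Imported lazily so has_tashkeel() stays usable without the client package.
    import riva.client

    auth = riva.client.Auth(uri=uri)
    asr = riva.client.ASRService(auth)
    config = riva.client.RecognitionConfig(
        language_code="ar-AR",
        max_alternatives=1,
        enable_automatic_punctuation=True,
        verbatim_transcripts=True,
    )
    with open(wav_path, "rb") as f:
        response = asr.offline_recognize(f.read(), config)
    # Join the best alternative of each returned utterance.
    return " ".join(r.alternatives[0].transcript.strip() for r in response.results)
```

Given a local copy of the sample file, `has_tashkeel(transcribe(path_to_sample))` would be expected to return False for both deployments if the behaviour described in this post holds.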
Hardware specs
Tesla V100-PCIE-32GB
Driver Version: 560.35.03 CUDA Version: 12.6
RIVA version: 2.18.0
Intel(R) Xeon(R) Gold 6278C CPU @ 2.60GHz
Has anyone observed the same issue?