I fine-tuned the “speechtotext_english_citrinet_1024.tlt” model with the TAO Toolkit for the en-US language code, then exported it as “asr-model.riva” in RIVA format, again using the TAO Toolkit.
[NeMo I 2022-06-28 13:04:47 export:75] Experiment logs saved to '/results'
[NeMo I 2022-06-28 13:04:47 export:76] Exported model to '/results/asr-model.riva'
[NeMo I 2022-06-28 13:04:48 export:83] Exported model is compliant with Riva
To deploy the model, I followed the speech-to-text-deployment.ipynb instructions. To convert the .riva file to .rmir:
!docker run --rm --gpus all -v $MODEL_LOC:/data nvcr.io/nvidia/riva/riva-speech:2.2.1-servicemaker -- \
riva-build speech_recognition -f /data/asr.rmir:$KEY /data/asr-model.riva:$KEY --offline \
--decoder_type=greedy
2022-06-28 15:44:06,590 [WARNING] Property 'binary' is deprecated. Please use the callback system instead.
2022-06-28 15:44:10,781 [INFO] Packing binaries for nn/ONNX : {'onnx': ('nemo.collections.asr.models.ctc_bpe_models.EncDecCTCModelBPE', 'model_graph.onnx')}
2022-06-28 15:44:10,781 [INFO] Copying onnx:model_graph.onnx -> nn:nn-model_graph.onnx
2022-06-28 15:44:23,547 [INFO] Packing binaries for lm_decoder/ONNX : {'vocab_file': '/tmp/tmpno0x5ykt/riva_decoder_vocabulary.txt', 'tokenizer_model': ('nemo.collections.asr.models.ctc_bpe_models.EncDecCTCModelBPE', 'tokenizer.model')}
2022-06-28 15:44:23,547 [INFO] Copying vocab_file:/tmp/tmpno0x5ykt/riva_decoder_vocabulary.txt -> lm_decoder:lm_decoder-riva_decoder_vocabulary.txt
2022-06-28 15:44:23,547 [INFO] Copying tokenizer_model:tokenizer.model -> lm_decoder:lm_decoder-tokenizer.model
2022-06-28 15:44:23,548 [INFO] Packing binaries for rescorer/ONNX : {'vocab_file': '/tmp/tmpno0x5ykt/riva_decoder_vocabulary.txt'}
2022-06-28 15:44:23,548 [INFO] Copying vocab_file:/tmp/tmpno0x5ykt/riva_decoder_vocabulary.txt -> rescorer:rescorer-riva_decoder_vocabulary.txt
2022-06-28 15:44:23,549 [INFO] Packing binaries for vad/ONNX : {'vocab_file': '/tmp/tmpno0x5ykt/riva_decoder_vocabulary.txt'}
2022-06-28 15:44:23,549 [INFO] Copying vocab_file:/tmp/tmpno0x5ykt/riva_decoder_vocabulary.txt -> vad:vad-riva_decoder_vocabulary.txt
2022-06-28 15:44:23,549 [INFO] Saving to /data/asr.rmir
Then, to deploy it, I first edited config.sh in riva_quickstart_v2.2.1, setting the riva_model_loc parameter to the defined $MODEL_LOC path and the use_existing_rmirs flag to true. I also manually copied asr.rmir to $riva_model_loc/rmir.
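For reference, the relevant config.sh edits look like this (variable names as in the quickstart; the path is a placeholder for my $MODEL_LOC):

```shell
# config.sh in riva_quickstart_v2.2.1 (relevant lines only)

# Point the quickstart at the directory that already holds rmir/asr.rmir
riva_model_loc="/path/to/MODEL_LOC"   # placeholder; set to your $MODEL_LOC

# Skip rebuilding and use the RMIR already copied into $riva_model_loc/rmir
use_existing_rmirs=true
```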
After that, I ran riva_init.sh to deploy the asr.rmir model:
2022-06-28 15:47:17,434 [INFO] Writing Riva model repository to '/data/models'...
2022-06-28 15:47:17,434 [INFO] The riva model repo target directory is /data/models
2022-06-28 15:47:24,667 [INFO] Using tensorrt with fp16
2022-06-28 15:47:24,667 [INFO] Extract_binaries for nn -> /data/models/riva-trt-riva-asr-am-streaming-offline/1
2022-06-28 15:47:24,667 [INFO] extracting {'onnx': ('nemo.collections.asr.models.ctc_bpe_models.EncDecCTCModelBPE', 'model_graph.onnx')} -> /data/models/riva-trt-riva-asr-am-streaming-offline/1
2022-06-28 15:47:34,475 [INFO] Printing copied artifacts:
2022-06-28 15:47:34,475 [INFO] {'onnx': '/data/models/riva-trt-riva-asr-am-streaming-offline/1/model_graph.onnx'}
2022-06-28 15:47:34,476 [INFO] Building TRT engine from ONNX file
[W] colored module is not installed, will not use colors when logging. To enable colors, please install the colored module: python3 -m pip install colored
[W] 'Shape tensor cast elision' routine failed with: None
[libprotobuf WARNING /home/jenkins/agent/workspace/OSS/OSS_L0_MergeRequest/oss/build/third_party.protobuf/src/third_party.protobuf/src/google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 564516907
[06/28/2022-15:47:41] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[06/28/2022-15:48:24] [TRT] [W] Output type must be INT32 for shape outputs
[06/28/2022-15:48:24] [TRT] [W] Output type must be INT32 for shape outputs
.
.
.
[06/28/2022-15:48:37] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are:
[06/28/2022-15:48:37] [TRT] [W] (# 0 (SHAPE audio_signal))
[06/28/2022-15:48:37] [TRT] [W] (# 0 (SHAPE length))
[06/28/2022-15:48:47] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are:
[06/28/2022-15:48:47] [TRT] [W] (# 0 (SHAPE audio_signal))
[06/28/2022-15:48:47] [TRT] [W] (# 0 (SHAPE length))
.
.
.
2022-06-28 16:04:46,759 [INFO] Writing engine to model repository: /data/models/riva-trt-riva-asr-am-streaming-offline/1/model.plan
2022-06-28 16:04:47,076 [INFO] Extract_binaries for featurizer -> /data/models/riva-asr-feature-extractor-streaming-offline/1
2022-06-28 16:04:47,078 [INFO] Extract_binaries for vad -> /data/models/riva-asr-voice-activity-detector-ctc-streaming-offline/1
2022-06-28 16:04:47,078 [INFO] extracting {'vocab_file': '/tmp/tmpno0x5ykt/riva_decoder_vocabulary.txt'} -> /data/models/riva-asr-voice-activity-detector-ctc-streaming-offline/1
2022-06-28 16:04:47,079 [INFO] Extract_binaries for lm_decoder -> /data/models/riva-asr-ctc-decoder-cpu-streaming-offline/1
2022-06-28 16:04:47,080 [INFO] extracting {'vocab_file': '/tmp/tmpno0x5ykt/riva_decoder_vocabulary.txt', 'tokenizer_model': ('nemo.collections.asr.models.ctc_bpe_models.EncDecCTCModelBPE', 'tokenizer.model')} -> /data/models/riva-asr-ctc-decoder-cpu-streaming-offline/1
2022-06-28 16:04:47,080 [INFO] {'vocab_file': '/data/models/riva-asr-ctc-decoder-cpu-streaming-offline/1/riva_decoder_vocabulary.txt', 'tokenizer_model': '/data/models/riva-asr-ctc-decoder-cpu-streaming-offline/1/tokenizer.model'}
2022-06-28 16:04:47,081 [INFO] Extract_binaries for riva-asr -> /data/models/riva-asr/1
+ [[ amd64 == \a\r\m\6\4 ]]
+ echo
+ echo 'Riva initialization complete. Run ./riva_start.sh to launch services.'
Riva initialization complete. Run ./riva_start.sh to launch services.
As can be seen, the model is around 570 MB, the whole process takes 17–18 minutes, and it completes without any error. BUT the deployed models in data/models:
total 20K
drwxr-xr-x 3 root root 4.0K Jun 28 16:04 riva-asr
drwxr-xr-x 3 root root 4.0K Jun 28 16:04 riva-asr-ctc-decoder-cpu-streaming-offline
drwxr-xr-x 3 root root 4.0K Jun 28 16:04 riva-asr-feature-extractor-streaming-offline
drwxr-xr-x 3 root root 4.0K Jun 28 16:04 riva-asr-voice-activity-detector-ctc-streaming-offline
drwxr-xr-x 3 root root 4.0K Jun 28 16:04 riva-trt-riva-asr-am-streaming-offline
are almost empty.
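To quantify “almost empty”, here is a small stdlib-only sketch (my own diagnostic, not part of Riva) that sums the bytes under each model directory; after a successful deployment, riva-trt-riva-asr-am-streaming-offline should contain a model.plan of several hundred MB:

```python
import os

def repo_sizes(root):
    """Total on-disk bytes per top-level model directory under the repo."""
    sizes = {}
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if not os.path.isdir(path):
            continue
        total = 0
        for dirpath, _dirs, files in os.walk(path):
            for fname in files:
                total += os.path.getsize(os.path.join(dirpath, fname))
        sizes[name] = total
    return sizes

# Usage:
#   for name, nbytes in repo_sizes("/data/models").items():
#       print(f"{name}: {nbytes / 1e6:.1f} MB")
```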
By running riva_start.sh I got:
Starting Riva Speech Services. This may take several minutes depending on the number of models deployed.
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Riva server is ready...
which shows that all deployed models are loaded on the Riva server. BUT when I send a request to this server to transcribe the audio file below, by running this code:
import grpc
import wave

# proto stubs from the Riva Python client bindings, as in the quickstart examples
import riva_api.riva_asr_pb2 as rasr
import riva_api.riva_asr_pb2_grpc as rasr_srv

audio_file = "example.wav"
server = "localhost:50051"

wf = wave.open(audio_file, 'rb')  # not used below; the raw file bytes (WAV header included) are sent
with open(audio_file, 'rb') as fh:
    data = fh.read()

channel = grpc.insecure_channel(server)
client = rasr_srv.RivaSpeechRecognitionStub(channel)
config = rasr.RecognitionConfig(
    sample_rate_hertz=16000,
    language_code="en-US",
    max_alternatives=5,
    enable_automatic_punctuation=False,
    audio_channel_count=1,
)
request = rasr.RecognizeRequest(config=config, audio=data)
response = client.Recognize(request)
print(response)
it seems that the loaded models don’t run properly. In fact, audio files are often not transcribed at all, or the transcript is only a word or two long:
results {
alternatives {
transcript: "no "
confidence: 1.0
}
channel_tag: 1
audio_processed: 4.800000190734863
}
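Since audio_processed reports only ~4.8 s, one thing worth ruling out is a mismatch between the audio file and the RecognitionConfig. A minimal stdlib sketch (my own check, not part of the Riva client) comparing the WAV header against the values used above:

```python
import wave

def check_wav(path, expected_rate=16000, expected_channels=1):
    """Compare a WAV file's header against the RecognitionConfig values."""
    with wave.open(path, 'rb') as wf:
        info = {
            'sample_rate': wf.getframerate(),
            'channels': wf.getnchannels(),
            'sample_width_bytes': wf.getsampwidth(),
            'duration_s': wf.getnframes() / wf.getframerate(),
        }
    ok = (info['sample_rate'] == expected_rate
          and info['channels'] == expected_channels
          and info['sample_width_bytes'] == 2)  # 16-bit PCM expected
    return ok, info
```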
BUT when I test the fine-tuned model on this audio file using TAO, I get:
!tao speech_to_text_citrinet infer \
-e $SPECS_DIR/speech_to_text_citrinet/infer.yaml \
-g 1 \
-k $KEY \
-m $RESULTS_DIR/citrinet/finetune/checkpoints/finetuned-model.tlt \
-r $RESULTS_DIR/citrinet/infer \
file_paths=[$DATA_DIR/example.wav]
Test config :
manifest_filepath: null
sample_rate: 16000
batch_size: 32
shuffle: false
use_start_end_token: false
[NeMo I 2022-06-28 08:44:17 features:255] PADDING: 16
[NeMo I 2022-06-28 08:44:17 features:272] STFT using torch
Transcribing: 100%|███████████████████████████████| 1/1 [00:01<00:00, 1.59s/it]
[NeMo I 2022-06-28 08:44:31 infer:69] The prediction results:
[NeMo I 2022-06-28 08:44:31 infer:71] File: /data/example.wav
[NeMo I 2022-06-28 08:44:31 infer:72] Predicted transcript: tms naers and throat were clear
[NeMo I 2022-06-28 08:44:31 infer:75] Experiment logs saved to '/results/citrinet/infer'
2022-06-28 08:44:33,734 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
which shows that my model works correctly but is not properly deployed on the Riva server.
I need to know why this happens and what I should do to fix it.
Hardware: GPU T4
Ubuntu: 22.04
Riva Version: 2.2.0