Using Riva ASR as part of embedded control software on a Jetson Xavier

Operating System: Ubuntu 20.04

Riva Version: 2.14.0

Hardware: Jetson Xavier NX

JetPack 5.1

I have been testing the excellent ASR features of Riva using the transcribe_mic.py example and have some questions about potential deployment within our embedded system. Essentially, I would like to know whether Riva has too much CPU overhead to run alongside our existing control software, and how the Docker deployment would integrate with our services.

A short summary of my existing testing and requirements is below.

With only the service_enabled_asr=true flag enabled in the config.sh file, Riva uses the following models:

models_nlp_punctuation_bert_base_en_us_v2.14.0-tegra-xavier
models_asr_conformer_en_us_str_v2.14.0-tegra-xavier
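
For reference, here is roughly how the service flags are set in my config.sh (everything else is left at the quickstart defaults):

```sh
# config.sh excerpt - only ASR enabled, other service flags turned off
service_enabled_asr=true
service_enabled_nlp=false
service_enabled_tts=false
```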

I would like to use ASR from a microphone as part of a control mechanism for our embedded software running on a Jetson Xavier. I will not go into the details of our system for confidentiality reasons; however, there will be a small, specific command set, perhaps 50 commands consisting of command keywords and decimal values.

The transcripts will be fed into a C++ application, which will perform the command interpretation and execution. This application will not run in Docker. I have read the user guide, but it is not clear to me how to approach this task, so I have some questions (listed after the short sketch below).
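
To show the sort of interpreter I have in mind, here is a rough sketch that reads transcript lines on stdin and pulls out a keyword plus an optional decimal value. The keyword name is just a placeholder, and I realise real ASR output may need normalising first (e.g. spoken numbers converted to digits):

```cpp
// Rough sketch only: read one transcript per line and extract a command
// keyword plus an optional decimal value. "setpoint" is a placeholder name.
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream iss(line);
        std::string keyword;
        double value = 0.0;
        if (!(iss >> keyword)) {
            continue;  // skip empty lines
        }
        const bool has_value = static_cast<bool>(iss >> value);
        // Dispatch on the keyword; real command handling would go here.
        if (keyword == "setpoint" && has_value) {
            std::cout << "setpoint -> " << value << "\n";
        } else {
            std::cout << "unrecognised: " << line << "\n";
        }
    }
    return 0;
}
```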

  • How would a client application running inside Docker interact with an embedded application running outside Docker? (A rough sketch of the handoff I’m imagining follows this list.)
  • For my initial test, I downloaded over 15 GB of data into my Docker container, which requires an external drive on my Jetson Xavier. Is this necessary? It may or may not be an issue, but it would be useful to know whether this can be reduced for our case, i.e. a relatively small set of commands.
  • Similarly, executing riva_start.sh takes a few minutes to complete. This is not practical for our application, which typically takes around 30s to start. Is there a way to speed this up?
  • Is there a C/C++ ASR example that captures microphone input? I’m thinking it may be easier to pipe the output from one C++ application to another C++ application, rather than from a Python app. I noticed the riva_asr_client application, which takes an audio file as input, but there is no mention of microphone input.
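
On the first and last questions, here is the sort of handoff I’m imagining on our side: the ASR client inside the container forwards each final transcript as one newline-terminated line to a TCP port reachable from the host, and our C++ application listens for them. The port number and the line-based framing are placeholders I have made up:

```cpp
// Rough sketch of the host-side receiver. Assumes the ASR client inside the
// container sends each final transcript as a newline-terminated line to a
// TCP port reachable from the host (port 6000 is a made-up placeholder).
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdio>
#include <iostream>
#include <string>

int main() {
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (server_fd < 0) { perror("socket"); return 1; }

    int opt = 1;
    setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);  // bind address is a placeholder
    addr.sin_port = htons(6000);               // placeholder port

    if (bind(server_fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }
    listen(server_fd, 1);

    int client_fd = accept(server_fd, nullptr, nullptr);
    if (client_fd < 0) { perror("accept"); return 1; }

    // Accumulate bytes and hand off complete lines (one transcript per line).
    std::string buffer;
    char chunk[1024];
    ssize_t n;
    while ((n = read(client_fd, chunk, sizeof(chunk))) > 0) {
        buffer.append(chunk, static_cast<size_t>(n));
        std::string::size_type pos;
        while ((pos = buffer.find('\n')) != std::string::npos) {
            std::string transcript = buffer.substr(0, pos);
            buffer.erase(0, pos + 1);
            std::cout << "transcript: " << transcript << "\n";
            // Command interpretation (as in the earlier sketch) would go here.
        }
    }

    close(client_fd);
    close(server_fd);
    return 0;
}
```

Would something along these lines be a reasonable pattern, or is the intended approach for the external application to connect to the Riva gRPC port directly, without a separate client in the container at all?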

Apologies if these questions are vague and naïve; I’m new to this and at the evaluation stage. There doesn’t seem to be much mention of ASR from a microphone in the documentation.

When running the transcribe_mic.py example, I find the performance excellent and can see a paragraph of text accurately transcribed; however, I’m wondering whether the overhead is too much to run alongside our other high-performance embedded application.

Thanks for your advice.