[jetson-voice] ASR/NLP/TTS for Jetson

I can confirm there is a problem with the trained model.
I managed to run NeMo transcribe_speech.py on the Nano and it works fine so I’ll need to change the approach for training my model.

I'm not as lucky getting the model training running on the Nano, but I'll need to try a few more things. I know it is not ideal, but if I get to use the GPU it might make it worth it for transfer learning at least. I've done this before with the jetson-inference project.

Now I'm looking into updating num_workers for the dataloader to see if I can make it not time out :-)

Validation sanity check: 0it [00:00, ?it/s][NeMo W 2022-08-12 10:20:26 nemo_logging:349] /usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/data_loading.py:133: UserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
      f"The dataloader, {name}, does not have many workers which may be a bottleneck."

OK, gotcha - I believe the number of workers is set through the OmegaConf config structure when you create the NeMo trainer, but I'm not sure. Regardless, I don't think that should impact the actual convergence of the model, just the training speed perhaps.
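For reference, a minimal sketch of that override, assuming the usual train_ds/validation_ds layout of NeMo's ASR configs (the YAML filename here is a placeholder):

    from omegaconf import OmegaConf

    # load the model config and raise the dataloader worker count
    # before building the NeMo model/trainer
    cfg = OmegaConf.load('quartznet_15x5.yaml')   # placeholder config path
    cfg.model.train_ds.num_workers = 4            # 4 = CPU count on the Nano
    cfg.model.validation_ds.num_workers = 4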

hello @dusty_nv, I'm trying to run jetson-voice asr.py. AastaLLL helped me get the docker container running, but I receive this error when launching. I read through the forum and searched Google too but found nothing that could help. I tried my mic with a demo and everything works fine.

root@jarvis-desktop:/jetson-voice/examples# python3 asr.py
Namespace(debug=False, default_backend='tensorrt', global_config=None, list_devices=False, list_models=False, log_level='info', mic=None, model='quartznet', model_dir='/jetson-voice/data/networks', model_manifest='/jetson-voice/data/networks/manifest.json', profile=False, verbose=False, wav=None)
[NeMo W 2023-04-15 02:52:30 nemo_logging:349] /usr/local/lib/python3.6/dist-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)

################################################################################

WARNING, path does not exist: KALDI_ROOT=/mnt/matylda5/iveselyk/Tools/kaldi-trunk

(please add 'export KALDI_ROOT=<your_path>' in your $HOME/.profile)

(or run as: KALDI_ROOT=<your_path> python <your_script>.py)

################################################################################

[NeMo I 2023-04-15 02:52:30 features:264] PADDING: 0
[NeMo I 2023-04-15 02:52:30 features:281] STFT using torch
[NeMo W 2023-04-15 02:52:30 nemo_logging:349] /usr/local/lib/python3.6/dist-packages/nemo_toolkit-1.6.2-py3.6.egg/nemo/collections/asr/parts/preprocessing/features.py:314: FutureWarning: Pass sr=16000, n_fft=512 as keyword args. From version 0.10 passing these as positional arguments will result in an error
librosa.filters.mel(sample_rate, self.n_fft, n_mels=nfilt, fmin=lowfreq, fmax=highfreq), dtype=torch.float

[2023-04-15 02:52:31] resource.py:114 - loading model '/jetson-voice/data/networks/asr/quartznet-15x5_en/quartznet.onnx' with jetson_voice.backends.tensorrt.TRTModel
[2023-04-15 02:52:31] trt_model.py:41 - loading cached TensorRT engine from /jetson-voice/data/networks/asr/quartznet-15x5_en/quartznet.engine
[04/15/2023-02:52:35] [TRT] [I] [MemUsageChange] Init CUDA: CPU +225, GPU +0, now: CPU 333, GPU 3573 (MiB)
[04/15/2023-02:52:35] [TRT] [I] Loaded engine size: 42 MiB
[04/15/2023-02:52:37] [TRT] [V] Using cublas as a tactic source
[04/15/2023-02:52:37] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +158, GPU +207, now: CPU 535, GPU 3869 (MiB)
[04/15/2023-02:52:37] [TRT] [V] Using cuDNN as a tactic source
[04/15/2023-02:52:41] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +240, GPU +20, now: CPU 775, GPU 3889 (MiB)
[04/15/2023-02:52:41] [TRT] [V] Deserialization required 5362705 microseconds.
[04/15/2023-02:52:41] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +41, now: CPU 0, GPU 41 (MiB)
[04/15/2023-02:52:41] [TRT] [V] Using cublas as a tactic source
[04/15/2023-02:52:41] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +0, now: CPU 733, GPU 3847 (MiB)
[04/15/2023-02:52:41] [TRT] [V] Using cuDNN as a tactic source
[04/15/2023-02:52:41] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 733, GPU 3847 (MiB)
[04/15/2023-02:52:41] [TRT] [V] Total per-runner device persistent memory is 36948992
[04/15/2023-02:52:41] [TRT] [V] Total per-runner host persistent memory is 282384
[04/15/2023-02:52:41] [TRT] [V] Allocated activation device memory of size 1076224
[04/15/2023-02:52:41] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +36, now: CPU 0, GPU 77 (MiB)
[2023-04-15 02:52:41] trt_model.py:59 - loaded TensorRT engine from /jetson-voice/data/networks/asr/quartznet-15x5_en/quartznet.engine

binding 0 - 'audio_signal'
input: True
shape: (1, 64, -1)
dtype: DataType.FLOAT
size: -256
dynamic: True
profiles: [{'min': (1, 64, 10), 'opt': (1, 64, 150), 'max': (1, 64, 300)}]

binding 1 - 'logprobs'
input: False
shape: (1, -1, 29)
dtype: DataType.FLOAT
size: -116
dynamic: True
profiles:

[2023-04-15 02:52:42] ctc_beamsearch.py:51 - creating CTCBeamSearchDecoder
[2023-04-15 02:52:42] ctc_beamsearch.py:52 - {'add_punctuation': True,
'alpha': 0.7,
'beam_width': 32,
'beta': 0.0,
'cutoff_prob': 1.0,
'cutoff_top_n': 40,
'language_model': '/jetson-voice/data/networks/asr/quartznet-15x5_en/lm.bin',
'timestep_offset': 5,
'top_k': 3,
'type': 'beamsearch',
'vad_eos_duration': 0.65,
'word_threshold': -1000.0}
[2023-04-15 02:52:49] asr_engine.py:128 - CTC decoder type: 'beamsearch'
Traceback (most recent call last):
  File "asr.py", line 30, in <module>
    chunk_size=asr.chunk_size)
  File "/jetson-voice/jetson_voice/utils/audio.py", line 67, in AudioInput
    raise ValueError('either wav or mic argument must be specified')
ValueError: either wav or mic argument must be specified

KALDI_ROOT=<root/jarvis/desktop/jetson-voice/examples> python .py
no directory

@Eva01 you can ignore those warnings about Kaldi. To run asr.py, you need to specify either a --wav file or a --mic device ID to use, as shown here: https://github.com/dusty-nv/jetson-voice#automatic-speech-recognition-asr
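For reference, the two invocations look roughly like this (the WAV path is a placeholder, and asr.py --list-devices prints the available microphone IDs):

    python3 examples/asr.py --wav /path/to/audio.wav    # transcribe a WAV file
    python3 examples/asr.py --mic 11                    # stream from mic device ID 11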

thank you very much, it works! @dusty_nv, @linuxdev told me you know everything about AI, so I would like to take the chance to ask you: do you know how I can train it as an offline voice command? I saw it uses MatchboxNet and Google Speech Commands, but how can I change the commands and train it? I want to launch it from code-oss, and I want the code to have access to my GPIO for servo motors and other things.

Hi @Eva01, it is already doing speech recognition / speech commands offline (i.e. all the speech processing is done locally onboard the Nano using DNNs). To integrate your GPIO and other peripherals, you would modify the asr.py example or add the ASR to your own script - see the sketch below. You can start the container in --dev mode to make editing easier.
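Here is a minimal sketch of that, assuming the ASR/AudioInput usage from examples/asr.py and the Jetson.GPIO library; the mic device ID, GPIO pin, and keyword are placeholders:

    import Jetson.GPIO as GPIO
    from jetson_voice import ASR, AudioInput

    GPIO.setmode(GPIO.BOARD)
    GPIO.setup(33, GPIO.OUT)          # pin 33 is an arbitrary example pin

    asr = ASR('quartznet')
    stream = AudioInput(mic=11, sample_rate=asr.sample_rate,
                        chunk_size=asr.chunk_size)

    for samples in stream:
        for transcript in asr(samples):
            if transcript['end']:                   # end of utterance
                print(transcript['text'])
                if 'servo' in transcript['text'].lower():
                    GPIO.output(33, GPIO.HIGH)      # drive your servo circuit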

I haven't had to retrain/finetune the ASR models, but if you want to add your own speech commands to MatchboxNet, you can do it in NeMo: https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/Speech_Commands.ipynb#scrollTo=I62_LJzc-p2b
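Condensed, the flow in that notebook looks roughly like this; the pretrained checkpoint name, label set, and manifest path below are placeholders, so check the tutorial for the exact values:

    import pytorch_lightning as pl
    import nemo.collections.asr as nemo_asr
    from omegaconf import OmegaConf

    # start from a pretrained speech-command checkpoint (name is an assumption)
    model = nemo_asr.models.EncDecClassificationModel.from_pretrained(
        'commandrecognition_en_matchboxnet3x1x64_v2')

    # swap in your own command vocabulary
    model.change_labels(['servo', 'stop', 'unknown'])

    # point the model at a manifest of your recorded samples (placeholder path)
    model.setup_training_data(OmegaConf.create({
        'manifest_filepath': 'train_manifest.json',
        'sample_rate': 16000,
        'labels': ['servo', 'stop', 'unknown'],
        'batch_size': 32,
        'shuffle': True,
    }))

    trainer = pl.Trainer(gpus=1, max_epochs=10)   # old PL syntax, matching NeMo 1.6
    trainer.fit(model)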

hello @dusty_nv, sorry for the late reply and the spam. I read the documentation and I think I'm beginning to understand the process, but where do I download the voice data? For example, if I want "open servo", I need to save recordings of me saying "servo" using some Python file and download the dataset into one of the programs (I think it's nemo_export_onnx.py); after that I need to use nemo_export_onnx.py, train it with nemo_train_intent.py, and after the model is made, introduce it to asr.py???

I read in the documentation that a script to download the dataset is provided under the scripts sub-directory of the NeMo root directory, but I can't find such a directory or script.

The nemo_train_intent.py script trains a Transformer model (like BERT or DistilBERT) as an intent/slot classifier for NLP, which would typically be a good fit for what you are doing. However, if you're on Jetson Nano, you may not have enough memory to run ASR and NLP at the same time; you would have to try.

Since your commands are limited in scope, I would just use the stock ASR model and do basic string parsing / regex on its output to find the commands, like the sketch below. Then if you need to re-train models later, you can. I haven't trained my own ASR or speech command models - I think it's easier to just use the included ASR model (the full ASR, not speech commands) and then do your own NLP, or train an intent/slot classifier for it.
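Something like this would do for a small command set (the command names and patterns are made up for illustration):

    import re

    COMMANDS = {
        'open_servo':  re.compile(r'\b(open|start)\s+(the\s+)?servo\b'),
        'close_servo': re.compile(r'\b(close|stop)\s+(the\s+)?servo\b'),
    }

    def parse_command(text):
        """Return the first command whose pattern matches the transcript."""
        text = text.lower()
        for name, pattern in COMMANDS.items():
            if pattern.search(text):
                return name
        return None

    print(parse_command('please open the servo'))   # -> 'open_servo'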

thank you, I will do more research on this and open a new topic once I understand the programmer slang

I have one last question, and it's not about jetson-voice. I tried asr.py and I need to repeat myself many times when trying to say "jarvis", so I need to train it. In the meantime, until I get better and train my own ASR model, I would like to try another method: Picovoice Porcupine. But no matter what I tried, my access key doesn't work. I already trained the wake word and have it as a file, but I cannot open it because the file is an unknown type.

Hi @Eva01, sorry about that. If you are using Xavier/Orin, you could run the actual Riva ASR backend (which has better accuracy). I'm not familiar with Picovoice and haven't used it, so I would recommend contacting their support if you have trouble running/installing it or using their API keys.

thank you for the help :))

Is it possible to use a custom voice model with the jetson-voice TTS engine?

@sangeethagr2018 at this point I’d just recommend using Riva directly, and you can fine-tune your own speech models in NeMo and export them to Riva:

https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tutorials/tts-finetune-nemo.html
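For the export step, that tutorial goes through the nemo2riva tool; the gist is roughly this (the filenames are placeholders):

    pip install nemo2riva
    nemo2riva --out finetuned_model.riva finetuned_model.nemo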

In the jetson_voice_ros/asr.py source code, if the user specifies model:=matchboxnet, it raises the error message:
"jetson_voice_ros/asr node does not support ASR classification models"
Is this by design?

ros2 launch ./ros/launch/asr.launch.py model:=matchboxnet input_device:=11

In the source code, jetson-voice/ros/jetson_voice_ros/asr.py:

    # load the ASR model
    self.asr = ASR(self.model_name)
    self.get_logger().info(f"model '{self.model_name}' ready")

    if self.asr.classification:
        raise ValueError(f'jetson_voice_ros/asr node does not support ASR classification models')

@jenhungho yes, that ASR ROS node only supports transcription (audio-to-text), not classification (audio-to-class). You could make a ROS node that does that though - see the sketch below.
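A minimal sketch of such a node, assuming the jetson_voice ASR/AudioInput API from the examples; the topic name, mic device ID, and the result format (class label at index 0) are assumptions:

    import rclpy
    from rclpy.node import Node
    from std_msgs.msg import String
    from jetson_voice import ASR, AudioInput

    def main():
        rclpy.init()
        node = Node('voice_classification')
        pub = node.create_publisher(String, 'voice/command_class', 10)

        asr = ASR('matchboxnet')    # the classification model rejected by the stock node
        stream = AudioInput(mic=11, sample_rate=asr.sample_rate,
                            chunk_size=asr.chunk_size)

        for samples in stream:      # blocking audio capture loop
            results = asr(samples)
            if results is not None:
                pub.publish(String(data=str(results[0])))   # assumed: class label first

        node.destroy_node()
        rclpy.shutdown()

    if __name__ == '__main__':
        main()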


@dusty_nv Does the jetson-voice library also support C? I have found all the scripts in Python only. Please guide me if it is also officially available in C.

Thanks

@deepanshu.pandey it’s Python, and for JetPack 4. For JetPack 5, there are a number of new tutorials/libraries/containers available here:
