[jetson-voice] ASR/NLP/TTS for Jetson

jetson-voice is an ASR/NLP/TTS deep learning inference library for Jetson Nano, TX1/TX2, Xavier NX, and AGX Xavier. It supports Python and JetPack 4.4.1 or newer. The DNN models were trained with NeMo and deployed with TensorRT for optimized performance. All computation is performed using the onboard GPU.

Currently the following capabilities are included:

- Automatic Speech Recognition (ASR) with streaming transcription (QuartzNet)
- Command/keyword recognition (MatchboxNet)
- Voice Activity Detection (VAD, MarbleNet)
- Natural Language Processing (NLP), including joint intent/slot classification and question answering
- Text-to-Speech (TTS)

The NLP models use the DistilBERT transformer architecture for reduced memory usage and increased performance. For samples of the text-to-speech output, see the TTS Audio Samples section.

See the GitHub repo here: https://github.com/dusty-nv/jetson-voice


@dusty_nv

I have developed an NLU engine with TensorFlow 2 which we now use on our Hospital Intelligent Automation Server. I used MITIE for NER, but it is proving difficult to extract entities correctly when more than two entities are involved, and while training time for the NLU is short, MITIE training takes hours.

How well does the slot filling work with small amounts of training data? The intents and entities are quite specific to the network, e.g. "turn the DEVICE on in the ZONE", "tell me the MEASUREMENT in the ZONE", etc. I am definitely considering using NeMo for our next release. Do you think it would function well with around ten to twenty examples per intent?

EDIT: the forum removed the tags from my examples.

I’m not exactly sure about the “minimum” training data - for reference, an internal dataset we have has ~75 examples per intent. So you may or may not need more - you can test and add more as necessary. For my models I have also made some dataset-specific Python scripts that automatically generate the dataset based on common phrases/sayings that are permuted and combined.
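To give an idea of what I mean by permuting and combining, a generator like that can be sketched in a few lines of plain Python (the templates, slot values, and output format below are made-up illustrations, not from an actual dataset):

```python
# Sketch of a template-based dataset generator: permute slot values
# through utterance templates to produce (intent, utterance) examples.
# Templates and values here are illustrative, not from a real dataset.
from itertools import product

TEMPLATES = {
    "iot_device_on": ["turn the {device} on in the {zone}",
                      "switch on the {device} in the {zone}"],
    "iot_query":     ["tell me the {measurement} in the {zone}"],
}
VALUES = {
    "device": ["light", "lamp", "heater"],
    "zone": ["office", "bedroom"],
    "measurement": ["temperature", "humidity"],
}

def generate(templates, values):
    examples = []
    for intent, sentences in templates.items():
        for sent in sentences:
            # find which slot names this template actually uses
            slots = [v for v in values if "{" + v + "}" in sent]
            for combo in product(*(values[s] for s in slots)):
                filled = sent.format(**dict(zip(slots, combo)))
                examples.append((intent, filled))
    return examples

examples = generate(TEMPLATES, VALUES)
for intent, text in examples[:3]:
    print(intent, "|", text)
```

From there you can write the examples out in whatever dataset format your training pipeline expects.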

Training the intent/slot models with NeMo doesn’t take long on a PC/server with a discrete GPU, as you typically only need to train for a handful of epochs. To put it in perspective, it takes about 15 minutes (or less) to train on my laptop with a GeForce 1070M. I use the DistilBERT network because it’s significantly faster for inference while maintaining most of the accuracy of BERT-Base.


Hi Dusty, thanks for the update. I trained the NeMo intents and slots model for 100 epochs on a 1050 Ti (I have a 1080, but it is in a Windows machine) and it performed quite badly for entities; intents seem fine:

[NeMo I 2021-08-23 15:42:03 model:143] Query : set alarm for seven thirty am
[NeMo I 2021-08-23 15:42:03 model:144] Predicted Intent: alarm_set
[NeMo I 2021-08-23 15:42:03 model:145] Predicted Slots: O O O time time time
[NeMo I 2021-08-23 15:42:03 model:143] Query : lower volume by fifty percent
[NeMo I 2021-08-23 15:42:03 model:144] Predicted Intent: audio_volume_down
[NeMo I 2021-08-23 15:42:03 model:145] Predicted Slots: O O O O O
[NeMo I 2021-08-23 15:42:03 model:143] Query : what is my schedule for tomorrow
[NeMo I 2021-08-23 15:42:03 model:144] Predicted Intent: calendar_query
[NeMo I 2021-08-23 15:42:03 model:145] Predicted Slots: O O O O O date

I also tested some of the IoT samples, which also performed badly, but checking the dataset, the IoT slots are not labelled, which would explain it:

[NeMo I 2021-08-23 15:42:03 model:143] Query : switch the light on in the office
[NeMo I 2021-08-23 15:42:03 model:144] Predicted Intent: iot_hue_lighton
[NeMo I 2021-08-23 15:42:03 model:145] Predicted Slots: O O O O O O O
[NeMo I 2021-08-23 15:42:03 model:143] Query : will you turn the lamp off in the bedroom
[NeMo I 2021-08-23 15:42:03 model:144] Predicted Intent: iot_hue_lightoff
[NeMo I 2021-08-23 15:42:03 model:145] Predicted Slots: O O O O device_type O O O house_place
[NeMo I 2021-08-23 15:42:03 model:143] Query : what is the temperature
[NeMo I 2021-08-23 15:42:03 model:144] Predicted Intent: qa_maths
[NeMo I 2021-08-23 15:42:03 model:145] Predicted Slots: O O O O
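A rough way to confirm the missing labels from the dataset files is to count, per intent, how many utterances are tagged entirely "O". This sketch assumes the NeMo assistant-style layout (train.tsv with a tab-separated intent id at the end of each line, train_slots.tsv with space-separated slot ids per word, and a slot dictionary where one entry is "O"); adjust if your files differ:

```python
# Rough check for intents whose utterances carry no slot labels at all.
# Assumes NeMo assistant-style files: train.tsv lines end in "\t<intent-id>",
# train_slots.tsv lines are space-separated slot ids, and the slot
# dictionary (dict.slots.csv) contains an "O" (outside) entry.
from collections import defaultdict

def unlabeled_intents(train_lines, slot_lines, slot_dict):
    """Return, per intent id, the fraction of utterances tagged all-'O'."""
    o_id = str(slot_dict.index("O"))
    total = defaultdict(int)
    all_o = defaultdict(int)
    for sent_line, slot_line in zip(train_lines, slot_lines):
        intent = sent_line.rsplit("\t", 1)[-1]
        total[intent] += 1
        if all(i == o_id for i in slot_line.split()):
            all_o[intent] += 1
    return {i: all_o[i] / total[i] for i in total}

demo = unlabeled_intents(
    ["switch the light on\t0"], ["2 2 0 2"],
    ["device_type", "house_place", "O"])
print(demo)  # {'0': 0.0}
```

Intents reporting a fraction near 1.0 are the ones with effectively no slot annotations.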

I have started work on a version of our NLU that uses NeMo and will base it on a custom dataset. Will keep you informed.

Hi Dusty,
I have a problem when I run asr.py with the matchboxnet model in dustynv/jetson-voice:r32.6.1.
This is the command:

$ python3 examples/asr.py --model matchboxnet --wav data/audio/commands.wav

And I got this error:

[2021-09-04 01:45:55] audio.py:82 - loading audio 'data/audio/commands.wav'
Traceback (most recent call last):
  File "examples/asr.py", line 34, in <module>
    results = asr(samples)
  File "/jetson-voice/jetson_voice/models/asr/asr_engine.py", line 166, in __call__
    length=torch.as_tensor(self.buffer.size, dtype=torch.int64).unsqueeze(dim=0)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/nemo_toolkit-1.0.0rc1-py3.6.egg/nemo/core/classes/common.py", line 770, in __call__
    outputs = wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/nemo_toolkit-1.0.0rc1-py3.6.egg/nemo/collections/asr/modules/audio_preprocessing.py", line 80, in forward
    processed_signal, processed_length = self.get_features(input_signal, length)
  File "/usr/local/lib/python3.6/dist-packages/nemo_toolkit-1.0.0rc1-py3.6.egg/nemo/collections/asr/modules/audio_preprocessing.py", line 389, in get_features
    features = self.featurizer(input_signal)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torchaudio-0.9.0a0+33b2469-py3.6-linux-aarch64.egg/torchaudio/transforms.py", line 583, in forward
    mel_specgram = self.MelSpectrogram(waveform)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torchaudio-0.9.0a0+33b2469-py3.6-linux-aarch64.egg/torchaudio/transforms.py", line 520, in forward
    specgram = self.spectrogram(waveform)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torchaudio-0.9.0a0+33b2469-py3.6-linux-aarch64.egg/torchaudio/transforms.py", line 122, in forward
    self.return_complex,
  File "/usr/local/lib/python3.6/dist-packages/torchaudio-0.9.0a0+33b2469-py3.6-linux-aarch64.egg/torchaudio/functional/functional.py", line 118, in spectrogram
    spec_f = spec_f.reshape(shape[:-1] + spec_f.shape[-2:])
RuntimeError: shape '[1, 154, 2]' is invalid for input of size 79156
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1534, GPU 3511 (MiB)

Would you please give me some advice on fixing this? Thanks.


I can run it in JetPack 4.5

I am having the same problem using JetPack 4.6 and a USB mic:

  File "/usr/local/lib/python3.6/dist-packages/torchaudio-0.9.0a0+33b2469-py3.6-linux-aarch64.egg/torchaudio/functional/functional.py", line 118, in spectrogram
    spec_f = spec_f.reshape(shape[:-1] + spec_f.shape[-2:])
RuntimeError: shape '[1, 154, 2]' is invalid for input of size 79156
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1865, GPU 6960 (MiB)

@Danieljyu2n is it also with matchboxnet? If so, can you use the normal ASR model (quartznet) until I have a chance to look into this more with JetPack 4.6? Thanks.
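In the meantime, one thing worth checking on your side is whether the sample buffer being handed to the preprocessor has the fixed window length the model expects, since matchboxnet classifies fixed-length audio windows. A minimal sketch of that sanity check (the 16000-sample window here, i.e. 1 s at 16 kHz, is an assumption - check the value in your model's config):

```python
# Workaround idea (not a fix for the underlying torchaudio issue):
# pad or truncate the sample buffer to the fixed window length the
# classification model expects before it reaches the preprocessor.
# EXPECTED_SAMPLES is an assumed value - check your model config.
EXPECTED_SAMPLES = 16000  # e.g. 1 s at 16 kHz

def fix_length(samples, n=EXPECTED_SAMPLES):
    """Truncate, or zero-pad on the right, to exactly n samples."""
    if len(samples) >= n:
        return samples[:n]
    return list(samples) + [0.0] * (n - len(samples))
```

If the buffer length coming out of your audio source is something odd like 79156, that at least narrows down where the shape mismatch is introduced.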

Yes, only with matchboxnet. ASR works, but not very well. Most likely the quality of the electret mics is poor, and my robot’s proximity and orientation are variable. Are the tools to retrain the model using my hardware and voice within the container?

To retrain the ASR, you would probably want to use an x86 PC/server with a GPU and the NeMo container.

I’ve not tried training any of the models from jetson-voice on a Jetson, because they are typically bigger models with larger datasets.

Hi guys, any progress on the matchboxnet error below?

[2021-12-10 06:19:01] audio.py:82 - loading audio 'data/audio/commands.wav'
Traceback (most recent call last):
  File "examples/asr.py", line 34, in <module>
    results = asr(samples)
  File "/jetson-voice/jetson_voice/models/asr/asr_engine.py", line 166, in __call__
    length=torch.as_tensor(self.buffer.size, dtype=torch.int64).unsqueeze(dim=0)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/nemo_toolkit-1.0.0rc1-py3.6.egg/nemo/core/classes/common.py", line 770, in __call__
    outputs = wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/nemo_toolkit-1.0.0rc1-py3.6.egg/nemo/collections/asr/modules/audio_preprocessing.py", line 80, in forward
    processed_signal, processed_length = self.get_features(input_signal, length)
  File "/usr/local/lib/python3.6/dist-packages/nemo_toolkit-1.0.0rc1-py3.6.egg/nemo/collections/asr/modules/audio_preprocessing.py", line 389, in get_features
    features = self.featurizer(input_signal)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torchaudio-0.9.0a0+33b2469-py3.6-linux-aarch64.egg/torchaudio/transforms.py", line 583, in forward
    mel_specgram = self.MelSpectrogram(waveform)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torchaudio-0.9.0a0+33b2469-py3.6-linux-aarch64.egg/torchaudio/transforms.py", line 520, in forward
    specgram = self.spectrogram(waveform)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torchaudio-0.9.0a0+33b2469-py3.6-linux-aarch64.egg/torchaudio/transforms.py", line 122, in forward
    self.return_complex,
  File "/usr/local/lib/python3.6/dist-packages/torchaudio-0.9.0a0+33b2469-py3.6-linux-aarch64.egg/torchaudio/functional/functional.py", line 118, in spectrogram
    spec_f = spec_f.reshape(shape[:-1] + spec_f.shape[-2:])
RuntimeError: shape '[1, 154, 2]' is invalid for input of size 79156
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1863, GPU 7666 (MiB)

I’m running a Titan RTX and can retrain if you’d like me to try that on my Ubuntu 18.04 machine.
Vanessa

Hi @vanessa.crosby, sorry I haven’t yet had the time to dig into this on the latest JetPack. If you go back to a prior JetPack, matchboxnet will work. I believe the issue is with the newer version of PyTorch in the container for JetPack 4.6, so I will have to debug that. It is unrelated to the model itself, so you needn’t retrain it.

Which version of JetPack would be optimal?

Hi Vanessa, I’m working on fixing this now - sorry for the delay. Ideally I will have an updated jetson-voice container for JetPack 4.6 to pull in the next day or so.

Awesome. I’ll wait.

OK, do a:

$ sudo docker pull dustynv/jetson-voice:r32.6.1

It will update the container, and the matchboxnet/marblenet models will be working again.

@dusty_nv I am moving forward with our development project and just wanted to touch base with you, Dusty. Since our project is initially command based (eventually adding Q&A), I’ll be using matchboxnet with custom intents and slots. Is jetson-voice still our best bet at this time to capture our ASR input data and NLU text/dashboard command data? If so, how might the migration process look once Riva is available for Jetson NX? Thank you. ~ Vanessa

Hi @vanessa.crosby, the models are actually trained with NeMo - NeMo is included in the jetson-voice containers (both for aarch64 and x86_64), and there are NeMo training scripts included in the jetson-voice/scripts directory:

For example, nemo_train_intent.py is what I use to train intent/slot models. I haven’t trained custom ASR or TTS models before. The dataset formats are the same ones that NeMo uses; they are typically covered in the NeMo docs and tutorials, for example the one on Joint Intent and Slot Filling.
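To give an idea of the layout, here's a toy version of that assistant-style dataset written out from Python (file names follow the NeMo docs as I remember them; the contents are made up, so double-check against the NeMo tutorial before relying on this):

```python
# Toy example of the assistant-style dataset layout that NeMo's joint
# intent/slot model consumes. File names follow the NeMo docs; the
# contents below are invented for illustration.
import os
import tempfile

def write_toy_dataset(root):
    files = {
        # one intent name per line; line number is the intent id
        "dict.intents.csv": "iot_device_on\niot_query\n",
        # one slot name per line; "O" is the outside/no-slot label
        "dict.slots.csv": "device_type\nhouse_place\nO\n",
        # train.tsv: header line, then sentence<TAB>intent-id
        "train.tsv": "sentence\tlabel\n"
                     "turn the light on in the office\t0\n",
        # train_slots.tsv: one space-separated slot id per word
        "train_slots.tsv": "2 2 0 2 2 2 1\n",
    }
    for name, text in files.items():
        with open(os.path.join(root, name), "w") as f:
            f.write(text)
    return sorted(files)

root = tempfile.mkdtemp()
print(write_toy_dataset(root))
# ['dict.intents.csv', 'dict.slots.csv', 'train.tsv', 'train_slots.tsv']
```

In the slots line, "light" gets id 0 (device_type) and "office" gets id 1 (house_place); every other word is 2 ("O").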


Local machine:
Hardware - Xavier NX (4 cores running)
Architecture - aarch64
Operating System - JetPack 4.6.1 w/ Bionic Beaver

Dusty, I pulled the new image but I get this:

Error = docker: Error response from daemon: Unknown runtime specified nvidia.

I attempted to solve it 4 times with different methods, the latest being configuring the daemon.json file with "default-runtime": "nvidia". Unsuccessful. Can you point me to the solution? If there are any changes I can make in the files/scripts, I’m happy to help. I know you are swamped. I’ll branch and commit for your approval, just let me know. Thank you ~V

nx1@nx1:~/Desktop/projects/jetson-voice$ docker/run.sh
ARCH:  aarch64
reading L4T version from /etc/nv_tegra_release
L4T BSP Version:  L4T R32.6.1
CONTAINER:     dustynv/jetson-voice:r32.6.1
DEV_VOLUME:    
DATA_VOLUME:   --volume /home/nx1/Desktop/projects/jetson-voice/data:/jetson-voice/data
USER_VOLUME:   
USER_COMMAND:  
docker: Error response from daemon: Unknown runtime specified nvidia.

@vanessa.crosby I don’t think it’s necessarily related to the jetson-voice project, but rather to your docker install and the nvidia container runtime. Are you able to start the l4t-base container?

sudo docker run -it --rm --net=host --runtime nvidia nvcr.io/nvidia/l4t-base:r32.6.1

Does checking docker info for the runtimes show the nvidia runtime?

sudo docker info | grep Runtimes
 Runtimes: io.containerd.runtime.v1.linux nvidia runc io.containerd.runc.v2

If not, you may need to re-install docker, etc. - these are some of the NVIDIA container packages:

apt-cache search nvidia-container
libnvidia-container-tools - NVIDIA container runtime library (command-line tools)
libnvidia-container0 - NVIDIA container runtime library
nvidia-container-csv-cuda - Jetpack CUDA CSV file
nvidia-container-csv-cudnn - Jetpack CUDNN CSV file
nvidia-container-csv-tensorrt - Jetpack TensorRT CSV file
nvidia-container-csv-visionworks - Jetpack VisionWorks CSV file
nvidia-container-runtime - NVIDIA container runtime
nvidia-container-toolkit - NVIDIA container runtime hook
nvidia-docker2 - nvidia-docker CLI wrapper
nvidia-container - NVIDIA Container Meta Package
nvidia-container-csv-opencv - Jetpack OpenCV CSV file

If the issue persists, you may just want to re-flash the device or SD card to get it back in a working state.
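For reference, once those packages are installed, the daemon.json entry you mentioned should look like this (this is the standard configuration from the nvidia-docker docs, with the runtime both registered and set as default):

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

Note that docker needs to be restarted afterwards (sudo systemctl restart docker) for the runtime to show up in docker info.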

No, I cannot start the l4t-base container. I did a fresh install of docker, but I’ll go ahead and reflash…

Here are my runtimes:

nx1@nx1:~$ sudo docker info | grep Runtimes
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux