[jetson-voice] ASR/NLP/TTS for Jetson

jetson-voice is an ASR/NLP/TTS deep learning inference library for Jetson Nano, TX1/TX2, Xavier NX, and AGX Xavier. It supports Python and requires JetPack 4.4.1 or newer. The DNN models were trained with NeMo and are deployed with TensorRT for optimized performance. All computation is performed on the onboard GPU.

Currently the following capabilities are included:

- Automatic speech recognition (ASR), including streaming ASR (QuartzNet) and command/keyword recognition (MatchboxNet)
- Natural language processing (NLP): joint intent/slot classification, text classification (e.g. sentiment analysis), token classification (e.g. named entity recognition), and question answering (QA)
- Text-to-speech (TTS)

The NLP models use the DistilBERT transformer architecture for reduced memory usage and increased performance. For samples of the text-to-speech output, see the TTS Audio Samples section.

See the GitHub repo here: https://github.com/dusty-nv/jetson-voice


@dusty_nv

I have developed an NLU engine with TensorFlow 2, which we now use on our Hospital Intelligent Automation Server. I used MITIE for NER, but it is proving difficult to extract entities correctly when more than two entities are involved; also, while training time for the NLU is short, MITIE training takes hours.

How well does the slot filling work with small amounts of training data? The intents and entities are quite specific to the network, e.g. "Turn the DEVICE on in the ZONE", "Tell me the MEASUREMENT in the ZONE", etc. I am definitely considering using NeMo for our next release; do you think it would function well with around ten to twenty examples per intent?

EDIT: reworded to remove tags.
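For reference, slot filling assigns one tag per token (O meaning no slot). A minimal sketch of the "Turn the DEVICE on in the ZONE" pattern, with entirely hypothetical tag names:

```python
# Hypothetical per-token slot tagging for "turn the DEVICE on in the ZONE";
# the tag names "device" and "zone" are illustrative only.
utterance = "turn the heater on in the kitchen"
slots = ["O", "O", "device", "O", "O", "O", "zone"]

tokens = utterance.split()
assert len(tokens) == len(slots)  # one slot tag per token

# Collect the (slot, value) pairs the tagger recovered
entities = [(tag, tok) for tok, tag in zip(tokens, slots) if tag != "O"]
print(entities)  # [('device', 'heater'), ('zone', 'kitchen')]
```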

I’m not exactly sure about the “minimum” training data - for reference, an internal dataset we have has ~75 examples per intent. So you may or may not need more - you can test and add more as necessary. For my models I have also made some dataset-specific Python scripts that automatically generate the dataset based on common phrases/sayings that are permuted and combined.
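The post only says the scripts permute and combine common phrases, so the templates and vocabulary below are purely illustrative, but a generator along those lines can be as simple as:

```python
import itertools

# Hypothetical templates and fill-in values; the real scripts are
# dataset-specific, so treat this purely as a sketch of the idea.
templates = ["turn the {device} {state} in the {zone}",
             "please switch the {device} {state} in the {zone}"]
devices = ["light", "lamp", "heater"]
states = ["on", "off"]
zones = ["kitchen", "office", "bedroom"]

# Every combination of template x device x state x zone
examples = [t.format(device=d, state=s, zone=z)
            for t, d, s, z in itertools.product(templates, devices, states, zones)]
print(len(examples))  # 2 * 3 * 2 * 3 = 36 generated utterances
```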

Training the intent/slot models with NeMo doesn’t take long on a PC/server with a discrete GPU, as you typically only need to train for a handful of epochs. To put it in perspective, it takes about 15 minutes (or less) to train on my laptop with a GeForce 1070M. I use the DistilBERT network because it’s significantly faster for inference while maintaining most of the accuracy of BERT-Base.


Hi Dusty, thanks for the update. I trained the NeMo intents and slots model for 100 epochs on a 1050 Ti (I have a 1080, but it is in a Windows machine) and it performed quite badly for entities; intents seem fine:

[NeMo I 2021-08-23 15:42:03 model:143] Query : set alarm for seven thirty am
[NeMo I 2021-08-23 15:42:03 model:144] Predicted Intent: alarm_set
[NeMo I 2021-08-23 15:42:03 model:145] Predicted Slots: O O O time time time
[NeMo I 2021-08-23 15:42:03 model:143] Query : lower volume by fifty percent
[NeMo I 2021-08-23 15:42:03 model:144] Predicted Intent: audio_volume_down
[NeMo I 2021-08-23 15:42:03 model:145] Predicted Slots: O O O O O
[NeMo I 2021-08-23 15:42:03 model:143] Query : what is my schedule for tomorrow
[NeMo I 2021-08-23 15:42:03 model:144] Predicted Intent: calendar_query
[NeMo I 2021-08-23 15:42:03 model:145] Predicted Slots: O O O O O date

I also tested some of the IoT samples, which also performed badly, but on checking the dataset the IoT slots are not labelled, which would explain it:

[NeMo I 2021-08-23 15:42:03 model:143] Query : switch the light on in the office
[NeMo I 2021-08-23 15:42:03 model:144] Predicted Intent: iot_hue_lighton
[NeMo I 2021-08-23 15:42:03 model:145] Predicted Slots: O O O O O O O
[NeMo I 2021-08-23 15:42:03 model:143] Query : will you turn the lamp off in the bedroom
[NeMo I 2021-08-23 15:42:03 model:144] Predicted Intent: iot_hue_lightoff
[NeMo I 2021-08-23 15:42:03 model:145] Predicted Slots: O O O O device_type O O O house_place
[NeMo I 2021-08-23 15:42:03 model:143] Query : what is the temperature
[NeMo I 2021-08-23 15:42:03 model:144] Predicted Intent: qa_maths
[NeMo I 2021-08-23 15:42:03 model:145] Predicted Slots: O O O O
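One quick way to confirm that suspicion is to scan the slot-annotation lines and count how many contain anything other than O. A rough sketch, using lines in the style of the predictions above as stand-in data (the real check would read the dataset's slots file):

```python
# Stand-in slot-annotation lines, one space-separated tag per token,
# mimicking the predictions above; a real check would read them from
# the dataset's slot-label file instead.
lines = [
    "O O O time time time",
    "O O O O O",
    "O O O O device_type O O O house_place",
    "O O O O O O O",
]

def has_labelled_slots(line):
    """True if any token in the line carries a non-O slot tag."""
    return any(tag != "O" for tag in line.split())

labelled = sum(has_labelled_slots(l) for l in lines)
print(f"{labelled}/{len(lines)} samples have at least one slot label")
```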

I have started work on a version of our NLU that uses NeMo and will base it on a custom dataset. I will keep you informed.

Hi Dusty,
I have a problem when I run asr.py with the matchboxnet model in dustynv/jetson-voice:r32.6.1.
This is the command:

$ python3 examples/asr.py --model matchboxnet --wav data/audio/commands.wav

And I got this error:

[2021-09-04 01:45:55] audio.py:82 - loading audio 'data/audio/commands.wav'
Traceback (most recent call last):
  File "examples/asr.py", line 34, in <module>
    results = asr(samples)
  File "/jetson-voice/jetson_voice/models/asr/asr_engine.py", line 166, in __call__
    length=torch.as_tensor(self.buffer.size, dtype=torch.int64).unsqueeze(dim=0)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/nemo_toolkit-1.0.0rc1-py3.6.egg/nemo/core/classes/common.py", line 770, in __call__
    outputs = wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/nemo_toolkit-1.0.0rc1-py3.6.egg/nemo/collections/asr/modules/audio_preprocessing.py", line 80, in forward
    processed_signal, processed_length = self.get_features(input_signal, length)
  File "/usr/local/lib/python3.6/dist-packages/nemo_toolkit-1.0.0rc1-py3.6.egg/nemo/collections/asr/modules/audio_preprocessing.py", line 389, in get_features
    features = self.featurizer(input_signal)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torchaudio-0.9.0a0+33b2469-py3.6-linux-aarch64.egg/torchaudio/transforms.py", line 583, in forward
    mel_specgram = self.MelSpectrogram(waveform)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torchaudio-0.9.0a0+33b2469-py3.6-linux-aarch64.egg/torchaudio/transforms.py", line 520, in forward
    specgram = self.spectrogram(waveform)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torchaudio-0.9.0a0+33b2469-py3.6-linux-aarch64.egg/torchaudio/transforms.py", line 122, in forward
    self.return_complex,
  File "/usr/local/lib/python3.6/dist-packages/torchaudio-0.9.0a0+33b2469-py3.6-linux-aarch64.egg/torchaudio/functional/functional.py", line 118, in spectrogram
    spec_f = spec_f.reshape(shape[:-1] + spec_f.shape[-2:])
RuntimeError: shape '[1, 154, 2]' is invalid for input of size 79156
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1534, GPU 3511 (MiB)
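For what it's worth, the RuntimeError above is a plain element-count mismatch: the target shape [1, 154, 2] holds only 308 elements, while the tensor has 79156. Notably 79156 = 154 * 257 * 2, and 257 frequency bins is what an STFT with n_fft=512 produces, which suggests the reshape is not expecting a frequency dimension (this pattern is often a torch/torchaudio version mismatch around complex STFT output; treat that diagnosis as an assumption). A small NumPy sketch of the arithmetic:

```python
import numpy as np

# The target shape's element count does not match the tensor's
assert 1 * 154 * 2 == 308
# ...but the tensor factors cleanly with a 257-bin frequency axis
assert 154 * 257 * 2 == 79156

# Reproducing the same class of error: reshape requires the element
# counts to match exactly
x = np.zeros(79156)
try:
    x.reshape(1, 154, 2)
except ValueError as e:
    print("reshape failed:", e)
```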

Would you please give me some advice on how to fix this? Thanks.


I can run it in JetPack 4.5.