Jetson AI Lab - Home Assistant Integration

JETSON AI LAB RESEARCH GROUP

This thread is for discussion surrounding the integration of Home Assistant with Jetson and the optimized models and agents under development from Jetson AI Lab. Considering the scope and complexity of Home Assistant, this will be a long-term, multi-phase project following this initial approach:

  1. Rebuild the Rhasspy Wyoming protocol containers for ARM64+CUDA with support for GPU-optimized versions of the most popular models there (including Piper TTS, FasterWhisper/Whisper.cpp, and openWakeWord). This will enable Home Assistant users to host these Wyoming services locally on their Jetson with lower latencies. These modified containers should end up in the jetson-containers CI/CD build system so they are automatically rebuilt against CUDA and distributed on DockerHub.

  2. Add Wyoming wrappers/containers for Riva ASR/TTS to further enhance the performance.

  3. Provide optimized LLM backends through the Home Assistant Conversation Agent interface.

  4. The Home Assistant maintainers are currently scoping/defining additional AI tasks for which to expose interfaces, including multimodal ones (image descriptions/queries). Implement support for these as they become available.

  5. For higher-level agents that implement flows defined outside of the system (for example, running on external platforms like robots), utilize the Home Assistant REST API and sensor interfaces like mic_external/snd_external to hook into the system externally.

In parallel, engineers from Seeed Studio will be investigating deploying the core Home Assistant Supervisor to Jetson so that the entire system can run onboard, self-contained. Thanks to Mieszko Syty for getting us started with the homeassistant-core container already!

This is an exciting project given the vast number of IoT devices that Home Assistant supports and its wide user base, many of whom are also interested in developing smart assistants. As the integration with Jetson progresses, we will be able to leverage the work being done elsewhere on model optimization and the advancement of intelligent agents to bring these locally-hosted capabilities into people's homes everywhere.

Anyone who wants to participate, please feel welcome to jump in and join the efforts!

5 Likes

Hi, I’m Mike 👋 the “voice guy” at Nabu Casa (the company that funds Home Assistant development) and author of the Wyoming protocol mentioned above.

I’m happy to answer any questions related to Home Assistant and A.I./voice 🙂

A bit of background that may help with understanding: Home Assistant’s voice stack is based on “pipelines”, which do the typical steps of a voice assistant. Pipelines can have the following components:

  • Wake word detection
  • Speech to text
  • Intent recognition and handling
  • Text to speech

Each component can be swapped out, and there are multiple implementations in Home Assistant. One of those implementations uses a small protocol I developed called “Wyoming” that is little more than JSON messages with an optional binary payload over TCP. This was designed with small satellite devices in mind like Espressif’s ESP chips.
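To make the "JSON messages with an optional binary payload over TCP" idea concrete, here is a minimal framing sketch in Python. The field names (`type`, `data`, `payload_length`) and the `audio-chunk` event are illustrative assumptions for this example, not an authoritative rendering of the protocol spec:

```python
import json

def encode_event(event_type, data=None, payload=b""):
    """Frame a Wyoming-style event: one line of JSON, then an optional
    binary payload whose length is declared in the header."""
    header = {"type": event_type, "data": data or {}}
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode("utf-8") + b"\n" + payload

def decode_event(buf):
    """Split a framed event back into (header dict, payload bytes)."""
    line, _, rest = buf.partition(b"\n")
    header = json.loads(line)
    payload = rest[: header.get("payload_length", 0)]
    return header, payload

# Round-trip a hypothetical audio chunk event
msg = encode_event("audio-chunk",
                   {"rate": 16000, "width": 2, "channels": 1},
                   b"\x00\x01")
header, payload = decode_event(msg)
```

The appeal of this style of framing is that the JSON header stays trivially parseable on tiny satellite devices, while raw audio rides along as opaque bytes.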

Adding Wyoming-compatible services to a Jetson box would allow Home Assistant users to immediately plug those services into their existing voice assistant pipelines. There is a community PR to add Docker builds with GPU acceleration for most of the existing Wyoming services.

Audio input and output typically happens on a voice satellite, which could be a robot. Satellites send audio data into a Home Assistant voice pipeline, and get back events at different stages of the pipeline. There is a websocket API for doing this, as well as an implementation based on Wyoming. In either case, the overall flow of information looks like this:

  1. Satellite streams audio to Home Assistant
  2. Home Assistant runs the pipeline
  3. Each component of the pipeline receives data from and sends data to Home Assistant
  4. Home Assistant sends events (and possibly audio) to the satellite

Note that for (1) the satellite can do its own local wake word detection or continuously stream audio to Home Assistant and have it done remotely.
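The four-step flow above can be sketched as a toy event loop; the function and event names here are hypothetical stand-ins, not the actual websocket or Wyoming APIs:

```python
def run_satellite(mic_chunks, pipeline):
    """Stream audio chunks into a pipeline and collect the events it emits."""
    events = []
    for chunk in mic_chunks:            # 1. satellite streams audio
        events.extend(pipeline(chunk))  # 2-3. Home Assistant runs the pipeline,
                                        #      each component exchanging data with it
    return events                       # 4. events flow back to the satellite

# Stand-in pipeline that "detects" a wake word in every chunk
fake_pipeline = lambda chunk: [("wake-word-detected", chunk)]
events = run_satellite([b"chunk1", b"chunk2"], fake_pipeline)
```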

Hope this helps, and I’m excited to see where this collaboration leads 🤖


@dusty_nv Small nitpick: the project is titled “Home Assistant”, not “Home-Assistant dot io”, and there is a new logo.

2 Likes

Hello everyone! 👋 My name is Mieszko, and I’m a heavy user of Home Assistant alongside the Jetson AGX Orin.

Recently, I took the initiative to craft the inaugural Dockerfile for running homeassistant-core on Jetson devices. Although it’s still in its early stages and encountering a few hiccups, I’m committed to refining it. With more time dedicated to this project in the coming weeks, I anticipate smoother sailing ahead.

For those interested in joining the journey, whether it’s testing the container on the edge or contributing to the MLOps aspect, your involvement is greatly appreciated. Please feel free to raise any issues or submit pull requests for this integration on our GitHub repository: Jetson Containers - homeassistant-core.

Looking forward to collaborating with you all!

2 Likes

Thanks @narandill and @hansen.mike! - updated the Home Assistant naming and logo 😄

@hansen.mike one thing I meant to ask you yesterday - what’s the timeline to merge this PR for streaming Piper TTS? Looking forward to trying the PR, just curious about the plans (presumably this will also entail a re-export of the Piper models on HuggingFace Hub to include the separate encoder/decoder ONNX models)

Some Piper TTS numbers on the AGX Orin 64GB to get the conversation started:

  • Platform: Orin AGX 64GB
  • Container: dustynv/onnxruntime

CPU via ONNX Runtime

Model                     RTF
en_US-lessac-high.onnx    0.2217
en_US-lessac-medium.onnx  0.0481
en_US-lessac-low.onnx     0.0372

TensorRT via ONNX Runtime TensorRT

Still debugging an Illegal Instruction when loading the model

Oh thanks @michael_gruner! Can you try the onnxruntime CUDAExecutionProvider as a fallback for TensorRT?

Does 0.22 RTF mean ~4x slower than realtime, or is it the inverse (RTFX) and ~4x faster than realtime?

0.22 means 4x faster than realtime. In other words, as printed by piper:

Real-time factor: 0.2214348214359162 (infer=13.594672015999999 sec, audio=61.39356009070295 sec)
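For anyone following along, the arithmetic behind that log line is just inference time divided by audio duration (plain Python, no piper dependency needed):

```python
def rtf(infer_sec, audio_sec):
    """Real-time factor = inference time / audio duration (lower is better)."""
    return infer_sec / audio_sec

# Numbers straight from the piper log above:
r = rtf(13.594672015999999, 61.39356009070295)
speedup = 1.0 / r  # inverse of RTF: how many times faster than realtime
```

So an RTF of ~0.22 corresponds to synthesizing audio roughly 4.5x faster than it takes to play it back.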
2 Likes

Guude 👋! I’m Thorsten, an enthusiast of open source voice technology, a user of Home Assistant, and owner of a Jetson AGX Xavier device. That being said, I’m really excited to see where this journey is going 😊. Not sure if I can be helpful, but I’m definitely interested.

2 Likes

Awesome @thorsten-voice, welcome! At the very least sounds like you would make a great beta tester when we have something to try! 😄

@michael_gruner that is good to know it is fast enough even on CPU (presumably on the Nano CPU too). While we obviously want it running well on CUDA (ideally TensorRT), if there are spare CPU cores unused by the application (mine typically don’t use much CPU, really), then for Nano in particular it could be beneficial to optionally run TTS on a CPU core instead, leaving the GPU dedicated to the LLM. That presumes Piper on CPU isn’t already multithreaded through onnxruntime and consuming 100% CPU in those benchmarks.

Hi @dusty_nv,

Here is my previous investigation of the inference results I obtained for Piper ONNX on the Jetson AGX Orin. Unfortunately, I was not able to convert the model to the TensorRT format. @hansen.mike may have a better understanding of this.

Best regards,
Shakhizat

Thanks @shahizat, looks good even without TensorRT …TIL about onnxruntime’s cudnn_conv_algo_search setting!

I can prioritize this PR if necessary, but I had put it on hold for exactly the reason you mentioned (re-exporting all of the Piper models) 😄 Additionally, I wanted to alter the ONNX output to include the phoneme timings so people could synchronize animated lips with the TTS output.

As @michael_gruner and @shahizat have shown, Piper can run quite fast on the CPU alone. There is also already a raw output mode that will stream audio out as each individual sentence is synthesized. So it may be possible already to achieve a reasonable level of “interruptibility” and real-time response because (1) only the very first sentence needs to finish synthesis before the user will hear it, and (2) the subsequent sentences will almost surely be ready by the time the first sentence is finished speaking.
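That sentence-level streaming idea can be sketched in a few lines; `synthesize` here is a stand-in callable, not piper's actual API, and the sentence splitter is deliberately naive:

```python
import re

def stream_sentences(text, synthesize):
    """Yield synthesized audio one sentence at a time, so playback can start
    as soon as the first sentence finishes (like piper's raw output mode)."""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sentence:
            yield synthesize(sentence)

# Stand-in synthesizer: pretend each character is one audio sample
chunks = list(stream_sentences("Hello there. How are you?", lambda s: len(s)))
```

Because the generator yields as each sentence completes, the consumer can begin playback (or interrupt) without waiting for the whole paragraph.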

OK, yes @hansen.mike - agreed about the raw output mode and streaming by interleaving the generation at the sentence level (we can try that first). In my own stuff, I also buffer the TTS by punctuation as it comes in from the LLM, because TTS always sounds more natural when given complete phrases (ideally sentences).

It can also accumulate the chunks until it determines there will be an audio gap-out (based on the ongoing RTFX and the audio duration produced thus far), so that it gets the first audio back as quickly as possible (from just the first sentence), and thereafter generates the rest of the paragraph/etc., which also improves the voice flow. But in this ‘raw output mode’ I would probably disable that and stick to sentence-by-sentence, since it’s not actually streaming, and doing multiple sentences may gap out.
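The punctuation-buffering trick looks roughly like this; a simplified sketch, since real LLM token streams and phrase-boundary rules are messier:

```python
def buffer_by_punctuation(tokens, boundaries=".!?"):
    """Accumulate streamed LLM tokens into complete phrases, emitting to TTS
    only at punctuation boundaries so the speech sounds natural."""
    buf = ""
    for tok in tokens:
        buf += tok
        if buf and buf[-1] in boundaries:
            yield buf.strip()
            buf = ""
    if buf.strip():  # flush any trailing partial phrase at end of stream
        yield buf.strip()

# Tokens arriving incrementally from a hypothetical LLM stream
phrases = list(buffer_by_punctuation(["Hi", " there", ".", " All", " good", "?"]))
```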

I’m currently working on some things with TRT-LLM and an ollama container someone submitted a PR for, but after that will add Piper to jetson-containers. It will automatically be built on top of the desired CUDA and onnxruntime. Then on top of that can go the Wyoming version of the container.

A bunch more numbers, for completeness:

  • Platform: AGX Orin 64GB
  • Container: dustynv/onnxruntime
  • CPU usage measured per core (100%==1 core, 200%==2 cores, etc…)
  • RTF = Real time factor = inference_time/audio_time (lower is better)
  • RTF presented for different audio generation lengths
  • Test:
cat etc/test_sentences/en.txt | ./install/piper -m en_US-lessac-high.onnx --output-raw --use-cuda > /dev/null

CPU ExecutionProvider

Model                     10 sec (RTF)  1 min (RTF)  10 min (RTF)  GPU (%)  CPU (%)
en_US-lessac-high.onnx    0.215         0.215        0.218         0        1100
en_US-lessac-medium.onnx  0.049         0.047        0.047         0        1000
en_US-lessac-low.onnx     0.041         0.039        0.037         0        1000

CUDA Execution Provider

Model                     10 sec (RTF)  1 min (RTF)  10 min (RTF)  GPU (%)  CPU (%)
en_US-lessac-high.onnx    0.127         0.0415       0.0257        100      50
en_US-lessac-medium.onnx  0.119         0.029        0.013         70       77
en_US-lessac-low.onnx     0.113         0.029        0.011         45       80

TensorRT ExecutionProvider

  • Engine is painfully slow to build, still figuring out why
  • Not using DLA, unfortunately

Model                     10 sec (RTF)  1 min (RTF)  10 min (RTF)  GPU (%)  CPU (%)
en_US-lessac-high.onnx    0.015  0.013  100 (partial results)
en_US-lessac-medium.onnx
en_US-lessac-low.onnx

Well done, @michael_gruner!!! Very informative results! I wonder if it’s possible to directly convert the model to TensorRT format using the trtexec tool, without the TensorRT ExecutionProvider?

@dusty_nv As you may have noticed, I created a new PR with some fixes for the homeassistant-core container: fix(homeassistant-core): Fix Dockerfile, add S6, missing deps, remove dev deps by ms1design · Pull Request #467 · dusty-nv/jetson-containers · GitHub

Unfortunately it’s crippled in the same way as the previous experimental version – it fails on user onboarding due to an HA codebase error: New Install: Onboarding failed - KeyError: 'component.onboarding.area.living_room' - Installation - Home Assistant Community. @hansen.mike could you or someone from HA take a look at this please?

Thanks @narandill, merged that! 👍

piper / onnxruntime

Thanks for the benchmarks @michael_gruner, not sure why CUDA is slower than CPU for the 10-sec benchmark, unless that was just the GPU warming up. Regardless, I refactored the onnxruntime container in commit 542a0644 to build against any CUDA, and will add the piper container on top of it this weekend.

ollama

@remy415 has contributed a working ollama container in this PR, which I merged and tested, and have uploaded images to DockerHub for.

Hopefully this also aids in our Home Assistant integration! Thank you to the ollama maintainers! 🥳

1 Like

@dusty_nv here’s a WIP PR with the Dockerfile for piper. It’s working, I just need to add the README, test.sh, etc…

@shahizat thanks! The ONNX model contains operators that are not available in stock TensorRT. In order to convert it to an engine, we would need to implement those operators in the form of a plug-in and feed them to the deployment tool. However, the model is so fast right now that I wonder if it’s even worth the effort.

Looking good @michael_gruner, thank you 🙏

Will try building this later - on my end, the issue I see is that in the deployed container I just install the onnxruntime-gpu wheel from our pip server, but those wheels don’t include the onnxruntime C++ API…

This is an issue with a couple of other containers with C++ APIs, like FAISS and TensorRT-LLM. PyTorch, on the other hand, conveniently includes its C++ API in its wheels.

I have two ideas on how to resolve this (in a way that scales to building containers for any CUDA combination/etc.):

  1. Create an additional pip wheel and setup.py for the extra files for each of these C++ packages. The pros: it reuses the same pip server infrastructure. The downsides: the creation/maintenance of these setup.py files, the kludgery of using package_data in setup.py, and the fact that these files would still be installed by pip under /usr/local/lib/python3.XX/dist-packages/ and would need extra steps in the Dockerfile to move/link them to the expected locations (i.e. under /usr/local/include and /usr/local/lib).

  2. Run another HTTP server alongside our pip server for storing release tarballs, and have the builder Dockerfiles automatically upload them to this server. The deploy Dockerfile will download/extract the correct one (per your desired CUDA version). The downsides: running another server instance (not such a big deal), and keeping the package version names consistent with the pip wheels (which pip normally handles).

I wish there were a third option to just run an apt HTTP server for Debian packages; however, seeing as none of the above projects already include support for building Debian packages, I don’t really want to get into that (just as I don’t really want to build custom wheels). So I will probably start by giving option #2 a crack.
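For option #2, the version-consistency concern mostly comes down to agreeing on a deterministic tarball naming scheme that mirrors the pip wheel versions. A hypothetical sketch (the naming pattern here is made up for illustration, not a decided convention):

```python
def tarball_name(package, version, cuda):
    """Derive a release tarball filename from the same (package, version, CUDA)
    tuple the pip wheels use, so builder uploads and deploy downloads agree."""
    return f"{package}-{version}-cu{cuda.replace('.', '')}-aarch64.tar.gz"

# Builder and deploy sides both compute the same name independently
name = tarball_name("onnxruntime", "1.17.0", "12.2")
```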

We could also grab stuff from other images in the Dockerfile. Something like:


COPY --from=onnxruntime:r35.4.1 /usr/local/lib/libonnxruntime.so /usr/local/lib/

Or we could have a PPA hosting Deb metapackages that distribute everything, including the wheel. Similar to how TensorRT or CUDA does it.