Jetson AI Lab - Home Assistant Integration

JETSON AI LAB RESEARCH GROUP

This thread is for discussion surrounding the integration of Home Assistant with Jetson and the optimized models and agents under development from Jetson AI Lab. Considering the scope and complexity of Home Assistant, this will be a long-term, multi-phase project following this initial approach:

  1. Rebuild the Rhasspy Wyoming protocol containers for ARM64+CUDA with support for GPU-optimized versions of the most popular models there (including Piper TTS, FasterWhisper/Whisper.cpp, and openWakeWord). This will enable Home Assistant users to host these Wyoming services locally on their Jetson with lower latencies. These modified containers should end up in the jetson-containers CI/CD build system so they are automatically rebuilt against CUDA and distributed on DockerHub.

  2. Add Wyoming wrappers/containers for Riva ASR/TTS to further enhance the performance.

  3. Provide optimized LLM backends through the Home Assistant Conversation Agent interface.

  4. The Home Assistant maintainers are currently scoping/defining additional AI tasks for which to expose interfaces, including multimodal ones (image descriptions/queries). Implement support for these as they become available.

  5. For higher-level agents that implement flows defined outside of the system (for example, running on external platforms like robots), utilize the Home Assistant REST API and sensor interfaces like mic_external/snd_external to hook into the system externally.

In parallel, engineers from Seeed Studio will be investigating deploying the core Home Assistant Supervisor to Jetson so that the entire system can run onboard, self-contained. Thanks to Mieszko Syty for getting us started with the homeassistant-core container already!

This is an exciting project given the vast number of IoT devices that Home Assistant supports and its wide user base, many of whom are also interested in developing smart assistants. As the integration with Jetson progresses, we will be able to leverage the work being done elsewhere on model optimization and the advancement of intelligent agents to bring these locally-hosted capabilities into people's homes everywhere.

Anyone who wants to participate, please feel welcome to jump in and join the efforts!

5 Likes

Hi, I’m Mike 👋 the “voice guy” at Nabu Casa (the company that funds Home Assistant development) and author of the Wyoming protocol mentioned above.

I’m happy to answer any questions related to Home Assistant and A.I./voice 🙂

A bit of background that may help with understanding: Home Assistant’s voice stack is based on “pipelines”, which do the typical steps of a voice assistant. Pipelines can have the following components:

  • Wake word detection
  • Speech to text
  • Intent recognition and handling
  • Text to speech

Each component can be swapped out, and there are multiple implementations in Home Assistant. One of those implementations uses a small protocol I developed called “Wyoming” that is little more than JSON messages with an optional binary payload over TCP. This was designed with small satellite devices in mind like Espressif’s ESP chips.
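To make the "JSON messages with an optional binary payload over TCP" idea concrete, here is a minimal framing sketch in Python. The field names (`type`, `data`, `payload_length`) and the `audio-chunk` event are illustrative assumptions for this example, not an authoritative rendering of the protocol spec:

```python
import json

def encode_event(event_type, data=None, payload=b""):
    """Frame a Wyoming-style event: one line of JSON, then an optional
    binary payload whose length is declared in the header."""
    header = {"type": event_type, "data": data or {}}
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode("utf-8") + b"\n" + payload

def decode_event(buf):
    """Split a framed event back into (header dict, payload bytes)."""
    line, _, rest = buf.partition(b"\n")
    header = json.loads(line)
    payload = rest[: header.get("payload_length", 0)]
    return header, payload

# Round-trip a hypothetical audio chunk event
msg = encode_event("audio-chunk",
                   {"rate": 16000, "width": 2, "channels": 1},
                   b"\x00\x01")
header, payload = decode_event(msg)
```

The appeal of this style of framing is that the JSON header stays trivially parseable on tiny satellite devices, while raw audio rides along as opaque bytes.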

Adding Wyoming-compatible services to a Jetson box would allow Home Assistant users to immediately plug those services into their existing voice assistant pipelines. There is a community PR to add Docker builds with GPU acceleration for most of the existing Wyoming services.

Audio input and output typically happens on a voice satellite, which could be a robot. Satellites send audio data into a Home Assistant voice pipeline, and get back events at different stages of the pipeline. There is a websocket API for doing this, as well as an implementation based on Wyoming. In either case, the overall flow of information looks like this:

  1. Satellite streams audio to Home Assistant
  2. Home Assistant runs the pipeline
  3. Each component of the pipeline receives data from and sends data to Home Assistant
  4. Home Assistant sends events (and possibly audio) to the satellite

Note that for (1) the satellite can do its own local wake word detection or continuously stream audio to Home Assistant and have it done remotely.
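The four-step flow above can be sketched as a toy event loop; the function and event names here are hypothetical stand-ins, not the actual websocket or Wyoming APIs:

```python
def run_satellite(mic_chunks, pipeline):
    """Stream audio chunks into a pipeline and collect the events it emits."""
    events = []
    for chunk in mic_chunks:            # 1. satellite streams audio
        events.extend(pipeline(chunk))  # 2-3. Home Assistant runs the pipeline,
                                        #      each component exchanging data with it
    return events                       # 4. events flow back to the satellite

# Stand-in pipeline that "detects" a wake word in every chunk
fake_pipeline = lambda chunk: [("wake-word-detected", chunk)]
events = run_satellite([b"chunk1", b"chunk2"], fake_pipeline)
```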

Hope this helps, and I’m excited to see where this collaboration leads 🤖


@dusty_nv Small nitpick: the project is titled “Home Assistant”, not “Home-Assistant dot io”, and there is a new logo.

2 Likes

Hello everyone! 👋 My name is Mieszko, and I’m a heavy user of Home Assistant alongside the Jetson AGX Orin.

Recently, I took the initiative to craft the inaugural Dockerfile for running homeassistant-core on Jetson devices. Although it’s still in its early stages and encountering a few hiccups, I’m committed to refining it. With more time dedicated to this project in the coming weeks, I anticipate smoother sailing ahead.

For those interested in joining the journey, whether it’s testing the container on the edge or contributing to the MLOps aspect, your involvement is greatly appreciated. Please feel free to raise any issues or submit pull requests for this integration on our GitHub repository: Jetson Containers - homeassistant-core.

Looking forward to collaborating with you all!

2 Likes

Thanks @narandill and @hansen.mike! - updated the Home Assistant naming and logo 😄

@hansen.mike one thing I meant to ask you yesterday - what’s the timeline to merge this PR for streaming Piper TTS? Looking forward to trying the PR, just curious about the plans (presumably this will also entail a re-export of the Piper models on HuggingFace Hub to include the separate encoder/decoder ONNX models)

Some Piper TTS numbers on the AGX Orin 64GB to get the conversation started:

  • Platform: Orin AGX 64GB
  • Container: dustynv/onnxruntime

CPU via ONNX Runtime

Model                     RTF
en_US-lessac-high.onnx    0.2217
en_US-lessac-medium.onnx  0.0481
en_US-lessac-low.onnx     0.0372

TensorRT via ONNX Runtime TensorRT

Still debugging an Illegal Instruction when loading the model

Oh thanks @michael_gruner! Can you try the onnxruntime CUDAExecutionProvider as a fallback for TensorRT?

Does 0.22 RTF mean ~4x slower than realtime, or is it the inverse (RTFX) and ~4x faster than realtime?

0.22 means 4x faster than realtime. In other words, as printed by piper:

Real-time factor: 0.2214348214359162 (infer=13.594672015999999 sec, audio=61.39356009070295 sec)
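For anyone following along, the arithmetic behind that log line is just inference time divided by audio duration (plain Python, no piper dependency needed):

```python
def rtf(infer_sec, audio_sec):
    """Real-time factor = inference time / audio duration (lower is better)."""
    return infer_sec / audio_sec

# Numbers straight from the piper log above:
r = rtf(13.594672015999999, 61.39356009070295)
speedup = 1.0 / r  # inverse of RTF: how many times faster than realtime
```

So an RTF of ~0.22 corresponds to synthesizing audio roughly 4.5x faster than it takes to play it back.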
2 Likes

Guude 👋! I’m Thorsten, an enthusiast of open source voice technology, a user of Home Assistant, and owner of a Jetson AGX Xavier device. That being said, I’m really excited to see where this journey is going 😊. Not sure if I can be helpful, but I’m definitely interested.

2 Likes

Awesome @thorsten-voice, welcome! At the very least sounds like you would make a great beta tester when we have something to try! 😄

@michael_gruner that is good to know it is fast enough even on CPU (presumably on the Nano CPU too). While we obviously want it running well on CUDA (ideally TensorRT), if there are spare CPU cores unused by the application (mine typically don’t use much CPU, really), then for Nano in particular it could be beneficial to optionally run TTS on a CPU core instead, leaving the GPU dedicated to the LLM. That presumes Piper on CPU isn’t already multithreaded through onnxruntime and consuming 100% CPU in those benchmarks.

Hi @dusty_nv,

Here is my previous investigation of the inference results I obtained for Piper ONNX on the Jetson AGX Orin. Unfortunately, I was not able to convert the model to the TensorRT format. @hansen.mike may have a better understanding of this.

Best regards,
Shakhizat

Thanks @shahizat, looks good even without TensorRT …TIL about onnxruntime’s cudnn_conv_algo_search setting!

I can prioritize this PR if necessary, but I had put it on hold for exactly the reason you mentioned (re-exporting all of the Piper models) 😄 Additionally, I wanted to alter the ONNX output to include the phoneme timings so people could synchronize animated lips with the TTS output.

As @michael_gruner and @shahizat have shown, Piper can run quite fast on the CPU alone. There is also already a raw output mode that will stream audio out as each individual sentence is synthesized. So it may be possible already to achieve a reasonable level of “interruptibility” and real-time response because (1) only the very first sentence needs to finish synthesis before the user will hear it, and (2) the subsequent sentences will almost surely be ready by the time the first sentence is finished speaking.
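That sentence-level streaming idea can be sketched in a few lines; `synthesize` here is a stand-in callable, not piper's actual API, and the sentence splitter is deliberately naive:

```python
import re

def stream_sentences(text, synthesize):
    """Yield synthesized audio one sentence at a time, so playback can start
    as soon as the first sentence finishes (like piper's raw output mode)."""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sentence:
            yield synthesize(sentence)

# Stand-in synthesizer: pretend each character is one audio sample
chunks = list(stream_sentences("Hello there. How are you?", lambda s: len(s)))
```

Because the generator yields as each sentence completes, the consumer can begin playback (or interrupt) without waiting for the whole paragraph.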

OK, yes @hansen.mike - agreed about the raw output mode and streaming by interleaving the generation at the sentence level (we can try that first). In my own stuff, I also buffer the TTS by punctuation as it comes in from the LLM, because TTS always sounds more natural when given complete phrases (ideally sentences).

It can also accumulate the chunks until it determines there will be an audio gap-out (based on the ongoing RTFX and the audio duration produced thus far), so that it gets the first audio back as quickly as possible (from just the first sentence), and thereafter generates the rest of the paragraph/etc., which also improves the voice flow. But in this ‘raw output mode’ I would probably disable that and stick to sentence-by-sentence, since it’s not actually streaming, and doing multiple sentences may gap out.
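The punctuation-buffering trick looks roughly like this; a simplified sketch, since real LLM token streams and phrase-boundary rules are messier:

```python
def buffer_by_punctuation(tokens, boundaries=".!?"):
    """Accumulate streamed LLM tokens into complete phrases, emitting to TTS
    only at punctuation boundaries so the speech sounds natural."""
    buf = ""
    for tok in tokens:
        buf += tok
        if buf and buf[-1] in boundaries:
            yield buf.strip()
            buf = ""
    if buf.strip():  # flush any trailing partial phrase at end of stream
        yield buf.strip()

# Tokens arriving incrementally from a hypothetical LLM stream
phrases = list(buffer_by_punctuation(["Hi", " there", ".", " All", " good", "?"]))
```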

I’m currently working on some things with TRT-LLM and an ollama container someone submitted a PR for, but after that will add Piper to jetson-containers. It will automatically be built on top of the desired CUDA and onnxruntime. Then on top of that can go the Wyoming version of the container.

A bunch more numbers, for completeness:

  • Platform: AGX Orin 64GB
  • Container: dustynv/onnxruntime
  • CPU usage measured per core (100%==1 core, 200%==2 cores, etc…)
  • RTF = Real time factor = inference_time/audio_time (lower is better)
  • RTF presented for different audio generation lengths
  • Test:
cat etc/test_sentences/en.txt | ./install/piper -m en_US-lessac-high.onnx --output-raw --use-cuda > /dev/null

CPU ExecutionProvider

Model                     10 sec (RTF)  1 min (RTF)  10 min (RTF)  GPU (%)  CPU (%)
en_US-lessac-high.onnx    0.215         0.215        0.218         0        1100
en_US-lessac-medium.onnx  0.049         0.047        0.047         0        1000
en_US-lessac-low.onnx     0.041         0.039        0.037         0        1000

CUDA Execution Provider

Model                     10 sec (RTF)  1 min (RTF)  10 min (RTF)  GPU (%)  CPU (%)
en_US-lessac-high.onnx    0.127         0.0415       0.0257        100      50
en_US-lessac-medium.onnx  0.119         0.029        0.013         70       77
en_US-lessac-low.onnx     0.113         0.029        0.011         45       80

TensorRT ExecutionProvider

  • Engine is painfully slow to build, still figuring out why
  • Not using DLA, unfortunately

Model                     10 sec (RTF)  1 min (RTF)  10 min (RTF)  GPU (%)  CPU (%)
en_US-lessac-high.onnx    0.015  0.013  100 (partial results)
en_US-lessac-medium.onnx
en_US-lessac-low.onnx

Well done, @michael_gruner!!! Very informative results! I wonder if it’s possible to directly convert the model to TensorRT format using the trtexec tool, without the TensorRT ExecutionProvider?

@dusty_nv As you may have noticed, I created a new PR with some fixes for the homeassistant-core container: fix(homeassistant-core): Fix Dockerfile, add S6, missing deps, remove dev deps by ms1design · Pull Request #467 · dusty-nv/jetson-containers · GitHub

Unfortunately it’s crippled in the same way as the previous experimental version – it fails on user onboarding due to an HA codebase error: New Install: Onboarding failed - KeyError: 'component.onboarding.area.living_room' - Installation - Home Assistant Community. @hansen.mike could you or someone from HA take a look at this please?

Thanks @narandill, merged that! 👍

piper / onnxruntime

Thanks for the benchmarks @michael_gruner, not sure why CUDA is slower than CPU for the 10-sec benchmark, unless that was just the GPU warming up. Regardless, I refactored the onnxruntime container in commit 542a0644 to build against any CUDA, and will add the piper container on top of it this weekend.

ollama

@remy415 has contributed a working ollama container in this PR, which I merged and tested, and have uploaded images to DockerHub for.

Hopefully this also aids in our Home Assistant integration! Thank you to the ollama maintainers! 🥳

1 Like

@dusty_nv here’s a WIP PR with the Dockerfile for piper. It’s working, I just need to add the README, test.sh, etc…

@shahizat thanks! The ONNX model contains operators that are not available in stock TensorRT. In order to convert it to an engine, we would need to implement those operators in the form of a plug-in and feed them to the deployment tool. However, the model is so fast right now that I wonder if it’s even worth the effort.

Looking good @michael_gruner, thank you 🙏

Will try building this later - on my end, the issue I see is that in the deployed container I just install the onnxruntime-gpu wheel from our pip server, but those wheels don’t include the onnxruntime C++ API…

This is an issue with a couple of other containers with C++ APIs, like FAISS and TensorRT-LLM. PyTorch, on the other hand, conveniently includes its C++ API in its wheels.

I have two ideas on how to resolve this (in a way that scales to building containers for any CUDA combination/etc.):

  1. Create an additional pip wheel and setup.py for the extra files for each of these C++ packages. The pros: it reuses the same pip server infrastructure. The downsides: the creation/maintenance of these setup.py files, the kludgery of using package_data in setup.py, and the fact that these files would still be installed by pip under /usr/local/lib/python3.XX/dist-packages/ and would need extra steps in the Dockerfile to move/link them to the expected locations (i.e. under /usr/local/include and /usr/local/lib).

  2. Run another HTTP server alongside our pip server for storing release tarballs, and have the builder Dockerfiles automatically upload them to this server. The deploy Dockerfile will download/extract the correct one (per your desired CUDA version). The downsides: running another server instance (not such a big deal), and keeping the package version names consistent with the pip wheels (which pip normally handles).

I wish there were a third option to just run an apt HTTP server for Debian packages; however, seeing as none of the above projects already include support for building Debian packages, I don’t really want to get into that (just as I don’t really want to build custom wheels). So I will probably start by giving option #2 a crack.
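For option #2, the version-consistency concern mostly comes down to agreeing on a deterministic tarball naming scheme that mirrors the pip wheel versions. A hypothetical sketch (the naming pattern here is made up for illustration, not a decided convention):

```python
def tarball_name(package, version, cuda):
    """Derive a release tarball filename from the same (package, version, CUDA)
    tuple the pip wheels use, so builder uploads and deploy downloads agree."""
    return f"{package}-{version}-cu{cuda.replace('.', '')}-aarch64.tar.gz"

# Builder and deploy sides both compute the same name independently
name = tarball_name("onnxruntime", "1.17.0", "12.2")
```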

We could also grab stuff from other images in the Dockerfile. Something like:


COPY --from=onnxruntime:r35.4.1 /usr/local/lib/libonnxruntime.so /usr/local/lib/

Or we could have a PPA hosting Deb metapackages that distribute everything, including the wheel. Similar to how TensorRT or CUDA does it.