This thread is for discussion around integrating Home Assistant with Jetson and the optimized models and agents under development at Jetson AI Lab. Given the scope and complexity of Home Assistant, this will be a long-term, multi-phase project following this initial approach:
Rebuild the Rhasspy Wyoming protocol containers for ARM64+CUDA with support for GPU-optimized versions of the most popular models there (including Piper TTS, FasterWhisper/Whisper.cpp, and openWakeWord). This will enable HomeAssistant users to host these Wyoming services locally on their Jetson with lower latencies. These modified containers should end up in the jetson-containers CI/CD build system so they are automatically rebuilt against CUDA and distributed on DockerHub.
Add Wyoming wrappers/containers for Riva ASR/TTS to further enhance the performance.
Provide optimized LLM backends through the HomeAssistant Conversation Agent interface.
The HomeAssistant maintainers are currently scoping/defining additional AI tasks to expose interfaces for, including multimodal (image descriptions/queries). Implement support for these when they become available.
For higher-level agents that implement flows defined outside of the system (for example, running on external platforms like robots), utilize the HomeAssistant REST API and sensor interfaces like mic_external/snd_external to hook into the system externally.
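For reference, here is a minimal sketch of what hooking in externally over the REST API can look like. The /api/conversation/process and /api/services endpoints are part of Home Assistant's documented REST API; the host address, token, and entity_id are placeholders.

```python
# Minimal sketch of an external agent hooking into Home Assistant over its REST API.
# The endpoints are from the documented Home Assistant REST API; host, token, and
# entity_id below are placeholders.
import requests

HA_URL = "http://homeassistant.local:8123"                 # placeholder address
HEADERS = {
    "Authorization": "Bearer <LONG_LIVED_ACCESS_TOKEN>",   # created under your HA user profile
    "Content-Type": "application/json",
}

# Send free-form text to the conversation agent (intent recognition + handling)
resp = requests.post(
    f"{HA_URL}/api/conversation/process",
    headers=HEADERS,
    json={"text": "turn on the living room lights", "language": "en"},
)
print(resp.json())

# Or call a service directly, e.g. from a robot that already knows what it wants to do
requests.post(
    f"{HA_URL}/api/services/light/turn_on",
    headers=HEADERS,
    json={"entity_id": "light.living_room"},
)
```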
In parallel, engineers from Seeed Studio will be investigating deploying the core HomeAssistant Supervisor to Jetson so the entire system can run onboard, self-contained. Thanks to Mieszko Syty for getting us started with the homeassistant-core container already!
This is an exciting project given the vast number of IoT devices that HomeAssistant supports, and a userbase that is already active in developing smart assistants. As the Jetson integration progresses, we will be able to leverage the ongoing work on model optimization and intelligent agents to bring these locally-hosted capabilities into people's homes everywhere.
Anyone who wants to participate, please feel welcome to jump in and join the efforts!
Hi, Iām Mike š the āvoice guyā at Nabu Casa (the company that funds Home Assistant development) and author of the Wyoming protocol mentioned above.
Iām happy to answer any questions related to Home Assistant and A.I./voice š
A bit of background that may help with understanding: Home Assistantās voice stack is based on āpipelinesā, which do the typical steps of a voice assistant. Pipelines can have the following components:
Wake word detection
Speech to text
Intent recognition and handling
Text to speech
Each component can be swapped out, and there are multiple implementations in Home Assistant. One of those implementations uses a small protocol I developed called āWyomingā that is little more than JSON messages with an optional binary payload over TCP. This was designed with small satellite devices in mind like Espressifās ESP chips.
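To make that wire format concrete, here is a rough sketch of sending a single Wyoming event from Python: one JSON header line, followed by an optional binary payload. The "audio-chunk" event type and the rate/width/channels fields follow my reading of the wyoming repository, so double-check the exact names there.

```python
# A rough sketch of what one Wyoming event looks like on the wire: a JSON header line,
# then an optional binary payload. Event/field names are illustrative; verify them
# against the wyoming repository.
import json
import socket

def send_audio_chunk(sock: socket.socket, pcm: bytes, rate=16000, width=2, channels=1):
    header = {
        "type": "audio-chunk",
        "data": {"rate": rate, "width": width, "channels": channels},
        "payload_length": len(pcm),
    }
    # JSON header terminated by newline, then the raw PCM payload
    sock.sendall(json.dumps(header).encode("utf-8") + b"\n" + pcm)

# Usage: connect to a Wyoming service (e.g. a local STT server) and stream chunks
# sock = socket.create_connection(("127.0.0.1", 10300))
# send_audio_chunk(sock, pcm_bytes_from_mic)
```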
Adding Wyoming-compatible services to a Jetson box would allow Home Assistant users to immediately plug those services into their existing voice assistant pipelines. There is a community PR to add Docker builds with GPU acceleration for most of the existing Wyoming services.
Audio input and output typically happens on a voice satellite, which could be a robot. Satellites send audio data into a Home Assistant voice pipeline, and get back events at different stages of the pipeline. There is a websocket API for doing this, as well as an implementation based on Wyoming. In either case, the overall flow of information looks like this:
1. Satellite streams audio to Home Assistant
2. Home Assistant runs the pipeline
3. Each component of the pipeline receives data from and sends data to Home Assistant
4. Home Assistant sends events (and possibly audio) to the satellite
Note that for (1) the satellite can do its own local wake word detection or continuously stream audio to Home Assistant and have it done remotely.
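Here is a condensed sketch of that satellite flow in the spirit of the assist_pipeline websocket API. The message and field names reflect my reading of the Home Assistant developer docs and omit details (for example, the real API prefixes binary audio frames with a handler id taken from the run-start event), so treat this as pseudocode.

```python
# Pseudocode-level sketch of the satellite <-> Home Assistant flow described above,
# using the assist_pipeline websocket API. Names may lag the current API; verify
# against the Home Assistant developer docs before relying on them.
import asyncio, json
import websockets  # pip install websockets

async def run_pipeline(pcm_chunks, url="ws://homeassistant.local:8123/api/websocket",
                       token="<LONG_LIVED_ACCESS_TOKEN>"):
    async with websockets.connect(url) as ws:
        await ws.recv()                                           # "auth_required"
        await ws.send(json.dumps({"type": "auth", "access_token": token}))
        await ws.recv()                                           # "auth_ok"

        # (1) Ask Home Assistant to start a pipeline run from STT through TTS
        await ws.send(json.dumps({
            "id": 1, "type": "assist_pipeline/run",
            "start_stage": "stt", "end_stage": "tts",
            "input": {"sample_rate": 16000},
        }))

        # (2)/(3) Stream audio in; Home Assistant feeds it through the pipeline components
        # (the real API prefixes each binary frame with a handler id from run-start)
        for chunk in pcm_chunks:
            await ws.send(chunk)                                  # binary frames of PCM

        # (4) Events (stt result, intent, tts audio location, ...) come back to the satellite
        while True:
            msg = json.loads(await ws.recv())
            print(msg.get("event", msg))
            if msg.get("event", {}).get("type") == "run-end":
                break

# asyncio.run(run_pipeline(chunks_from_microphone()))
```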
Hope this helps, and Iām excited to see where this collaboration leads š¤
@dusty_nv Small nitpick: the project is titled āHome Assistantā not āHome-Assistant dot ioā and there is a new logo.
Hello everyone! š My name is Mieszko, and I'm a heavy user of Home Assistant alongside the Jetson AGX Orin.
Recently, I took the initiative to craft the inaugural Dockerfile for running homeassistant-core on Jetson devices. Although itās still in its early stages and encountering a few hiccups, Iām committed to refining it. With more time dedicated to this project in the coming weeks, I anticipate smoother sailing ahead.
For those interested in joining the journey, whether itās testing the container on the edge or contributing to the MLOps aspect, your involvement is greatly appreciated. Please feel free to raise any issues or submit pull requests for this integration on our GitHub repository: Jetson Containers - homeassistant-core.
@hansen.mike one thing I meant to ask you yesterday - whatās the timeline to merge this PR for streaming Piper TTS? Looking forward to trying the PR, just curious about the plans (presumably this will also entail a re-export of the Piper models on HuggingFace Hub to include the separate encoder/decoder ONNX models)
Guude š (a Hessian "hello")! I'm Thorsten, an open source voice technology enthusiast, Home Assistant user, and owner of a Jetson AGX Xavier. That being said, I'm really excited to see where this journey is going š. Not sure if I can be helpful, but I'm definitely interested.
Awesome @thorsten-voice, welcome! At the very least sounds like you would make a great beta tester when we have something to try! š
@michael_gruner that is good to know it is fast enough even on CPU (presumably on the Nano CPU too). While we obviously want it running well with CUDA (ideally TensorRT), if there are spare CPU cores unused by the application (mine typically don't use much CPU), then on Nano in particular it could be beneficial to optionally run TTS on a CPU core instead, leaving the GPU dedicated to the LLM. That presumes Piper on CPU isn't already multithreaded through onnxruntime and consuming 100% CPU in those benchmarks.
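For illustration, here is a minimal sketch of pinning an ONNX voice model like Piper's to the CPU with a small thread budget, using standard onnxruntime session options (the model filename is a placeholder):

```python
# Sketch of constraining Piper's onnxruntime session to the CPU with a fixed thread
# budget, so the GPU stays free for the LLM. Session options are standard onnxruntime
# APIs; the model path is a placeholder.
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 2           # cap Piper at a couple of spare cores
opts.inter_op_num_threads = 1

session = ort.InferenceSession(
    "en_US-lessac-medium.onnx",         # placeholder Piper voice model
    sess_options=opts,
    providers=["CPUExecutionProvider"]  # CPU only; swap in CUDAExecutionProvider to compare
)
```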
Here are the inference results from my previous investigation of Piper ONNX on the Jetson AGX Orin. Unfortunately, I was not able to convert the model to the TensorRT format. @hansen.mike may have a better understanding of this.
I can prioritize this PR if necessary, but I had put it on hold for exactly the reason you mentioned (re-exporting all of the Piper models) š Additionally, I wanted to alter the ONNX output to include the phoneme timings so people could synchronize animated lips with the TTS output.
As @michael_gruner and @shahizat have shown, Piper can run quite fast on the CPU alone. There is also already a raw output mode that will stream audio out as each individual sentence is synthesized. So it may be possible already to achieve a reasonable level of āinterruptibilityā and real-time response because (1) only the very first sentence needs to finish synthesis before the user will hear it, and (2) the subsequent sentences will almost surely be ready by the time the first sentence is finished speaking.
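To illustrate that raw output mode, here is a rough sketch of streaming sentences through the Piper CLI and playing raw PCM as each one completes. The --output-raw flag and the 22.05 kHz sample rate follow the Piper README, but verify them against your installed Piper build; aplay is assumed to be available.

```python
# Rough sketch of sentence-by-sentence streaming with Piper's raw output mode:
# write sentences to Piper's stdin and play raw PCM from stdout as each finishes.
# Flag name and sample rate follow the Piper README; verify against your build.
import subprocess

piper = subprocess.Popen(
    ["piper", "--model", "en_US-lessac-medium.onnx", "--output-raw"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE,
)
aplay = subprocess.Popen(
    ["aplay", "-r", "22050", "-f", "S16_LE", "-t", "raw", "-"],
    stdin=piper.stdout,
)

for sentence in ["The first sentence starts playing right away.",
                 "The rest are synthesized while it speaks."]:
    piper.stdin.write((sentence + "\n").encode("utf-8"))
    piper.stdin.flush()

piper.stdin.close()
aplay.wait()
```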
OK, yes @hansen.mike - agreed about the raw output mode and streaming by interleaving the generation at the sentence level (we can try that first). In my own stuff, I also buffer the TTS by punctuation as it comes in from the LLM, because TTS always sounds more natural when given complete phrases (ideally sentences).
My implementation can also accumulate the chunks until it determines there will be an audio gap-out (based on the ongoing RTFX and the audio duration produced thus far), so it gets the first audio back as quickly as possible (from just the first sentence) and then generates the rest of the paragraph/etc afterwards, which also improves the voice flow. But in this "raw output mode" I would probably disable that and stick to sentence-by-sentence, since it's not actually streaming and doing multiple sentences may gap out.
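For anyone wanting to try this, here is a small sketch of the punctuation-based buffering described above. speak() stands in for whatever TTS call you use (e.g. a Wyoming or Piper client), and a real implementation would also special-case abbreviations and decimal points.

```python
# Accumulate LLM tokens into complete sentences and hand each one to TTS as soon
# as it is closed out by punctuation. speak() is a placeholder for your TTS call.
SENTENCE_ENDINGS = (".", "!", "?", ":", "\n")

def stream_tts(token_stream, speak):
    buffer = ""
    for token in token_stream:                 # tokens arriving from the LLM
        buffer += token
        stripped = buffer.rstrip(" ")          # ignore trailing spaces when checking
        if stripped.endswith(SENTENCE_ENDINGS) and len(stripped.strip()) > 1:
            speak(stripped.strip())            # first sentence reaches the speaker quickly
            buffer = ""
    if buffer.strip():
        speak(buffer.strip())                  # flush whatever remains at end of generation
```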
I'm currently working on some things with TRT-LLM and an ollama container someone submitted a PR for, but after that I will add Piper to jetson-containers. It will automatically be built on top of the desired CUDA and onnxruntime versions. Then the Wyoming version of the container can go on top of that.
Well done, @michael_gruner!!! Very informative results! I wonder if it's possible to convert the model directly to TensorRT format using the trtexec tool, without the TensorRT ExecutionProvider?
Thanks for the benchmarks @michael_gruner, not sure why CUDA is slower than CPU for the 10-sec benchmark, unless that was just the GPU warming up. Regardless, I refactored the onnxruntime container in commit 542a0644 to build against any CUDA, and will add the piper container on top of it this weekend.
ollama
@remy415 has contributed a working ollama container in this PR, which I merged & tested and have uploaded images to DockerHub for:
Hopefully this also aids in our HomeAssistant integration! Thank you to the ollama maintainers! š„³
@dusty_nv here's a WIP PR with the Dockerfile for piper. It's working; I just need to add the README, test.sh, etc...
@shahizat thanks! The ONNX model contains operators that are not available in stock TensorRT. In order to convert it to an engine, we would need to implement those operators as plugins and feed them to the deployment tool. However, the model is so fast right now that I wonder if it's even worth the effort.
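If anyone wants to gauge the scope of that effort, here is a quick sketch using the standard onnx Python package to list the operator types the graph uses (the model filename is a placeholder):

```python
# List the operator types in the Piper ONNX graph, to see which ones would need
# custom TensorRT plugins. Uses the standard onnx package; the filename is a placeholder.
from collections import Counter
import onnx

model = onnx.load("en_US-lessac-medium.onnx")   # placeholder Piper voice model
ops = Counter(node.op_type for node in model.graph.node)
for op, count in ops.most_common():
    print(f"{op}: {count}")
```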
Will try building this later. On my end, the issue I see is that in the deployed container I just install the onnxruntime-gpu wheel from our pip server, but those wheels don't include the onnxruntime C++ API...
This is an issue with a couple of other containers with C++ APIs, though, like FAISS and TensorRT-LLM. PyTorch, on the other hand, does conveniently include its C++ API in its wheels.
I have two ideas on how to resolve this (in a way that scales to building containers for any CUDA combination/etc):
1. Create an additional pip wheel and setup.py for the extra files of each of these C++ packages. The pro is that it would re-use the same pip server infrastructure. The downsides are the creation/maintenance of these setup.py files, the kludgery of using package_data in setup.py, and the fact that these files would still be installed by pip under /usr/local/lib/python3.XX/dist-packages/ and would need extra steps in the Dockerfile to move/link them to the expected locations (i.e. under /usr/local/include and /usr/local/lib).
2. Run another HTTP server alongside our pip server for storing release tarballs, and have the builder dockerfiles automatically upload them to this server. The deploy dockerfile would then download/extract the correct one (per your desired CUDA version). The downsides are running another server instance (not such a big deal) and keeping the package version names consistent with the pip wheels (which pip normally handles).
Wish there were a third option to just run an apt HTTP server for Debian packages; however, seeing as none of the above projects already includes support for building debians, I don't really want to get into that (just as I don't really want to build custom wheels). So I will probably start by giving option #2 a crack.
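For concreteness, here is a purely hypothetical sketch of the deploy-side step for option #2. The server URL, tarball naming scheme, and the CUDA_VERSION environment variable are all made up for illustration.

```python
# Hypothetical deploy-side step for option #2: fetch the release tarball matching the
# container's CUDA version from an internal artifact server and unpack it under
# /usr/local. Server URL and naming convention are invented for this sketch.
import os, tarfile, urllib.request

CUDA_VERSION = os.environ.get("CUDA_VERSION", "12.2")
PKG = f"onnxruntime-cpp-{CUDA_VERSION}.tar.gz"          # hypothetical naming convention
URL = f"http://tarball-server.local/releases/{PKG}"     # hypothetical internal server

urllib.request.urlretrieve(URL, f"/tmp/{PKG}")
with tarfile.open(f"/tmp/{PKG}") as tar:
    tar.extractall("/usr/local")                        # headers -> include/, libs -> lib/
```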