I would like to use the jetson container AudioCraft to run a server that offers an API. I want to do something similar to what I’ve already done with stable-diffusion (see here).
In the official tutorial I see that a Jupyter web server is started, but I don’t understand why it isn’t possible to have a front-end equivalent to what we have with stable-diffusion (MusicGen: see image).
Is it possible, with the current jetson container of AudioCraft, to enable a server with endpoints in a similar fashion to what is already possible with stable-diffusion?
(Optional) Is it possible to enable the MusicGen front-end in a similar fashion to the stable-diffusion front-end?
@esteban.gallardo just browsing through the AudioCraft code on github, I don’t see it supporting REST APIs. It has API documentation for Python here:
Note that the jetson-containers for stable-diffusion-webui and text-generation-webui just run those projects, and the projects themselves implement the REST APIs you are using. The REST API runs inside the container when you start those apps with the corresponding flags, but I didn’t add the code implementing it to those projects. If a project doesn’t implement one, you would need to expose it yourself (i.e. via a Python script that loads the model and uses flask/fastapi/etc. to serve your desired REST endpoints)
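For example, a minimal sketch of what such a script could look like, wrapping the standard audiocraft MusicGen API in a Flask endpoint (the model size, route, port, and temp path are just placeholders you would adapt):

```python
# minimal sketch: wrap audiocraft MusicGen in a Flask endpoint
# (run inside the audiocraft container, where torch + audiocraft are already installed)
from flask import Flask, request, send_file
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

app = Flask(__name__)

# load the model once at startup (placeholder model size)
model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=8)  # seconds of audio per request

@app.route('/generate', methods=['POST'])
def generate():
    prompt = request.json.get('prompt', '')
    wav = model.generate([prompt])                 # tensor of generated audio
    audio_write('/tmp/out', wav[0].cpu(), model.sample_rate, strategy='loudness')
    return send_file('/tmp/out.wav', mimetype='audio/wav')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```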
I tried to import the libraries to run it in my Flask application. Unfortunately, I have not been able to import any of them successfully. I’ve spent several days compiling repos, installing wheels, etc., trying to get the whole thing working, without any luck. The Torch libraries are a nightmare. So far I work with Torch without CUDA available, because it’s impossible to do it any other way. When I manage to install Torch with CUDA, nothing else is compatible with it: torchvision is not compatible, torchaudio is not compatible, nothing works. You can clone the repos and compile them; nothing works. I will wait until there is good integration of the torch libraries on the Jetson AGX Orin.
Try building your Flask application on top of the audiocraft container, which already has it installed and PyTorch working. PyTorch, torchvision, torchaudio, etc. do work on Jetson, you just need to have the CUDA-enabled versions installed (or build them yourself). The containers make sure the correct versions stay installed.
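A quick sanity check you can run inside the container to confirm that the CUDA-enabled builds are the ones actually being imported (this just uses the standard torch/torchvision/torchaudio version attributes, nothing container-specific):

```python
# quick check that the CUDA-enabled PyTorch stack is what gets imported
import torch
import torchvision
import torchaudio

print('torch       :', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('torchvision :', torchvision.__version__)
print('torchaudio  :', torchaudio.__version__)
if torch.cuda.is_available():
    print('device      :', torch.cuda.get_device_name(0))  # should report the Orin GPU
```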
Hi @esteban.gallardo, sorry for the delay - if you are still stuck on this, try running the voicecraft container instead (it is based on the audiocraft container, but also installs ffmpeg). You can still run the original audiocraft in it (or voicecraft).
Thanks for your support. Unfortunately, there are still missing libraries when trying to load and generate audio.
By any chance, is there any information about how to create a custom container? I’m a front-end developer (Unity3D), but I would like to try it myself: derive from the jetson PyTorch container and install audiocraft and voicecraft with CUDA support.
@esteban.gallardo what errors did you encounter, or which libraries are missing? I tried running them without issue; sorry that you are still having problems.
You can follow any Docker tutorial to create your own Dockerfile and build it, or if you want to utilize the packages already supported in jetson-containers, see here:
Also, regarding Riva, I’ve done all the steps of this tutorial and it doesn’t work.
Right now I’m working with Coqui-ai TTS without CUDA support (it takes ages to do anything), because it’s impossible to install PyTorch with CUDA support.
It would be great to have at least one working option for Text-To-Speech for the Jetson AGX Orin.
On the other hand, even though Text-To-Speech has more priority for me, I will also need the possibility to generate Text-To-Audio in the future (to create FX sounds). I haven’t seen any web front-end with an API for VoiceGen or MusicGen, so it would be nice to be able to install Python Flask so that programmers have the flexibility to implement endpoint services.
@esteban.gallardo what I have mainly been using for text-to-speech is Piper TTS:
It is lightweight, optimized with onnxruntime+CUDA, and sounds decent with the high-quality version of their models (I typically use en_US-libritts-high and pick one of the many voices that sound good)
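If it helps, a minimal sketch of driving Piper from Python by shelling out to the piper CLI; the model filename and the --model / --output_file flags follow the piper README, but treat them as assumptions and check against your installed version:

```python
# minimal sketch: call the piper CLI from Python to synthesize a WAV file
# assumes `piper` is on PATH and an en_US-libritts-high .onnx model has been downloaded
import subprocess

def piper_tts(text: str, out_path: str = 'speech.wav',
              model: str = 'en_US-libritts-high.onnx') -> str:
    # piper reads the text on stdin and writes the audio to --output_file
    subprocess.run(
        ['piper', '--model', model, '--output_file', out_path],
        input=text.encode('utf-8'),
        check=True,
    )
    return out_path

piper_tts('Welcome to the world of speech synthesis!')
```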
Riva streaming TTS has a known issue right now with the timeouts - it should be resolved in the next release. I also have a container for Coqui xTTS, but even with TensorRT optimizations applied, it is still sub-realtime on AGX Orin. Also, unfortunately, I believe they have ceased development.
In NanoLLM and Agent Studio, I support plugins for both Riva TTS and Piper TTS. Riva does still sound better, I believe, but takes more memory, so on the smaller Jetsons (like Orin Nano) I will use Piper.
Yes, Piper works and it’s extremely fast, but it doesn’t give me the quality I need. It’s as bad as Google’s Text-To-Speech. My project is about creating audiobooks, and Piper is useless for that. Coqui-AI gives me the quality I need, but it can take up to 3 minutes to synthesize a 3-sentence paragraph, and I can have much longer paragraphs.
Right now I have plenty to program on the front-end side, but I suppose that next week I’ll try to build a container so I can use VoiceCraft with CUDA support. Since I’m no expert in Docker, I assume I will spend one or two weeks until I have enough knowledge to do it. I would prefer to keep working on the front-end, but I really need decent AI text-to-speech generation.
On AGX Orin, this gets a realtime factor between ~0.92 and 1.0 in streaming mode, using TensorRT in my fork at github.com/dusty-nv/TTS
It sounds good (and the voice cloning feature works and is cool), but it is still too slow for my uses. You can slow the voice rate down a smidge to match that without noticing it too badly (normally I actually speed it up a bit, due to the verbose bot output). Regardless, I need TTS much faster than realtime so it doesn’t consume the whole GPU, allowing it to run in the background alongside other models.
Riva TTS still works in offline mode (meaning the entire generated audio is returned at once, instead of streamed in chunks). And since it is fast, you can still approximate streaming by making offline requests for each sentence, or a few sentences, at a time.
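A rough sketch of that pattern, with the actual TTS call left as a placeholder for whatever offline synthesis request you use; the synthesize_offline helper and the naive sentence split are purely illustrative:

```python
# rough sketch: approximate streaming TTS by synthesizing one sentence at a time
# `synthesize_offline` is a placeholder for your offline TTS call (e.g. a Riva request)
import re

def synthesize_offline(sentence: str) -> bytes:
    # placeholder: replace with your offline TTS request, returning WAV/PCM bytes
    return b''

def pseudo_streaming_tts(text: str):
    # naive sentence split; keeps each offline request short so latency stays low
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    for sentence in sentences:
        audio = synthesize_offline(sentence)   # one short offline request per sentence
        yield audio                            # play or queue each chunk as it arrives

for chunk in pseudo_streaming_tts('First sentence. Second sentence. Third one!'):
    pass  # send `chunk` to your audio output / client
```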
The issues you mention about getting these packages to use CUDA (and not uninstall each other, etc.) are why I use the containers.