Enabling API for jetson container AudioCraft

Hello,

I would like to use the jetson container AudioCraft in order to run a server that can offer an API. I want to do something similar to what I’ve already done with stable-diffusion (see here).

In the official tutorial I see that it starts a Jupyter web server, but I don’t understand why it isn’t possible to have a front-end equivalent to what we have with stable-diffusion (MusicGen: see image).

So, my 2 questions:

  1. Is it possible, with the current AudioCraft jetson container, to enable a server with endpoints in a similar fashion to what is already possible with stable-diffusion?
  2. (Optional) Is it possible to enable the MusicGen front-end in a similar fashion to the stable-diffusion front-end?

Any help would be greatly appreciated.

@esteban.gallardo just browsing through the AudioCraft code on github, I don’t see it supporting REST APIs. It has API documentation for Python here:

Note that the jetson-containers for stable-diffusion-webui and text-generation-webui just run those projects, and the projects themselves implement the REST APIs you are using. It runs inside the container when you start those apps with the corresponding flags, but I didn’t add code to those projects implementing the REST APIs. If a project doesn’t implement one, you would need to expose it yourself (e.g. via a Python script that loads the model and uses flask/fastapi/etc. to serve your desired REST endpoints)
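
For example, a minimal sketch of that approach, assuming the MusicGen Python API shown in the AudioCraft README (the route, port, and model size here are just placeholders, not something the container sets up for you):

from flask import Flask, request, send_file
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

app = Flask(__name__)

# Load the model once at start-up, not per-request
model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=8)  # seconds of audio to generate

@app.route("/generate", methods=["POST"])
def generate():
    prompt = request.json["prompt"]
    wav = model.generate([prompt])  # batch of one text description
    # audio_write() appends the .wav suffix itself
    audio_write('output', wav[0].cpu(), model.sample_rate, strategy="loudness")
    return send_file('output.wav', mimetype='audio/wav')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)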

I tried to import the libraries to run it in my Flask application. Unfortunately, I haven’t been able to import any of them successfully. I’ve spent several days trying to compile repos, installing wheels, etc., to get the whole thing working, without any luck. Torch libraries are a nightmare. So far I work with Torch without CUDA available because it’s impossible to do it any other way. When I manage to install Torch with CUDA, nothing else is compatible with it: torchvision isn’t compatible, torchaudio isn’t compatible, nothing works. You can clone the repos and compile them; nothing works. I will wait until there is good integration of the torch libraries on the Jetson AGX Orin.

Try building your flask application on top of the audiocraft container, which already has audiocraft installed and PyTorch working. PyTorch, torchvision, torchaudio, etc. do work on Jetson, you just need to have the CUDA-enabled versions installed (or build them yourself). The containers make sure the correct versions stay installed.
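
For example, a quick sanity check inside the container shows whether the CUDA-enabled builds are the ones actually installed:

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python3 -c "import torchaudio; print(torchaudio.__version__)"

If torch.cuda.is_available() prints False, a CPU-only wheel has been pulled in over the container’s version.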

2 Likes

Inside the container it’s possible to run the web front-end I previously mentioned with:

python3 demos/musicgen_app.py --share --listen 0.0.0.0

The issue is that, in order to complete a sound creation request, it needs the ffmpeg package.

I’m not experienced with Docker, so everything I’ve tried to include that package has failed.

On the other side, I’m trying to create an endpoint in the container, but I’m not able to install Flask in order to create the service.

Hi @esteban.gallardo, sorry for the delay - if you are still stuck on this, try running the voicecraft container instead (it is based on the audiocraft container, but also installs ffmpeg). You can still run the original audiocraft in it (or voicecraft).

Yes, I’m stuck. I have a lot of stuff to program and a tight deadline, and having to face Docker problems is draining a lot of time.

VoiceCraft also didn’t work.

@esteban.gallardo just made these fixes and rebuilt the audiocraft/voicecraft containers in commit updated audiocraft and voicecraft · dusty-nv/jetson-containers@49df6bc · GitHub

  • added ffmpeg to audiocraft
  • tried to get openai-triton working, but it would not
  • added XFORMERS_FORCE_DISABLE_TRITON=1 to xformers
  • added python3 demos/musicgen_app.py --listen 0.0.0.0 to audiocraft start-up

Here are the updated container images you can pull:

dustynv/audiocraft:r36.3.0
dustynv/voicecraft:r36.3.0
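
For example, to pull and start the audiocraft one (these are the usual flags for running jetson-containers images; adjust the tag to match your JetPack/L4T version):

docker pull dustynv/audiocraft:r36.3.0
docker run --runtime nvidia -it --rm --network=host dustynv/audiocraft:r36.3.0
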
1 Like

Thanks for your support. Unfortunately, there are still missing libraries when trying to load and generate audio.

By any chance, is there any information about how to create a custom container? I’m a front-end developer (Unity3D), but I would like to try by myself to derive from the jetson PyTorch container, in order to install audiocraft and voicecraft with CUDA support.

@esteban.gallardo what errors did you encounter, or which libraries are missing? I tried running them without issue; sorry that you still have problems.

You can follow any docker tutorial to create your own Dockerfile and build it, or, if you want to utilize the packages already supported in jetson-containers, see here:
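
For example, a minimal Dockerfile for what you describe might look like this (a sketch only: it derives from the audiocraft image above and layers Flask, ffmpeg, and a hypothetical server.py on top):

FROM dustynv/audiocraft:r36.3.0

# add the extra pieces the Flask service needs on top of the working PyTorch stack
RUN apt-get update && apt-get install -y ffmpeg && rm -rf /var/lib/apt/lists/*
RUN pip3 install flask pydub

COPY server.py /opt/server.py
CMD ["python3", "/opt/server.py"]

Build it with docker build -t my-audiocraft-api . and run it with the same --runtime nvidia flags as the images above.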

Thanks for the reply.

I need to be able to upload voice tracks to use the Text-To-Speech for my project. In the next video I show, step by step, how the system fails:

Video steps to reproduce issues

If you want to have voice audio tracks to test it you can get it from here (voice tracks).

There were other options for generating audio tracks, like AllTalk, which is integrated with text-generation-webui, but I’m facing a similar issue with that container too, as I have explained here.

Also, regarding Riva, I’ve followed all the steps of this tutorial and it doesn’t work.

Right now I’m working with Coqui-ai TTS without CUDA support (it takes ages to do anything), because it’s impossible to install PyTorch with CUDA support.

It would be great to have at least one working option for Text-To-Speech for the Jetson AGX Orin.

On the other hand, even though Text-To-Speech has more priority for me, I will also need in the future the possibility to generate Text-To-Audio (to create FX sounds). I haven’t seen any web front-end with an API for VoiceGen or MusicGen, so it would be nice to be able to install Flask so that programmers have the flexibility to implement endpoint services.

@esteban.gallardo what I have mainly been using for text-to-speech is Piper TTS:

It is lightweight, optimized with onnxruntime+CUDA, and sounds decent with the high-quality version of their models (I typically use en_US-libritts-high and pick one of the many voices that sound good)

Riva streaming TTS has a known issue right now with timeouts - it should be resolved in the next release. I also have a container for Coqui XTTS, but even with TensorRT optimizations applied, it is still sub-realtime on AGX Orin. Also, unfortunately, I believe they have ceased development.

In NanoLLM and Agent Studio, I support plugins for both Riva TTS and Piper TTS. I believe Riva does still sound better, but it takes more memory, so the smaller Jetsons (like Orin Nano) will use Piper.

Yes, Piper works and it’s extremely fast, but it doesn’t give me the quality I need. It’s as bad as Google’s Text-To-Speech. My project is about creating audiobooks, and Piper is useless for that. Coqui-AI gives me the quality I need, but it can take up to 3 minutes to synthesize a 3-sentence paragraph, and I can have far longer paragraphs.

Right now I have plenty to program on the front-end side, but I suppose that next week I’ll try to build a container to be able to use VoiceCraft with CUDA support. Since I’m no expert in Docker, I assume I will spend one or two weeks until I have enough knowledge to make it work. I would prefer to keep working on the front-end, but I really need decent AI text-to-speech generation.

OK, I built the standalone XTTS container again (with CUDA):

On AGX Orin, this gets a realtime factor between ~0.92 and 1.0 in streaming mode, using TensorRT in my fork from github.com/dusty-nv/TTS.

It sounds good (and the voice cloning feature works and is cool), but it is still too slow for my uses. You can slow the voice rate down a smidge to match that without it being too noticeable (normally I actually speed it up a bit, due to the verbose bot output). Regardless, I need TTS much faster than realtime so it doesn’t consume the whole GPU, allowing it to run in the background alongside other models.

The Riva TTS still works in offline mode (meaning the entirety of the generated audio is returned at once, instead of streamed in chunks). And since it is fast, you can still approximate streaming by making offline requests for each sentence, or a few sentences, at a time.
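
A rough sketch of that pattern (synthesize_sentence() is a stand-in for whatever offline TTS request you use, e.g. the Riva client’s synthesize call):

import re

def synthesize_sentence(sentence):
    # Stand-in: replace with a real offline TTS request that
    # returns the complete audio for this one sentence
    return b""

def pseudo_stream(text):
    # Split on sentence boundaries and synthesize chunk by chunk,
    # so playback of sentence N can overlap synthesis of sentence N+1
    for sentence in re.split(r'(?<=[.!?])\s+', text.strip()):
        if sentence:
            yield synthesize_sentence(sentence)

for audio_chunk in pseudo_stream("First sentence. Second one. A third!"):
    pass  # queue audio_chunk for playback as it arrives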

The issues you mention about getting these packages to use CUDA (and not uninstall each other, etc.) are why I use the containers.

1 Like

Thanks a lot! I have been able to generate XTTS with CUDA.

For any future programmers, this is the code that works with the container dustynv/xtts:r36.3.0:

# ++ INSTALL LIBRARIES ++
# apt update
# apt install ffmpeg
# pip3 install pydub

from flask import Flask, request, jsonify
import hashlib
import os

import torch
from TTS.api import TTS
from pydub import AudioSegment

app = Flask(__name__)
app.config['wav_voices'] = '/home/wav_voices/en'

# Use CUDA when the container exposes it, otherwise fall back to CPU
deviceTTS = "cuda" if torch.cuda.is_available() else "cpu"
print("TTS MODE[" + deviceTTS + "]")
print(TTS().list_models())

# Load XTTS v2 once at start-up; reloading per request would be far too slow
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(deviceTTS)

def get_unique_id(text, length=9):
    # Derive a stable numeric id from the text, used to name temp files
    hash_object = hashlib.sha256(text.encode())
    hash_int = int(hash_object.hexdigest(), 16)
    return hash_int % (10 ** length)

@app.route("/ai/speech", methods=["POST"])
def speech_generation() -> bytes:
    prompt = request.json
    voice = prompt["voice"]
    speech = prompt["speech"]
    language = prompt["language"]
    emotion = prompt["emotion"]
    speed = prompt["speed"]

    # Voices are uploaded as .ogg; XTTS needs a .wav reference,
    # so convert once and keep the .wav around as a cache
    path_to_voice = os.path.join(app.config['wav_voices'], voice)
    path_to_voice_ogg = path_to_voice + ".ogg"
    path_to_voice_wav = path_to_voice + ".wav"
    if not os.path.exists(path_to_voice_wav):
        sound_data = AudioSegment.from_ogg(path_to_voice_ogg)
        sound_data.export(path_to_voice_wav, format="wav")

    # Speech synthesis into a temporary wav file named after the text hash
    temp_wav_file = "temp" + str(get_unique_id(speech)) + ".wav"
    if len(emotion) > 0:
        tts.tts_to_file(text=speech, speaker_wav=[path_to_voice_wav], language=language,
                        emotion=emotion, speed=speed, file_path=temp_wav_file)
    else:
        tts.tts_to_file(text=speech, speaker_wav=[path_to_voice_wav], language=language,
                        speed=speed, file_path=temp_wav_file)

    # Re-encode to ogg and return the raw bytes as the response body
    dataaudio = AudioSegment.from_wav(temp_wav_file).export(format="ogg").read()
    os.remove(temp_wav_file)
    return dataaudio

@app.route("/ai/speech/voice", methods=["POST"])
def upload_speech_voice():
    voicename = request.form.get("voice")
    voicedata = request.files.get("file")

    # If the user does not select a file, the browser submits an empty file without a filename.
    if voicedata is None or voicedata.filename == '':
        return jsonify({"success": False, "error": "No selected file"})

    filename = voicename + ".ogg"
    voicedata.save(os.path.join(app.config['wav_voices'], filename))
    return jsonify({"success": True})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=6000, threaded=False)

The final piece for my project to work is being able to generate short sound FX. I hope we can make AudioCraft work for that.
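
From the AudioCraft README, short FX generation would go through the AudioGen model rather than MusicGen; a minimal sketch of what I’d want to wrap in an endpoint (assuming the same container setup as above):

from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=3)  # short FX clips

descriptions = ['door slamming', 'footsteps on gravel']
wav = model.generate(descriptions)  # one clip per description
for i, one_wav in enumerate(wav):
    audio_write(f'fx_{i}', one_wav.cpu(), model.sample_rate, strategy="loudness")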

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.