Hey guys, I thought I'd start a new thread instead of continuing to post in “Support for Qwen3-TTS on DGX Spark (GB10) | torchaudio installation failure on ARM64”, because this has turned into more of a tutorial / playbook than a fix for the installation failure.
Playbook: Running faster-qwen3-tts as a Production Voice API on Your Local GB10 GPU
faster-qwen3-tts 2126MiB / 121.7GiB 1.75%
faster-qwen3-tts-customvoice 809.1MiB / 121.7GiB 0.65%
faster-qwen3-tts-voicedesign 984.9MiB / 121.7GiB 0.79%
Overview: This guide walks you through deploying faster-qwen3-tts as a persistent, multi-voice OpenAI-compatible REST API. By the end you’ll have up to three TTS endpoints (one for voice cloning, one for instruction-designed voices, and one for built-in speaker IDs) running in Docker on your local GPU and accessible to any OpenAI-compatible client (Open WebUI, llama-swap, your own app). You do not need all three: pick whichever of VoiceClone, VoiceDesign, and CustomVoice you actually plan to use.
You can find more details about Qwen3 TTS and the different Qwen3-TTS Models here.
What you’re building
┌────────────────────────────────────────────────────────┐
│         Client (Open WebUI / curl / your app)          │
│       POST /v1/audio/speech (OpenAI API format)        │
└────────┬───────────────┬──────────────────┬────────────┘
         │               │                  │
     port 8020       port 8021          port 8022
         │               │                  │
  ┌──────▼──────┐ ┌──────▼────────┐ ┌───────▼────────┐
  │ VoiceClone  │ │VoiceDesign    │ │ CustomVoice    │
  │ (*-Base)    │ │(*-VoiceDesign)│ │(*-CustomVoice) │
  │ref audio+   │ │text instruct  │ │ speaker name   │
  │JSON config  │ │→ voice        │ │ (built-in)     │
  └─────────────┘ └───────────────┘ └────────────────┘
         └───────────────┴──────────────────┘
          All powered by CUDA-graph inference
          (2–9× faster than vanilla PyTorch)
Prerequisites
- NVIDIA GPU with CUDA (any modern consumer or datacenter card)
- Docker + NVIDIA Container Toolkit
- The faster-qwen3-tts Docker image (`faster-qwen3-tts-dgx-spark:v4`) built from the repo
- Model weights downloaded from Hugging Face (pick one or all three):
  - `Qwen/Qwen3-TTS-12Hz-1.7B-Base`: voice cloning
  - `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign`: instruction-based voices
  - `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`: built-in speakers
Step 1 — Clone the repo and build the image
git clone https://github.com/andimarafioti/faster-qwen3-tts
cd faster-qwen3-tts
docker build -t faster-qwen3-tts-dgx-spark:v4 -f dockerfile .
Step 2 — Set up your voice config files
VoiceDesign (config/voicedesign_voices.json) — describe voices in plain English:
{
"narrator": {
"instruct": "Warm, confident narrator with a slight British accent",
"language": "English"
},
"assistant_de": {
"instruct": "Freundliche, klare Sprecherin, Hochdeutsch, professionell",
"language": "German"
}
}
CustomVoice (config/customvoice_voices.json) — pick from the model’s built-in speakers:
{
"Ryan": { "speaker": "Ryan", "language": "English" },
"Aiden": { "speaker": "Aiden", "language": "English" },
"Ono_Anna": { "speaker": "Ono_Anna", "language": "Japanese" },
"Sohee": { "speaker": "Sohee", "language": "Korean" }
}
For VoiceClone, prepare a WAV reference file and a matching transcription per voice.
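A typo in one of these JSON files is easy to miss until a request fails, so it can help to sanity-check them before starting the containers. The sketch below is a hypothetical helper, not part of faster-qwen3-tts. It assumes the keys shown in the two examples above (`instruct`/`language` for VoiceDesign, `speaker`/`language` for CustomVoice) are all required, which the server itself may not enforce.

```python
import json

# Assumption: the keys used in the example files above are treated as required.
REQUIRED_KEYS = {
    "voicedesign": {"instruct", "language"},
    "customvoice": {"speaker", "language"},
}


def validate_voices(voices: dict, kind: str) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not voices:
        problems.append("config defines no voices")
    for voice_id, entry in voices.items():
        if not isinstance(entry, dict):
            problems.append(f"{voice_id}: entry must be a JSON object")
            continue
        missing = REQUIRED_KEYS[kind] - entry.keys()
        if missing:
            problems.append(f"{voice_id}: missing {sorted(missing)}")
    return problems


def validate_file(path: str, kind: str) -> list:
    """Load a voice config file and validate it (json.load raises on bad JSON)."""
    with open(path, encoding="utf-8") as fh:
        return validate_voices(json.load(fh), kind)
```

For example, `validate_file("config/voicedesign_voices.json", "voicedesign")` should return an empty list for the VoiceDesign file shown above.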
Step 3 — Start the services
Copy config/docker-compose.yml and adjust the volume paths for your system, then:
cd config
docker compose up -d
Check that the services are healthy (query only the ports you actually started):
curl http://localhost:8020/health # VoiceClone
curl http://localhost:8021/health # VoiceDesign
curl http://localhost:8022/health # CustomVoice
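Because model loading takes a while, a single curl right after `docker compose up -d` may fail. Here is a minimal polling sketch in Python, assuming only that `/health` answers with HTTP 200 once the service is ready. The function names and the retry numbers are my own choices, not something the repo ships.

```python
import http.client
import time
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """True if GET <url> answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError/HTTPError and socket timeouts are all OSError subclasses.
        return False


def wait_until_healthy(url, attempts=30, delay=2.0, probe=is_healthy, sleep=time.sleep):
    """Poll the health endpoint until it answers or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Typical use would be `wait_until_healthy("http://localhost:8021/health")` before sending the first synthesis request.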
Step 4 — Generate your first audio
Standard request (uses voice config defaults):
curl http://localhost:8021/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"tts-1","input":"Welcome to the show.","voice":"narrator"}' \
--output speech.wav
Per-request override (change language or style on the fly):
curl http://localhost:8021/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"input": "Herzlich willkommen.",
"voice": "narrator",
"language": "German",
"instruct": "Speak slowly and warmly.",
"max_new_tokens": 1024
}' --output speech_de.wav
Per-request fields always win over the voice config entry. You can design one voice personality in JSON and let callers adjust language or tone without creating separate entries.
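The same two requests can be sent from Python with only the standard library. This is a rough sketch under the assumption that the endpoint behaves exactly as in the curl examples above; `build_payload` and `synthesize` are hypothetical helper names. Leaving an override out of the body is what lets the voice config default apply.

```python
import json
import urllib.request


def build_payload(text, voice, **overrides):
    """Build the JSON body for POST /v1/audio/speech.

    Only overrides that are actually set are sent, so the server-side
    voice config supplies every field the caller leaves out.
    """
    payload = {"model": "tts-1", "input": text, "voice": voice}
    payload.update({k: v for k, v in overrides.items() if v is not None})
    return payload


def synthesize(base_url, payload, out_path, timeout=120):
    """POST the payload and save the returned audio bytes to out_path."""
    req = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp, open(out_path, "wb") as fh:
        fh.write(resp.read())
```

For instance, `synthesize("http://localhost:8021", build_payload("Herzlich willkommen.", "narrator", language="German"), "speech_de.wav")` mirrors the override request above.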
Step 5 — Connect to Open WebUI or llama-swap
Point your TTS URL at http://your-host:8020 (or 8021/8022). The servers implement the full OpenAI /v1/audio/speech contract including the /v1/models voice list, so no special configuration is needed — just swap the URL.
Step 6 — Benchmark your setup
Verify CUDA graphs are active and measure real-world latency:
python config/benchmark_api.py --host localhost --port 8021 --runs 5
What the numbers mean:
- TTFA (time-to-first-audio): how long until the client can start playing. Target: under 400 ms on modern GPUs.
- RTF (real-time factor): `audio_duration / generation_time`. RTF > 1.0 = faster than real-time. With CUDA graphs expect 2–5× on most GPUs.
- Speed: the same thing expressed as “Nx real-time” for intuitive reading.
If TTFA is unexpectedly high (seconds, not milliseconds), the CUDA graph failed to capture. Check your Docker logs for a capture error and make sure --max-seq-len stays the same across restarts.
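To make the three figures concrete, here is the arithmetic as a small Python function. It is only an illustration of the definitions above, not the code inside `config/benchmark_api.py`, and the function name is made up.

```python
def tts_metrics(audio_seconds, generation_seconds, first_chunk_seconds):
    """Compute TTFA, RTF and the "Nx real-time" figure from raw timings."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    rtf = audio_seconds / generation_seconds  # > 1.0 means faster than real-time
    return {
        "ttfa_ms": first_chunk_seconds * 1000.0,
        "rtf": rtf,
        "speed": f"{rtf:.1f}x real-time",
    }
```

So 6 s of audio generated in 2 s, with the first chunk arriving after 0.25 s, gives a TTFA of 250 ms and an RTF of 3.0 ("3.0x real-time").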
API reference
POST /v1/audio/speech
| Field | Type | Default | Description |
|---|---|---|---|
| `model` | string | `"tts-1"` | Ignored (for OpenAI compat) |
| `input` | string | required | Text to synthesize |
| `voice` | string | first entry | Voice ID from your JSON config |
| `response_format` | string | `"wav"` | `wav`, `pcm`, or `mp3` |
| `language` | string | from voice config | Override language for this call |
| `instruct` | string | from voice config | Override voice instruction (VoiceDesign/CustomVoice) |
| `max_new_tokens` | int | `2048` | Max codec tokens to generate |
WAV and PCM are streamed in real time as the model generates. MP3 requires pydub and is returned non-streaming.
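To benefit from streaming on the client side, read the response in chunks instead of waiting for the whole body. The sketch below works with any object that has a `read(n)` method, such as the response returned by `urllib.request.urlopen`. The helper name and the chunk size are arbitrary choices of mine.

```python
import time


def save_stream(resp, out, chunk_size=4096, started_at=None):
    """Copy a streaming response to `out` chunk by chunk.

    Returns (bytes_written, seconds_until_first_chunk). Pass `started_at`
    (a time.perf_counter() value taken before sending the request) to get
    a client-side TTFA estimate; otherwise timing starts at the first read.
    """
    if started_at is None:
        started_at = time.perf_counter()
    first_chunk_after = None
    total = 0
    while True:
        chunk = resp.read(chunk_size)
        if not chunk:  # empty bytes means the server closed the stream
            break
        if first_chunk_after is None:
            first_chunk_after = time.perf_counter() - started_at
        out.write(chunk)
        total += len(chunk)
    return total, first_chunk_after
```

A player could start consuming `out` as soon as the first chunk is written. With `mp3` the whole file arrives at once, so the first-chunk time will be close to the total generation time.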
Tips
- Multiple GPUs: Each container gets `NVIDIA_VISIBLE_DEVICES=all`. If you want to pin a model to a specific GPU, set `NVIDIA_VISIBLE_DEVICES=0` (VoiceClone) and `NVIDIA_VISIBLE_DEVICES=1` (VoiceDesign) per service.
- Memory: The 1.7B models use ~6 GB VRAM in bfloat16. Three services on the same card is tight; consider the 0.6B variants for multi-service setups.
- Cold start: The first request after container start triggers CUDA graph capture (~30 s). Subsequent requests run at full CUDA-graph speed.
- Long text: The default `--max-seq-len 2048` handles most sentences. For long-form narration, raise it to 4096 (more VRAM required).
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| `503 Model not loaded` | Server still loading / CUDA graph capture in progress | Wait 30–60 s after container start |
| `404 Voice not found` | Voice ID not in JSON config | Check spelling; call `/speakers` to list valid IDs |
| Very high TTFA (>5 s) | CUDA graph capture failed, running in fallback mode | Check container logs for capture error; reduce `--max-seq-len` |
| MP3 output error | `pydub` not installed in container | Use `wav` or `pcm` format, or rebuild image with `pydub` |
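A client can handle the first two rows automatically: wait and retry on 503, but fail fast on 404. The following is a minimal sketch assuming the request is made with `urllib`, which raises `urllib.error.HTTPError` for both status codes. `call_with_retry` is a hypothetical helper, and the default of 6 retries at 10 s is simply chosen to cover the 30–60 s window from the table.

```python
import time
import urllib.error


def call_with_retry(send, retries=6, delay=10.0, sleep=time.sleep):
    """Call send() and retry while the server answers 503 (model still loading)."""
    for attempt in range(retries + 1):
        try:
            return send()
        except urllib.error.HTTPError as err:
            # Only 503 is worth waiting for; a 404 (unknown voice) will not
            # fix itself, so it is re-raised immediately.
            if err.code != 503 or attempt == retries:
                raise
            sleep(delay)
```

`send` is any zero-argument callable that performs the request, for example a `lambda` wrapping your own request function.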