I spent a few days figuring this out and then had an AI write a tutorial for it.
Here it is.
Run Qwen3-TTS on DGX Spark GB10 with Voice Cloning (OpenAI-compatible API)
I got faster-qwen3-tts running on the DGX Spark GB10 as a Docker container with an OpenAI-compatible TTS API. It works with OpenWebUI, SillyTavern, and any OpenAI TTS-compatible client.
It uses CUDA graphs for a 6-10x speedup over standard inference. The real-time factor is around 0.8 on the GB10 (faster than real time).
GitHub: https://github.com/mARTin-B78/dgx-spark-faster-qwen3-tts
Docker Hub: martinb78/faster-qwen3-tts-dgx-spark
What this solves
The DGX Spark’s ARM64 + Blackwell (SM 121) + CUDA 13 combo causes issues with standard ML Docker images. This image handles:
- torchaudio ARM64 wheels (uses PyTorch’s cu130 wheel index)
- Flash Attention won’t compile on SM 121, but CUDA graphs work great
- max_seq_len tuned for voice cloning workloads to avoid IndexError crashes
- OpenWebUI voice discovery endpoints (/v1/audio/voices, /v1/models)
Setup (5 minutes)
1. Download the model
mkdir -p ~/models
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base --local-dir ~/models/Qwen3-TTS
2. Clone the repo
git clone https://github.com/mARTin-B78/dgx-spark-faster-qwen3-tts.git
cd dgx-spark-faster-qwen3-tts
3. Set your model path
cp .env.example .env
nano .env
Set MODEL_PATH to wherever you downloaded the model:
MODEL_PATH=/home/youruser/models/Qwen3-TTS
4. Add voice references (optional but recommended)
Place 5-15 second audio clips (WAV or MP3) in config/speakers/ using this naming convention:
EN_M_Speaker_Name.wav # English, Male
EN_F_Speaker_Name.wav # English, Female
DE_M_Speaker_Name.wav # German, Male
For each audio file, create a matching transcript file:
EN_M_Speaker_Name.reference.txt
The transcript must match what’s spoken in the audio clip. If you have a Whisper-compatible ASR service running, you can auto-transcribe:
python config/auto_transcribe.py --api-url http://localhost:8010/v1/audio/transcriptions
Important: Keep reference audio to 5-15 seconds. Longer files cause slow inference and poor voice cloning quality. The transcript must match the trimmed audio, not some longer original.
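If you have more than a handful of speakers, a quick pre-flight check can catch naming and length mistakes before the container starts. The sketch below is not part of the repo; it only assumes the rules described above (LANG_GENDER_Name file names, a matching .reference.txt, 5-15 seconds of audio). The regex and the function names are my own guesses at a reasonable check, and only WAV length is measured because the Python standard library cannot decode MP3.

```python
import re
import wave
from pathlib import Path

# Assumed pattern, inferred from the examples above: two-letter language code,
# M or F, then the speaker name (e.g. EN_M_Speaker_Name.wav).
NAME_PATTERN = re.compile(r"^(?P<lang>[A-Z]{2})_(?P<gender>[MF])_(?P<name>.+)$")


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_speakers(speaker_dir, min_seconds=5.0, max_seconds=15.0):
    """Return a list of human-readable problems found in speaker_dir."""
    problems = []
    for audio in sorted(Path(speaker_dir).iterdir()):
        suffix = audio.suffix.lower()
        if suffix not in (".wav", ".mp3"):
            continue  # transcripts and other files are skipped here
        if NAME_PATTERN.match(audio.stem) is None:
            problems.append(f"{audio.name}: name does not match LANG_GENDER_Name")
        transcript = audio.with_name(audio.stem + ".reference.txt")
        if not transcript.is_file() or not transcript.read_text(encoding="utf-8").strip():
            problems.append(f"{audio.name}: missing or empty {transcript.name}")
        if suffix == ".wav":
            try:
                duration = wav_duration_seconds(audio)
            except wave.Error as error:
                problems.append(f"{audio.name}: unreadable WAV ({error})")
                continue
            if not min_seconds <= duration <= max_seconds:
                problems.append(
                    f"{audio.name}: {duration:.1f}s is outside "
                    f"{min_seconds:g}-{max_seconds:g}s"
                )
    return problems

# Usage, run from the repo root:
#   for problem in check_speakers("config/speakers"):
#       print(problem)
```

An empty list means every clip passed the checks. It says nothing about whether the transcript text actually matches the audio, so that still needs a listen.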
5. Create the Docker network (if you don’t have one already)
docker network create dgx_net
6. Start the container
docker compose up -d
First startup takes ~60 seconds for CUDA graph warmup. Check logs with:
docker logs -f faster-qwen3-tts
Wait until you see Uvicorn running on http://0.0.0.0:8000.
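If you start the container from a script, you can poll instead of watching the logs. This is a rough sketch, not something shipped with the image: it assumes the /health endpoint on host port 8020 (the same one used in the test step) answers with HTTP 200 once the warmup is done. The function names and the timeout values are placeholders I picked.

```python
import time
import urllib.request


def health_ok(url, timeout=2.0):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # urllib.error.URLError is a subclass of OSError, so refused
        # connections, timeouts, and HTTP error codes all land here.
        return False


def wait_until_ready(probe, max_wait=180.0, interval=3.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or max_wait seconds have passed."""
    deadline = clock() + max_wait
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)

# Usage (assumed URL, adjust to your port mapping):
#   ready = wait_until_ready(lambda: health_ok("http://localhost:8020/health"))
```

The probe is passed in as a callable so the waiting logic can be reused for other checks, for example a request to /v1/models.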
7. Test it
# Health check
curl http://localhost:8020/health
# List available voices
curl http://localhost:8020/v1/models
# Generate speech
curl -X POST http://localhost:8020/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model": "tts-1", "input": "Hello, this is a test of the text to speech system.", "voice": "speaker_name", "response_format": "wav"}' \
--output test.wav
Replace speaker_name with one of the voice IDs from /v1/models.
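The same request works from Python with only the standard library. This sketch mirrors the curl call above (same URL, same JSON fields); the helper names build_speech_request and synthesize_to_file are made up for illustration, and the 120 second timeout is an arbitrary choice.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8020"  # host port used in the curl examples


def build_speech_request(text, voice, base_url=BASE_URL,
                         model="tts-1", response_format="wav"):
    """Build the POST request for /v1/audio/speech without sending it."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, voice, out_path, timeout=120.0):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, voice)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as out_file:
        out_file.write(audio)
    return len(audio)

# Usage, with a voice ID taken from /v1/models:
#   synthesize_to_file("Hello, this is a test.", "speaker_name", "test.wav")
```

No API key header is set here because auth is not enforced; clients that insist on one can send any dummy value.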
OpenWebUI Integration
In OpenWebUI go to Settings > Audio > Text-to-Speech:
| Setting | Value |
|---|---|
| Engine | OpenAI |
| URL | http://faster-qwen3-tts:8000/v1 (or http://YOUR_IP:8020/v1) |
| API Key | sk-dummy (anything works, auth is not enforced) |
| TTS Model | tts-1 |
| TTS Voice | Pick from dropdown (auto-populated) |
SillyTavern Integration
Use the /speakers endpoint to list available voices. Set the TTS provider to OpenAI-compatible and point it at http://YOUR_IP:8020.
Portainer Stack
The docker-compose.yml works directly as a Portainer stack. Just copy-paste it into Portainer’s stack editor and set the MODEL_PATH environment variable.
Performance
On DGX Spark GB10 with the 1.7B model:
| Input | Audio Output | Generation Time | RTF |
|---|---|---|---|
| Short sentence | ~2s | ~2.5s | 0.8 |
| Medium paragraph | ~7s | ~5.5s | 0.77 |
Uses ~6 GB GPU memory.
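To measure this on your own setup, time a request and divide by the length of the audio you get back. The sketch below assumes RTF means generation time divided by audio duration (so values below 1.0 are faster than real time), which is the usual definition but is not spelled out by the project. It also assumes the returned WAV has a correct frame count in its header; a streamed response may not.

```python
import io
import wave


def wav_bytes_duration(wav_bytes):
    """Return the duration in seconds of a PCM WAV held in memory."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def real_time_factor(generation_seconds, audio_seconds):
    """Assumed definition: generation time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds

# Usage: wrap the speech request in time.perf_counter() calls, then
#   rtf = real_time_factor(elapsed_seconds, wav_bytes_duration(audio_bytes))
```

Very short inputs tend to look worse than long ones because fixed per-request overhead is spread over less audio, so compare runs with similar text lengths.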
Building from source (optional)
If you want to build the image yourself instead of pulling from Docker Hub:
docker build -t faster-qwen3-tts-dgx-spark:latest .