Support for Qwen3-TTS on DGX Spark (GB10) | torchaudio installation failure on ARM64

I spent a few days figuring this out and then had an AI write a tutorial for it.
Here it is.

Run Qwen3-TTS on DGX Spark GB10 with Voice Cloning (OpenAI-compatible API)

I got faster-qwen3-tts running on the DGX Spark GB10 as a Docker container with an OpenAI-compatible TTS API. It works with OpenWebUI, SillyTavern, and any OpenAI TTS-compatible client.

It uses CUDA graphs for a 6-10x speedup over standard inference. The real-time factor is around 0.8 on the GB10 (faster than real time).

GitHub: https://github.com/mARTin-B78/dgx-spark-faster-qwen3-tts
Docker Hub: martinb78/faster-qwen3-tts-dgx-spark

What this solves

The DGX Spark’s ARM64 + Blackwell (SM 121) + CUDA 13 combo causes issues with standard ML Docker images. This image handles:

  • torchaudio ARM64 wheels (uses PyTorch’s cu130 wheel index)
  • Flash Attention (does not compile on SM 121; CUDA graphs provide the speedup instead)
  • max_seq_len tuned for voice cloning workloads to avoid IndexError crashes
  • OpenWebUI voice discovery endpoints (/v1/audio/voices, /v1/models)

Setup (5 minutes)

1. Download the model

mkdir -p ~/models
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base --local-dir ~/models/Qwen3-TTS

2. Clone the repo

git clone https://github.com/mARTin-B78/dgx-spark-faster-qwen3-tts.git
cd dgx-spark-faster-qwen3-tts

3. Set your model path

cp .env.example .env
nano .env

Set MODEL_PATH to wherever you downloaded the model:

MODEL_PATH=/home/youruser/models/Qwen3-TTS
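
If you want to double-check the path before starting the container, something like the following could work. This is only a sketch: it assumes .env contains plain KEY=VALUE lines, and the helper names read_env and check_model_path are made up for this example (they are not part of the repo).

```python
from pathlib import Path


def read_env(env_file):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments (assumed format)."""
    values = {}
    for line in Path(env_file).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_model_path(env_file=".env"):
    """Return (ok, message) saying whether MODEL_PATH is a non-empty directory."""
    model_path = read_env(env_file).get("MODEL_PATH")
    if not model_path:
        return False, "MODEL_PATH is not set"
    path = Path(model_path).expanduser()
    if not path.is_dir():
        return False, f"{path} is not a directory"
    if not any(path.iterdir()):
        return False, f"{path} is empty"
    return True, f"{path} looks usable"
```

Calling check_model_path() from the repo folder returns a pair such as (False, "MODEL_PATH is not set"). It only checks that the directory exists and has something in it, not that the model files are complete.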

4. Add voice references (optional but recommended)

Place 5-15 second audio clips (WAV or MP3) in config/speakers/ using this naming convention:

EN_M_Speaker_Name.wav    # English, Male
EN_F_Speaker_Name.wav    # English, Female
DE_M_Speaker_Name.wav    # German, Male
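
To make the convention concrete, here is a small sketch that splits such a file name into its parts. The pattern is my reading of the three examples above (two-letter language code, M or F, then the speaker name); the container's actual parsing may differ, and parse_speaker_filename is a hypothetical helper, not something from the repo.

```python
import re

# Assumed pattern, inferred from the example names:
# two uppercase letters, M or F, the speaker name, then .wav or .mp3.
SPEAKER_FILE = re.compile(
    r"^(?P<lang>[A-Z]{2})_(?P<gender>[MF])_(?P<name>.+)\.(?P<ext>wav|mp3)$"
)


def parse_speaker_filename(filename):
    """Return the parts of a speaker file name, or None if it does not match."""
    match = SPEAKER_FILE.match(filename)
    if match is None:
        return None
    return match.groupdict()


print(parse_speaker_filename("EN_M_Speaker_Name.wav"))
print(parse_speaker_filename("speaker.wav"))  # does not follow the convention
```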

For each audio file, create a matching transcript file:

EN_M_Speaker_Name.reference.txt

The transcript must match what’s spoken in the audio clip. If you have a Whisper-compatible ASR service running, you can auto-transcribe:

python config/auto_transcribe.py --api-url http://localhost:8010/v1/audio/transcriptions

Important: Keep reference audio to 5-15 seconds. Longer files cause slow inference and poor voice cloning quality. The transcript must match the trimmed audio, not some longer original.
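
A quick way to catch missing transcripts or over-long clips before starting the container is a check like the one below. It is a sketch using only the Python standard library, so it measures duration for uncompressed PCM WAV files only and skips the length check for MP3. The function names are hypothetical, and the 5-15 second limits are simply the recommendation from this post.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, from its frame count and sample rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_speakers(folder="config/speakers", min_s=5.0, max_s=15.0):
    """Return a list of human-readable problems found in the speakers folder."""
    problems = []
    for audio in sorted(Path(folder).iterdir()):
        if audio.suffix.lower() not in {".wav", ".mp3"}:
            continue
        transcript = audio.with_name(audio.stem + ".reference.txt")
        if not transcript.is_file() or not transcript.read_text(encoding="utf-8").strip():
            problems.append(f"{audio.name}: missing or empty {transcript.name}")
        if audio.suffix.lower() != ".wav":
            continue  # MP3 length is not checked in this sketch
        try:
            duration = wav_duration_seconds(audio)
        except wave.Error as exc:
            problems.append(f"{audio.name}: could not read WAV ({exc})")
            continue
        if not min_s <= duration <= max_s:
            problems.append(f"{audio.name}: {duration:.1f}s is outside {min_s:g}-{max_s:g}s")
    return problems
```

An empty list from check_speakers() means every clip has a non-empty transcript and every readable WAV is within the recommended length. It cannot tell whether the transcript text actually matches the audio.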

5. Create the Docker network (if you don’t have one already)

docker network create dgx_net

6. Start the container

docker compose up -d

First startup takes ~60 seconds for CUDA graph warmup. Check logs with:

docker logs -f faster-qwen3-tts

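Wait until you see Uvicorn running on http://0.0.0.0:8000. Port 8000 is the port inside the container; from the host, the API is reachable on port 8020, which is what the commands below use.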
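
If you start the container from a script, you may prefer to poll instead of watching the logs. The sketch below assumes the /health endpoint answers with HTTP 200 once the server is up and that the API is published on host port 8020; whether /health already responds during CUDA graph warmup is something I have not verified. wait_until_healthy is a hypothetical helper name.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(url="http://localhost:8020/health", timeout_s=120.0, interval_s=2.0):
    """Poll url until it answers with HTTP 200; return False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet, or it returned an error status
        time.sleep(interval_s)
    return False
```

A script could then call wait_until_healthy() right after docker compose up -d and only continue when it returns True.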

7. Test it

# Health check
curl http://localhost:8020/health

# List available voices
curl http://localhost:8020/v1/models

# Generate speech
curl -X POST http://localhost:8020/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello, this is a test of the text to speech system.", "voice": "speaker_name", "response_format": "wav"}' \
  --output test.wav

Replace speaker_name with one of the voice IDs from /v1/models.
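
The same request can be sent from Python without extra dependencies. This is a minimal sketch that mirrors the curl call above (same endpoint, same JSON fields); the function names build_speech_request and synthesize are made up for this example, and error handling is left out.

```python
import json
import urllib.request


def build_speech_request(base_url, text, voice, model="tts-1", response_format="wav"):
    """Build the same POST request that the curl command sends."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(base_url, text, voice, out_path="test.wav"):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(base_url, text, voice)
    with urllib.request.urlopen(request, timeout=120) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return out_path
```

For example, synthesize("http://localhost:8020", "Hello, this is a test.", "speaker_name") writes test.wav to the current folder, with speaker_name again replaced by one of your voice IDs.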

OpenWebUI Integration

In OpenWebUI go to Settings > Audio > Text-to-Speech:

Setting   | Value
Engine    | OpenAI
URL       | http://faster-qwen3-tts:8000/v1 (or http://YOUR_IP:8020/v1)
API Key   | sk-dummy (anything works, auth is not enforced)
TTS Model | tts-1
TTS Voice | Pick from dropdown (auto-populated)

SillyTavern Integration

Use the /speakers endpoint to list available voices. Set the TTS provider to OpenAI-compatible and point it at http://YOUR_IP:8020.

Portainer Stack

The docker-compose.yml works directly as a Portainer stack. Just copy-paste it into Portainer’s stack editor and set the MODEL_PATH environment variable.

Performance

On DGX Spark GB10 with the 1.7B model:

Input            | Audio Output | Generation Time | RTF
Short sentence   | ~2s          | ~2.5s           | 0.8
Medium paragraph | ~7s          | ~5.5s           | 0.77
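
For anyone comparing their own numbers: the post does not spell out how RTF is calculated, so the sketch below assumes the common definition, generation time divided by audio duration, where values below 1 mean faster than real time. With the rounded timings from the table, the medium paragraph comes out at about 0.79, close to the reported 0.77. The short sentence comes out at 1.25, so the reported 0.8 for that row appears to be the inverse ratio (audio duration divided by generation time).

```python
def real_time_factor(generation_s, audio_s):
    """RTF as generation time divided by audio duration (lower is faster)."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return generation_s / audio_s


# Approximate timings taken from the table above.
rows = [("short sentence", 2.0, 2.5), ("medium paragraph", 7.0, 5.5)]
for label, audio_s, generation_s in rows:
    print(f"{label}: RTF = {real_time_factor(generation_s, audio_s):.2f}")
```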

The container uses ~6 GB of GPU memory.

Building from source (optional)

If you want to build the image yourself instead of pulling from Docker Hub:

docker build -t faster-qwen3-tts-dgx-spark:latest .