Support for Qwen3-TTS on DGX Spark (GB10) | torchaudio installation failure on ARM64

martinB78 · April 14, 2026, 7:19am

I spent a few days figuring this out and then had an AI write a tutorial for it.
Here it is.

Run Qwen3-TTS on DGX Spark GB10 with Voice Cloning (OpenAI-compatible API)

I got faster-qwen3-tts running on the DGX Spark GB10 as a Docker container with an OpenAI-compatible TTS API. It works with OpenWebUI, SillyTavern, and any OpenAI TTS-compatible client.

It uses CUDA graphs for a 6-10x speedup over standard inference. The real-time factor is around 0.8 on the GB10 (faster than real-time).

GitHub: https://github.com/mARTin-B78/dgx-spark-faster-qwen3-tts
Docker Hub: martinb78/faster-qwen3-tts-dgx-spark

What this solves

The DGX Spark’s ARM64 + Blackwell (SM 121) + CUDA 13 combo causes issues with standard ML Docker images. This image handles:

torchaudio ARM64 wheels (uses PyTorch’s cu130 wheel index)
Flash Attention, which won’t compile on SM 121 (the image relies on CUDA graphs instead, which work great)
max_seq_len tuned for voice cloning workloads to avoid IndexError crashes
OpenWebUI voice discovery endpoints (/v1/audio/voices, /v1/models)

Setup (5 minutes)

1. Download the model

mkdir -p ~/models
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base --local-dir ~/models/Qwen3-TTS

2. Clone the repo

git clone https://github.com/mARTin-B78/dgx-spark-faster-qwen3-tts.git
cd dgx-spark-faster-qwen3-tts

3. Set your model path

cp .env.example .env
nano .env

Set MODEL_PATH to wherever you downloaded the model:

MODEL_PATH=/home/youruser/models/Qwen3-TTS
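A wrong MODEL_PATH is easy to miss until the container fails to start, so it can help to check the .env file first. The following is a minimal sketch, not part of the repo: `read_env` and `check_model_path` are hypothetical helper names, and the parser assumes plain KEY=VALUE lines like the one shown above.

```python
from pathlib import Path


def read_env(env_file):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(env_file).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_model_path(env_file=".env"):
    """Return the MODEL_PATH directory, or raise if it is unset or missing."""
    model_path = read_env(env_file).get("MODEL_PATH", "")
    if not model_path:
        raise ValueError(f"MODEL_PATH is not set in {env_file}")
    path = Path(model_path).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"MODEL_PATH does not exist: {path}")
    return path
```

Calling `check_model_path(".env")` from the repo directory either returns the model directory or raises an error that names the problem.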

4. Add voice references (optional but recommended)

Place 5-15 second audio clips (WAV or MP3) in config/speakers/ using this naming convention:

EN_M_Speaker_Name.wav    # English, Male
EN_F_Speaker_Name.wav    # English, Female
DE_M_Speaker_Name.wav    # German, Male

For each audio file, create a matching transcript file:

EN_M_Speaker_Name.reference.txt

The transcript must match what’s spoken in the audio clip. If you have a Whisper-compatible ASR service running, you can auto-transcribe:

python config/auto_transcribe.py --api-url http://localhost:8010/v1/audio/transcriptions
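Before starting the container, it can be useful to confirm that every clip follows the naming convention and has its transcript. This is a rough sketch under assumptions: the helper names are made up for illustration, accepting any two-letter uppercase language code is a guess (the post only shows EN and DE), and the container's own parsing may differ.

```python
import re
from pathlib import Path

# LANG_GENDER_Name.ext, e.g. EN_M_Speaker_Name.wav
# Assumption: any two-letter uppercase language code is allowed.
SPEAKER_RE = re.compile(r"^(?P<lang>[A-Z]{2})_(?P<gender>[MF])_(?P<name>.+)\.(?:wav|mp3)$")


def parse_speaker_file(filename):
    """Split a speaker filename into language, gender, and name; None if it does not match."""
    match = SPEAKER_RE.match(filename)
    if match is None:
        return None
    return {"language": match["lang"], "gender": match["gender"], "name": match["name"]}


def missing_transcripts(speakers_dir):
    """List audio files in speakers_dir that lack a '<stem>.reference.txt' transcript."""
    folder = Path(speakers_dir)
    missing = []
    for audio in sorted(folder.iterdir()):
        if parse_speaker_file(audio.name) is None:
            continue  # not a speaker clip (for example, a transcript file)
        transcript = folder / (audio.stem + ".reference.txt")
        if not transcript.is_file():
            missing.append(audio.name)
    return missing
```

Running `missing_transcripts("config/speakers")` returns the clips that still need a transcript; an empty list means every clip is paired.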

Important: Keep reference audio to 5-15 seconds. Longer files cause slow inference and poor voice cloning quality. The transcript must match the trimmed audio, not some longer original.
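The 5-15 second window can be checked automatically for WAV clips. The sketch below uses Python's standard `wave` module, which reads PCM WAV only, so MP3 clips would need a different tool; the function names are illustrative and not part of the repo.

```python
import wave


def wav_duration_seconds(path):
    """Return the length of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference_length(path, min_seconds=5.0, max_seconds=15.0):
    """True if the clip falls inside the recommended 5-15 second window."""
    return min_seconds <= wav_duration_seconds(path) <= max_seconds
```

A clip that fails the check should be trimmed first and then transcribed, so that the transcript matches the trimmed audio.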

5. Create the Docker network (if you don’t have one already)

docker network create dgx_net

6. Start the container

docker compose up -d

First startup takes ~60 seconds for CUDA graph warmup. Check logs with:

docker logs -f faster-qwen3-tts

Wait until you see Uvicorn running on http://0.0.0.0:8000. That is the port inside the container; from the host, the API is reached on port 8020.
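Instead of watching the logs, a script can poll the health endpoint until the warmup is done. This is a minimal sketch that assumes `/health` on host port 8020 answers with HTTP 200 once the server is ready; the post only shows the curl command, not the response, so treat that as a guess. The function names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET to url answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(url="http://localhost:8020/health", max_wait=120.0,
                       interval=2.0, probe=http_ok, sleep=time.sleep):
    """Poll url until probe(url) succeeds or max_wait seconds pass; True on success."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

With the defaults, `wait_until_healthy()` gives the container two minutes, which leaves some margin over the ~60 second warmup mentioned above.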

7. Test it

# Health check
curl http://localhost:8020/health

# List available voices
curl http://localhost:8020/v1/models

# Generate speech
curl -X POST http://localhost:8020/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello, this is a test of the text to speech system.", "voice": "speaker_name", "response_format": "wav"}' \
  --output test.wav

Replace speaker_name with one of the voice IDs from /v1/models.
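The same request can be sent from Python with only the standard library. This sketch mirrors the curl call above (same path, JSON fields, and host port); the function names are made up for illustration, and it assumes the response body is the raw audio, as the `--output test.wav` in the curl example suggests.

```python
import json
import urllib.request


def build_speech_request(text, voice, base_url="http://localhost:8020",
                         model="tts-1", response_format="wav"):
    """Build the same POST /v1/audio/speech request as the curl example."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, voice, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, voice, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

For example, `synthesize_to_file("Hello, this is a test.", "speaker_name", "test.wav")` should produce the same file as the curl command, with `speaker_name` replaced by a real voice ID.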

OpenWebUI Integration

In OpenWebUI go to Settings > Audio > Text-to-Speech:

| Setting | Value |
| --- | --- |
| Engine | OpenAI |
| URL | `http://faster-qwen3-tts:8000/v1` (or `http://YOUR_IP:8020/v1`) |
| API Key | `sk-dummy` (anything works, auth is not enforced) |
| TTS Model | `tts-1` |
| TTS Voice | Pick from dropdown (auto-populated) |

SillyTavern Integration

Use the /speakers endpoint to list available voices. Set the TTS provider to OpenAI-compatible and point it at http://YOUR_IP:8020.

Portainer Stack

The docker-compose.yml works directly as a Portainer stack. Just copy-paste it into Portainer’s stack editor and set the MODEL_PATH environment variable.

Performance

On DGX Spark GB10 with the 1.7B model:

| Input | Audio Output | Generation Time | RTF |
| --- | --- | --- | --- |
| Short sentence | ~2s | ~2.5s | 0.8 |
| Medium paragraph | ~7s | ~5.5s | 0.77 |

Uses ~6 GB GPU memory.
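For anyone reproducing these numbers, the real-time factor is usually defined as generation time divided by audio duration, so values below 1.0 mean faster than real-time. The sketch below applies that common definition to the approximate figures in the table; it is an assumption that the table uses the same definition. With the rounded values, the medium paragraph comes out near the listed 0.77, while the short-sentence row as written (about 2s of audio in about 2.5s) would give 1.25 rather than 0.8, so its two time columns may be swapped or the rounding may be coarse.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = generation time / audio duration; below 1.0 is faster than real-time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Approximate figures taken from the table above.
medium = real_time_factor(generation_seconds=5.5, audio_seconds=7.0)
short = real_time_factor(generation_seconds=2.5, audio_seconds=2.0)
print(f"medium paragraph RTF: {medium:.2f}")
print(f"short sentence RTF:   {short:.2f}")
```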

Building from source (optional)

If you want to build the image yourself instead of pulling from Docker Hub:

docker build -t faster-qwen3-tts-dgx-spark:latest .
