Support for Qwen3-TTS on DGX Spark (GB10) | torchaudio installation failure on ARM64

Hi everyone,

I am attempting to deploy Qwen3-TTS on an NVIDIA DGX Spark (Grace Blackwell / GB10) system.

The Problem: To run Qwen3-TTS via vLLM or native PyTorch, torchaudio is required. However, I have been unable to install a functional version within the nvcr.io/nvidia/pytorch:25.11-py3 (ARM64) container or via manual builds. I have also tried the latest nvcr.io pytorch version.

  • Pip Install: pip install torchaudio fetches x86 wheels or CPU-only versions that are incompatible with the Spark’s architecture.

  • Build from Source: Compiling torchaudio from source against the pre-installed PyTorch in the NGC container fails due to header mismatches (specifically around torch/csrc/stable/device.h) and the transition of Blackwell to CUDA 13.0 / SM 12.1.

Everything I have tried eventually ends in this error:
OSError: Could not load this library: /usr/local/lib/python3.12/dist-packages/torchaudio/lib/libtorchaudio.so
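The OSError above hides the underlying loader message. One way to surface it is to ask the dynamic loader directly. This is a minimal sketch using only the standard library; the helper name load_error is made up for illustration, and the path is the one from the error message.

```python
import ctypes

def load_error(path):
    """Return the dynamic loader's error message for `path`, or None if it loads."""
    try:
        ctypes.CDLL(path)
    except OSError as exc:
        return str(exc)
    return None

if __name__ == "__main__":
    lib = "/usr/local/lib/python3.12/dist-packages/torchaudio/lib/libtorchaudio.so"
    print(load_error(lib) or "library loaded without errors")
```

An "undefined symbol" message usually points at an ABI mismatch between torchaudio and the installed torch, while "cannot open shared object file" points at a missing dependency. Importing torch first in the same process may be needed so that its shared libraries are already loaded.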

Is there an official NVIDIA-provided .whl or a validated build recipe for torchaudio that is ABI-compatible with the DGX Spark’s CUDA 13 environment?

Any help getting Qwen’s voice capabilities running on this hardware would be greatly appreciated!

got the same issue with any TTS.
Would also love to know how to solve it.

torchaudio is being deprecated, so it is not included in the NVIDIA PyTorch containers. You can use the PyTorch build from pytorch.org instead.

Add the following line to your .bashrc file and open a new shell (or source the file):

export PIP_EXTRA_INDEX_URL=https://download.pytorch.org/whl/cu130

Then uninstall torch, torchaudio and torchvision and reinstall them. This time pip should find the wheels that match the DGX Spark. If that doesn’t work, you can force it with:

pip install torch torchaudio --index-url=https://download.pytorch.org/whl/cu130

I am running Qwen3-TTS via Comfy after installing Torchaudio as above and it runs fast!
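After reinstalling from the cu130 index, a quick check like the following can confirm that both packages import and that the torch build sees CUDA. It is only a sketch: the probe helper is a made-up name, and it reports failures instead of raising so it also works when a package is missing or its shared library is broken.

```python
import importlib

def probe(module_name):
    """Try to import a module; return (ok, detail) instead of raising."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # ImportError, or OSError from a broken .so
        return False, f"{type(exc).__name__}: {exc}"
    return True, getattr(module, "__version__", "unknown version")

if __name__ == "__main__":
    for name in ("torch", "torchaudio"):
        ok, detail = probe(name)
        print(f"{name}: {'OK' if ok else 'FAILED'} ({detail})")
    ok, _ = probe("torch")
    if ok:
        import torch
        print("CUDA build:", torch.version.cuda)
        print("CUDA available:", torch.cuda.is_available())
```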

I managed to adapt this project to the GB10 platform: martinobettucci/Voice-Clone-Studio on GitHub, a Gradio-based web UI for voice cloning and voice design, powered by Qwen3-TTS & VibeVoice. It can use Whisper or VibeVoice-ASR for automatic transcription.

Hope it helps :) It uses Astral uv. Install the system packages and run it:

sudo apt update
sudo apt install -y ffmpeg sox libsox-fmt-all
sudo apt install -y espeak-ng libespeak-ng-dev libespeak-ng1

uv sync
uv run python voice_clone_studio.py --default-tenant default --allow-config

Please ignore the setup scripts and the docker files as I still have to fix them: I will eventually in the upcoming days

It works in a Docker environment if you disable the transformers version check in the source code and mock up torchaudio.

sed -i '57s/require_version_core/# require_version_core/' \
  /usr/local/lib/python3.12/dist-packages/transformers/dependency_versions_check.py

Step 1: Detects torchaudio → Attempts to call fbank().
Step 2: Returns None → Triggers Fallback.
Step 3: Processes audio using librosa or scipy.
Result: ✅ Success
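The three steps above amount to a try-then-fall-back pattern. The sketch below only illustrates the control flow: mocked_fbank mirrors the stub that returns None, while frame_log_energy is a hypothetical stand-in for the real librosa/scipy fallback, which lives in the calling library and is not shown in this thread.

```python
import math

def mocked_fbank(*args, **kwargs):
    """Stand-in for the mocked torchaudio.compliance.kaldi.fbank: always None."""
    return None

def frame_log_energy(samples, frame_len=4):
    """Hypothetical fallback: log energy per non-overlapping frame."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    return [math.log(sum(x * x for x in frame) + 1e-10) for frame in frames]

def extract_features(samples, primary=mocked_fbank, fallback=frame_log_energy):
    """Use the primary extractor; fall back when it returns None."""
    features = primary(samples)
    if features is None:
        features = fallback(samples)
    return features
```

Note that this only helps when the caller actually checks for None. Code that uses the fbank result directly would fail on the mock.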

python3 - <<EOF
import os, site
dest = site.getsitepackages()[0]

# Mock-up directory for torchaudio.compliance.kaldi
ka_path = os.path.join(dest, 'torchaudio', 'compliance')
os.makedirs(ka_path, exist_ok=True)

# Stub kaldi functions return None so that callers take their fallback path
with open(os.path.join(ka_path, 'kaldi.py'), 'w') as f:
    f.write("""
def fbank(*args, **kwargs): return None
def spectrogram(*args, **kwargs): return None
""")

# Minimal torchaudio package that only exposes a version string
with open(os.path.join(dest, 'torchaudio', '__init__.py'), 'w') as f:
    f.write("__version__ = '2.2.2-mocked-for-blackwell'")

# Mock sox package: every attribute lookup returns a no-op callable
sox_path = os.path.join(dest, 'sox')
os.makedirs(sox_path, exist_ok=True)
with open(os.path.join(sox_path, '__init__.py'), 'w') as f:
    f.write("def __getattr__(name): return lambda *args, **kwargs: None")
EOF

I spent a few days figuring this out and then had the AI write a tutorial for it.
Here it is.

Run Qwen3-TTS on DGX Spark GB10 with Voice Cloning (OpenAI-compatible API)

I got faster-qwen3-tts running on the DGX Spark GB10 as a Docker container with an OpenAI-compatible TTS API. It works with OpenWebUI, SillyTavern, and any OpenAI TTS-compatible client.

Uses CUDA graphs for 6-10x speedup over standard inference. Real-time factor is around 0.8x on the GB10 (faster than real-time).

GitHub: mARTin-B78/dgx-spark-faster-qwen3-tts (Run Faster-Qwen3-TTS on NVIDIA DGX Spark GB10 (ARM64/SM121/CUDA13) - OpenAI-compatible TTS API with CUDA graph acceleration)
Docker Hub: martinb78/faster-qwen3-tts-dgx-spark

What this solves

The DGX Spark’s ARM64 + Blackwell (SM 121) + CUDA 13 combo causes issues with standard ML Docker images. This image handles:

  • torchaudio ARM64 wheels (uses PyTorch’s cu130 wheel index)
  • Flash Attention won’t compile on SM 121, but CUDA graphs work great
  • max_seq_len tuned for voice cloning workloads to avoid IndexError crashes
  • OpenWebUI voice discovery endpoints (/v1/audio/voices, /v1/models)

Setup (5 minutes)

1. Download the model

mkdir -p ~/models
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base --local-dir ~/models/Qwen3-TTS

2. Clone the repo

git clone https://github.com/mARTin-B78/dgx-spark-faster-qwen3-tts.git
cd dgx-spark-faster-qwen3-tts

3. Set your model path

cp .env.example .env
nano .env

Set MODEL_PATH to wherever you downloaded the model:

MODEL_PATH=/home/youruser/models/Qwen3-TTS

4. Add voice references (optional but recommended)

Place 5-15 second audio clips (WAV or MP3) in config/speakers/ using this naming convention:

EN_M_Speaker_Name.wav    # English, Male
EN_F_Speaker_Name.wav    # English, Female
DE_M_Speaker_Name.wav    # German, Male

For each audio file, create a matching transcript file:

EN_M_Speaker_Name.reference.txt

The transcript must match what’s spoken in the audio clip. If you have a Whisper-compatible ASR service running, you can auto-transcribe:

python config/auto_transcribe.py --api-url http://localhost:8010/v1/audio/transcriptions

Important: Keep reference audio to 5-15 seconds. Longer files cause slow inference and poor voice cloning quality. The transcript must match the trimmed audio, not some longer original.
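Before starting the container, it can help to check that every clip follows the naming convention and has a non-empty transcript. The sketch below is not part of the repo; the regular expression is my reading of the LANG_GENDER_Name convention described above (two-letter language code, M or F), and check_speakers is a made-up helper name.

```python
import re
from pathlib import Path

# Assumed pattern: two-letter language code, M or F, then the speaker name.
NAME_RE = re.compile(r"^(?P<lang>[A-Z]{2})_(?P<gender>[MF])_(?P<name>.+)$")
AUDIO_SUFFIXES = {".wav", ".mp3"}

def check_speakers(directory):
    """Return a list of human-readable problems found in `directory`."""
    problems = []
    for audio in sorted(Path(directory).iterdir()):
        if audio.suffix.lower() not in AUDIO_SUFFIXES:
            continue  # skip transcripts and anything that is not an audio clip
        if NAME_RE.match(audio.stem) is None:
            problems.append(f"{audio.name}: name does not follow LANG_GENDER_Name")
        transcript = audio.with_name(audio.stem + ".reference.txt")
        if not transcript.is_file():
            problems.append(f"{audio.name}: missing {transcript.name}")
        elif not transcript.read_text(encoding="utf-8").strip():
            problems.append(f"{audio.name}: {transcript.name} is empty")
    return problems
```

Calling check_speakers("config/speakers") returns an empty list when everything is paired up. It does not check the 5-15 second length limit.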

5. Create the Docker network (if you don’t have one already)

docker network create dgx_net

6. Start the container

docker compose up -d

First startup takes ~60 seconds for CUDA graph warmup. Check logs with:

docker logs -f faster-qwen3-tts

Wait until you see Uvicorn running on http://0.0.0.0:8000.
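Instead of watching the logs, a script can poll the health endpoint until the server answers. This is a sketch under the assumption that /health on the published port 8020 returns HTTP 200 once warmup is done; both helper names are made up, and the attempt count and delay are arbitrary.

```python
import time
import urllib.error
import urllib.request

def http_ok(url, timeout=2.0):
    """Return True if `url` answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_until_ready(probe, attempts=60, delay=2.0, sleep=time.sleep):
    """Call `probe()` until it returns True; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        sleep(delay)
    return None
```

Usage would look like wait_until_ready(lambda: http_ok("http://localhost:8020/health")), which returns None if the server never comes up within the allowed attempts.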

7. Test it

# Health check
curl http://localhost:8020/health

# List available voices
curl http://localhost:8020/v1/models

# Generate speech
curl -X POST http://localhost:8020/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello, this is a test of the text to speech system.", "voice": "speaker_name", "response_format": "wav"}' \
  --output test.wav

Replace speaker_name with one of the voice IDs from /v1/models.
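The same request can be sent from Python with only the standard library. This sketch mirrors the curl call above; the function names, the timeout, and the default output path are my own choices, not part of the server.

```python
import json
import urllib.request

def build_speech_request(text, voice, base_url="http://localhost:8020"):
    """Build the POST request for /v1/audio/speech without sending it."""
    payload = {
        "model": "tts-1",
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def synthesize(text, voice, out_path="test.wav", timeout=120):
    """Send the request and write the returned audio bytes to `out_path`."""
    request = build_speech_request(text, voice)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With the container running, synthesize("Hello, this is a test.", "speaker_name") writes test.wav and returns the number of bytes received.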

OpenWebUI Integration

In OpenWebUI go to Settings > Audio > Text-to-Speech:

| Setting | Value |
|---|---|
| Engine | OpenAI |
| URL | http://faster-qwen3-tts:8000/v1 (or http://YOUR_IP:8020/v1) |
| API Key | sk-dummy (anything works, auth is not enforced) |
| TTS Model | tts-1 |
| TTS Voice | Pick from dropdown (auto-populated) |

SillyTavern Integration

Use the /speakers endpoint to list available voices. Set the TTS provider to OpenAI-compatible and point it at http://YOUR_IP:8020.

Portainer Stack

The docker-compose.yml works directly as a Portainer stack. Just copy-paste it into Portainer’s stack editor and set the MODEL_PATH environment variable.

Performance

On DGX Spark GB10 with the 1.7B model:

| Input | Audio Output | Generation Time | RTF |
|---|---|---|---|
| Short sentence | ~2s | ~2.5s | 0.8 |
| Medium paragraph | ~7s | ~5.5s | 0.77 |

Uses ~6 GB GPU memory.

Building from source (optional)

If you want to build the image yourself instead of pulling from Docker Hub:

docker build -t faster-qwen3-tts-dgx-spark:latest .

hey good work, i got this error after running docker compose up

faster-qwen3-tts | Success! Generated voices.json with 0 mapped voices.
faster-qwen3-tts | Traceback (most recent call last):
faster-qwen3-tts |   File "/config/run_server.py", line 57, in <module>
faster-qwen3-tts |     openai_server.main()
faster-qwen3-tts |   File "/app/examples/openai_server.py", line 322, in main
faster-qwen3-tts |     default_voice = next(iter(voices))
faster-qwen3-tts |                     ^^^^^^^^^^^^^^^^^^
faster-qwen3-tts | StopIteration

AI generated reply
 hope it helps.

Hi @saikanov, this error occurs because no voice files were detected in your speakers directory. When the server starts, it tries to set a default voice but finds an empty voices list, causing the StopIteration crash.

Quick Fix (2 options):

Option 1: Add voice reference files (Recommended)

The server needs .wav or .mp3 files in config/speakers/. For each audio file, create a matching .reference.txt:

# Example setup:
config/speakers/
├── EN_F_Default.wav
├── EN_F_Default.reference.txt
├── EN_M_Default.wav
└── EN_M_Default.reference.txt

The .reference.txt should contain the exact transcript of what’s spoken in the audio:

Hello, this is a test voice.

Option 2: Use built-in voices (Quick workaround)

If you don’t have voice samples yet, modify examples/openai_server.py around line 322 to handle empty voices gracefully:

if voices:
    default_voice = next(iter(voices))
else:
    # Fallback when no custom voices exist
    default_voice = "default"
    voices["default"] = {"ref_audio": None, "language": "English"}

Then restart: docker compose up -d

Why this happens:
  1. When the container starts, it runs generate_voices.py.
  2. This script scans the speakers directory for .wav and .mp3 files.
  3. If none are found → voices.json has 0 entries.
  4. The server tries to pick a default voice from the empty list → crash.
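The crash can be reproduced in a few lines, and a guard with a clearer message is easy to sketch. pick_default_voice is a hypothetical helper, not the code in openai_server.py; it only shows that next() accepts a default value that avoids the bare StopIteration.

```python
def pick_default_voice(voices):
    """Return the first voice ID, or raise a descriptive error if there are none."""
    default_voice = next(iter(voices), None)  # None instead of StopIteration
    if default_voice is None:
        raise RuntimeError("no voices found; add audio files to config/speakers/")
    return default_voice

# Reproduction of the original failure with an empty voices mapping.
try:
    next(iter({}))
except StopIteration:
    print("empty voices dict -> StopIteration, as in the traceback")
```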

How to add voices:
  1. Record or download 5-15 second audio samples (WAV/MP3).
  2. Name them with language/gender prefixes: EN_F_YourName.wav, DE_M_YourName.wav, etc.
  3. Add transcripts: create EN_F_YourName.reference.txt with what’s spoken.
  4. Place the files in the speakers directory.
  5. Restart: docker compose restart

Let me know if this resolves it! If you’re still seeing issues, check:

docker logs faster-qwen3-tts for the full error
Whether audio files are actually being mounted in the container: docker exec faster-qwen3-tts ls -la /config/speakers/

Thanks for your quick response

I already have a voice; I generated it with Kokoro.

Now I’m facing a new error:

faster-qwen3-tts | Traceback (most recent call last):
faster-qwen3-tts |   File "/config/generate_voices.py", line 57, in <module>
faster-qwen3-tts |     voices[voice_id] = entry
faster-qwen3-tts |            ^^^^^^^^
faster-qwen3-tts | NameError: name 'voice_id' is not defined
faster-qwen3-tts exited with code 1 (restarting)

It worked after changing

voices[voice_id] = entry

to

voices[base_name] = entry

on line 57 of generate_voices.py.

I tried many times, but it keeps failing. I hope someone can help me.
===== error below =====
jiazhixian@spark-jiazhixian:~/Models/tts/dgx-spark-faster-qwen3-tts$ docker compose down
docker compose up -d
[+] down 1/1
✔ Container faster-qwen3-tts Removed 1.5s
[+] up 1/1
✔ Container faster-qwen3-tts Created 0.1s
jiazhixian@spark-jiazhixian:~/Models/tts/dgx-spark-faster-qwen3-tts$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
74f679bf4f8e martinb78/faster-qwen3-tts-dgx-spark:latest "python3 /config/run" 23 seconds ago Up 1 second 0.0.0.0:8020->8000/tcp, [::]:8020->8000/tcp faster-qwen3-tts
f5a197db5055 ghcr.io/open-webui/open-webui:main "bash start.sh" 3 weeks ago Up 5 hours (healthy) 0.0.0.0:3000->8080/tcp, [::]:3000->8080/tcp open-webui
jiazhixian@spark-jiazhixian:~/Models/tts/dgx-spark-faster-qwen3-tts$ curl -X POST http://localhost:8020/v1/audio/speech -H "Content-Type: application/json" -d '{"model": "speaker1", "input": "Hello, this is a test of the text to speech system.", "voice": "speaker1", "response_format": "wav"}' --output test.wav
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   132    0     0  100   132      0   129k --:--:-- --:--:-- --:--:--  128k
curl: (56) Recv failure: Connection reset by peer
jiazhixian@spark-jiazhixian:~/Models/tts/dgx-spark-faster-qwen3-tts$ curl -X POST http://localhost:8020/v1/audio/speech -H "Content-Type: application/json" -d '{"model": "tts-1", "input": "Hello, this is a test of the text to speech system.", "voice": "speaker1", "response_format": "wav"}' --output test.wav
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (7) Failed to connect to localhost port 8020 after 0 ms: Couldn't connect to server
jiazhixian@spark-jiazhixian:~/Models/tts/dgx-spark-faster-qwen3-tts$ docker logs faster-qwen3-tts --tail 50
                     [--max-seq-len MAX_SEQ_LEN]
run_server.py: error: unrecognized arguments: --x-vector-only-mode
Success! Generated voices.json with 1 mapped voices.
usage: run_server.py [-h] [--model MODEL] [--voices FILE] [--ref-audio FILE]
                     [--ref-text REF_TEXT] [--language LANGUAGE] [--host HOST]
                     [--port PORT] [--device DEVICE]
                     [--max-seq-len MAX_SEQ_LEN]
run_server.py: error: unrecognized arguments: --x-vector-only-mode

How to Solve It

The run_server.py script no longer supports (or never supported) the --x-vector-only-mode argument in this version.

To fix this:

  1. Open their docker-compose.yml file.
  2. Find the command: directive under the faster-qwen3-tts service.
  3. Remove the --x-vector-only-mode flag.
  4. Restart the container with the updated configuration:
docker compose down
docker compose up -d
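The usage text in the log looks like standard argparse output, and argparse exits with status 2 on any flag it does not know. The sketch below reproduces that behaviour with a hypothetical parser that copies only a few of the options from the usage message; the types and defaults are guesses, and the real run_server.py may be set up differently.

```python
import argparse

def build_parser():
    """Hypothetical parser with a subset of the options shown in the usage text."""
    parser = argparse.ArgumentParser(prog="run_server.py")
    parser.add_argument("--model")
    parser.add_argument("--port", type=int, default=8000)  # default is an assumption
    parser.add_argument("--max-seq-len", type=int)
    return parser

def rejected_flags(argv):
    """Return the arguments the parser does not recognise."""
    _, unknown = build_parser().parse_known_args(argv)
    return unknown

print(rejected_flags(["--port", "8000", "--x-vector-only-mode"]))
```

Because the process exits immediately and the container is set to restart, the same usage message shows up in the logs over and over until the flag is removed.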

How has your latency been for conversational AI? Any round-trip numbers?

not great yet

TTFA (time to first audio) is great at 2ms, so streaming is working. The perceived slowness comes from the model itself.

| Scenario | RTF after warmup | Meaning |
|---|---|---|
| Short | ~0.99× | generates at exactly speech speed |
| Medium | ~0.88× | slightly faster than speech |
| Long | ~0.82× | 1.2× real-time |

The first short-sentence run (8.73s for 2.56s of audio) is the main culprit — that’s CUDA graph compilation on first call after startup. Every restart forces users to wait ~7 extra seconds on the very first request.
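For anyone comparing numbers: I am assuming the usual definition of real-time factor, generation time divided by audio duration, so values below 1 mean faster than real-time. A small sketch with the figures quoted above:

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the audio produced."""
    return generation_seconds / audio_seconds

def speed_multiple(rtf):
    """Seconds of audio produced per second of compute."""
    return 1.0 / rtf

cold = real_time_factor(8.73, 2.56)
print(f"cold first request: RTF {cold:.2f}")                # about 3.41
print(f"long text: {speed_multiple(0.82):.2f}x real-time")  # about 1.22
```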

I told the AI to make it “quicker”; here is what we got.

Update: Latency improvements — CUDA warmup + streaming optimizations

Just pushed a few changes that meaningfully reduce response latency:

What changed

  • CUDA warmup at startup — the server now runs a silent dummy inference during startup to pre-compile the CUDA graphs. Previously the first request after every restart paid a ~7-8s graph-compilation penalty. Now it’s warm before any real request arrives.
  • chunk_size: 4 — reduced from the default 12. For clients that support streaming playback, first audio now arrives after ~333ms of generation instead of ~1s.
  • --max-seq-len 2048 — halved the static KV cache size. Sufficient for typical TTS inputs, reduces VRAM usage and speeds up graph capture at startup.
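The ~333ms and ~1s figures follow from simple arithmetic if one assumes the codec emits 12 frames per second (my reading of the "12Hz" in the model name; the exact rate may differ slightly) and that generation runs at roughly real-time speed:

```python
FRAME_RATE_HZ = 12  # assumption: taken from the "12Hz" in the model name

def audio_per_chunk_ms(chunk_size, frame_rate_hz=FRAME_RATE_HZ):
    """Milliseconds of audio carried by one streamed chunk of codec frames."""
    return 1000.0 * chunk_size / frame_rate_hz

for size in (12, 4):
    print(f"chunk_size={size}: ~{audio_per_chunk_ms(size):.0f} ms of audio per chunk")
```

At an RTF close to 1, producing a chunk takes about as long as the audio it contains, so a smaller chunk reaches the client sooner at the cost of more, smaller chunks.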

Benchmarks (1.7B model, DGX Spark GB10)

| Metric | Before | After |
|---|---|---|
| First request (cold) | ~8.7s | ~2.2s |
| Subsequent short sentences | ~1.6s | ~1.6s |
| RTF (medium text) | 0.88x | 0.88x |

The per-request speed is unchanged — that’s the model’s hard limit at ~1× real-time. The warmup fix is the main win.
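The idea behind the warmup can be shown with a toy model. Nothing here touches CUDA or the real server code: GraphCachedModel and start_server are invented names, and the counter merely stands in for the one-time graph capture.

```python
class GraphCachedModel:
    """Toy stand-in for a model whose first call pays a one-time capture cost."""

    def __init__(self):
        self.captures = 0   # how many times the expensive step ran
        self._ready = False

    def generate(self, text):
        if not self._ready:
            self.captures += 1  # stands in for CUDA graph capture
            self._ready = True
        return f"<audio for {text!r}>"

def start_server(model, warmup_text="warmup"):
    """Run one throwaway inference at startup so real requests skip the capture."""
    model.generate(warmup_text)
    return model
```

After start_server returns, the first real request hits an already-warm model, which is why only the cold-start row changes while the steady-state numbers stay the same.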

How to update

git clone https://github.com/mARTin-B78/dgx-spark-faster-qwen3-tts
# or if you already have it:
git pull

Then restart your container. The warmup happens automatically on startup; you’ll see “CUDA warmup complete — server ready.” in the logs before the first request is accepted.

Thank you! Just tested all that was out there today, landed on qwen3-tts and was just looking at how to optimize it when I stumbled on your post and git. I’m setting up to use it for bidirectional communications with Openclaw agents.

Have to say it is nice and fast.

There is a delay that depends on the length of the text.
I usually wait about 3 to 5 seconds after the LLM has generated the text.

So far the “thinking” is the longest delay, but if I switch off thinking the reply is useless.

A real conversation like with Gemini is not possible yet.

For people who are not big fans of the command line, I told the AI to build a Docker-containerised web app to clone, design and manage voices that can then be used with Faster Qwen3 TTS or Qwen3 Voice Designer.

It covers all the steps: importing an audio file, cropping, normalising, transcribing, naming the file, and adding it to the right folder. You can add notes and ratings to each voice and show or hide voices, so you can keep all the voices in your library but have only your favourites used by Qwen3 TTS.

Voice Cloning

Library

Voice Designing

It’s a work in progress. Would you like to see a how-to/tutorial for setting it up, or an installer?

@martinB78 Have you already pushed your WIP to your repo? I would like to see it.