Three times ( VoiceClone | VoiceDesign | CustomVoice ) - Faster-Qwen3-TTS for NVIDIA DGX Spark (GB10)

Hey Guys I thought I make a new Thread instead of continuously posting on “Support for Qwen3-TTS on DGX Spark (GB10) | torchaudio installation failure on ARM64” cause this turned more into a Tutorial / Playbook than solving the installation failure.

Playbook: Running faster-qwen3-tts as a Production Voice API on Your Local GB10 GPU

faster-qwen3-tts 2126MiB / 121.7GiB 1.75%
faster-qwen3-tts-customvoice 809.1MiB / 121.7GiB 0.65%
faster-qwen3-tts-voicedesign 984.9MiB / 121.7GiB 0.79%

Overview: This guide walks you through deploying faster-qwen3-tts as a persistent, multi-voice OpenAI-compatible REST API. By the end you’ll have up to three TTS endpoints — one for voice cloning, one for instruction-designed voices, and one for built-in speaker IDs — running in Docker on your local GPU and accessible to any OpenAI-compatible client (Open WebUI, llama-swap, your own app). You do not need all three - but you can decide for yourself which VoiceClone | VoiceDesign | CustomVoice you will use.
You can find more details about Qwen3 TTS and the different Qwen3-TTS Models here.


What you’re building

┌────────────────────────────────────────────────────────┐
│  Client (Open WebUI / curl / your app)                 │
│       POST /v1/audio/speech  (OpenAI API format)       │
└────────┬───────────────┬──────────────────┬────────────┘
         │               │                  │
   port 8020       port 8021          port 8022
         │               │                  │
  ┌──────▼──────┐ ┌──────▼────────┐ ┌───────▼────────┐
  │ VoiceClone  │ │VoiceDesign    │ │  CustomVoice   │
  │ (*-Base)    │ │(*-VoiceDesign)│ │(*-CustomVoice) │
  │ref audio+   │ │text instruct  │ │ speaker name   │
  │JSON config  │ │→ voice        │ │ (built-in)     │
  └─────────────┘ └───────────────┘ └────────────────┘
         └───────────────┴──────────────────┘
              All powered by CUDA-graph inference
              (2–9× faster than vanilla PyTorch)

Prerequisites

  • NVIDIA GPU with CUDA (any modern consumer or datacenter card)
  • Docker + NVIDIA Container Toolkit
  • The faster-qwen3-tts Docker image (faster-qwen3-tts-dgx-spark:v4) built from the repo
  • Model weights downloaded from Hugging Face (pick one or all three):
    • Qwen/Qwen3-TTS-12Hz-1.7B-Base — voice cloning
    • Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign — instruction-based voices
    • Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice — built-in speakers

Step 1 — Clone the repo and build the image

git clone https://github.com/andimarafioti/faster-qwen3-tts
cd faster-qwen3-tts
docker build -t faster-qwen3-tts-dgx-spark:v4 -f dockerfile .

Step 2 — Set up your voice config files

VoiceDesign (config/voicedesign_voices.json) — describe voices in plain English:

{
  "narrator": {
    "instruct": "Warm, confident narrator with a slight British accent",
    "language": "English"
  },
  "assistant_de": {
    "instruct": "Freundliche, klare Sprecherin, Hochdeutsch, professionell",
    "language": "German"
  }
}

CustomVoice (config/customvoice_voices.json) — pick from the model’s built-in speakers:

{
  "Ryan":     { "speaker": "Ryan",     "language": "English"  },
  "Aiden":    { "speaker": "Aiden",    "language": "English"  },
  "Ono_Anna": { "speaker": "Ono_Anna", "language": "Japanese" },
  "Sohee":    { "speaker": "Sohee",    "language": "Korean"   }
}

For VoiceClone, prepare a WAV reference file and a matching transcription per voice.


Step 3 — Start the services

Copy config/docker-compose.yml and adjust the volume paths for your system, then:

cd config
docker compose up -d

Check both services are healthy:

curl http://localhost:8020/health   # VoiceClone
curl http://localhost:8021/health   # VoiceDesign

Step 4 — Generate your first audio

Standard request (uses voice config defaults):

curl http://localhost:8021/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"tts-1","input":"Welcome to the show.","voice":"narrator"}' \
  --output speech.wav

Per-request override (change language or style on the fly):

curl http://localhost:8021/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Herzlich willkommen.",
    "voice": "narrator",
    "language": "German",
    "instruct": "Speak slowly and warmly.",
    "max_new_tokens": 1024
  }' --output speech_de.wav

Per-request fields always win over the voice config entry. You can design one voice personality in JSON and let callers adjust language or tone without creating separate entries.


Step 5 — Connect to Open WebUI or llama-swap

Point your TTS URL at http://your-host:8020 (or 8021/8022). The servers implement the full OpenAI /v1/audio/speech contract including the /v1/models voice list, so no special configuration is needed — just swap the URL.


Step 6 — Benchmark your setup

Verify CUDA graphs are active and measure real-world latency:

python config/benchmark_api.py --host localhost --port 8021 --runs 5

What the numbers mean:

  • TTFA — time-to-first-audio: how long until the client can start playing. Target: under 400 ms on modern GPUs.
  • RTF — real-time factor: audio_duration / generation_time. RTF > 1.0 = faster than real-time. With CUDA graphs expect 2–5× on most GPUs.
  • Speed — same thing expressed as “Nx real-time” for intuitive reading.

If TTFA is unexpectedly high (seconds, not ms), the CUDA graph failed to capture — check your Docker logs for a capture error and ensure --max-seq-len matches across restart.


API reference

POST /v1/audio/speech

Field Type Default Description
model string "tts-1" Ignored (for OpenAI compat)
input string required Text to synthesize
voice string first entry Voice ID from your JSON config
response_format string "wav" wav, pcm, or mp3
language string from voice config Override language for this call
instruct string from voice config Override voice instruction (VoiceDesign/CustomVoice)
max_new_tokens int 2048 Max codec tokens to generate

WAV and PCM are streamed in real time as the model generates. MP3 requires pydub and is returned non-streaming.


Tips

  • Multiple GPUs: Each container gets NVIDIA_VISIBLE_DEVICES=all. If you want to pin a model to a specific GPU, set NVIDIA_VISIBLE_DEVICES=0 (VoiceClone) and NVIDIA_VISIBLE_DEVICES=1 (VoiceDesign) per service.
  • Memory: The 1.7B models use ~6 GB VRAM in bfloat16. Three services on the same card is tight — consider the 0.6B variants for multi-service setups.
  • Cold start: The first request after container start triggers CUDA graph capture (~30 s). Subsequent requests run at full CUDA-graph speed.
  • Long text: Default --max-seq-len 2048 handles most sentences. For long-form narration, raise it to 4096 (more VRAM required).

Troubleshooting

Symptom Likely cause Fix
503 Model not loaded Server still loading / CUDA graph capture in progress Wait 30–60 s after container start
404 Voice not found Voice ID not in JSON config Check spelling; call /speakers to list valid IDs
Very high TTFA (>5 s) CUDA graph capture failed, running in fallback mode Check container logs for capture error; reduce --max-seq-len
MP3 output error pydub not installed in container Use wav or pcm format, or rebuild image with pydub

Repos:

This is pretty great, I’ve been looking to make more use of qwen3 tts

Thanks,

I updated it to make it complete - its also good to experience which Qwen3 TTS LLM you want to use.

I am currently building a similar stack for Nvidia Parakeet too and a GUI for Voice Synthesis.

Update: Streaming endpoint added + Docker images now on Docker Hub (no local build needed)

Hey everyone,

A few updates since the original post:


🆕 Streaming voice clone (port 8023)

Added a fourth service to the stack — a streaming variant of the voice clone backend. Instead of waiting for the full audio to generate before playback starts, it streams WAV chunks back while still generating. Great for real-time applications or anything where perceived latency matters.

Same setup as the voice clone service (port 8020) — same Base model, same reference .wav files — just streams the output instead of buffering it. max-seq-len is bumped to 4096 to handle longer inputs without cutting off.


📦 Docker images now on Docker Hub — no local build required

Both images are now published and publicly available:

Docker will pull them automatically when you deploy the stack. No need to clone the repo and run docker build first.

The GitHub repo is also updated: GitHub - mARTin-B78/dgx-spark-faster-qwen3-tts: Run Faster-Qwen3-TTS on NVIDIA DGX Spark GB10 (ARM64/SM121/CUDA13) - OpenAI-compatible TTS API with CUDA graph acceleration · GitHub


🐳 Full Portainer stack (all 4 services)

Paste this directly into Portainer → Stacks → Add stack. Adjust the three paths to match your local setup.

# ─────────────────────────────────────────────────────────────────────────────
#  Qwen3-TTS GPU Stack for NVIDIA DGX Spark
#  GitHub: https://github.com/mARTin-B78/dgx-spark-faster-qwen3-tts
#
#  Four OpenAI-compatible TTS backends:
#    8020  ->  Voice Clone   (/v1/audio/speech, reference audio)
#    8021  ->  VoiceDesign   (text prompt describes the voice, no reference needed)
#    8022  ->  CustomVoice   (separate CustomVoice model variant)
#    8023  ->  Streaming     (same as 8020 but streams WAV chunks while generating)
#
#  ── BEFORE YOU START ──────────────────────────────────────────────────────
#
#  1. Download the models from Hugging Face into a local folder:
#       Qwen3-TTS-12Hz-1.7B-Base        (used by VoiceClone + Streaming)
#       Qwen3-TTS-12Hz-1.7B-VoiceDesign (used by VoiceDesign)
#       Qwen3-TTS-12Hz-1.7B-CustomVoice (used by CustomVoice)
#
#  2. Clone the config scripts from the GitHub repo:
#       git clone https://github.com/mARTin-B78/dgx-spark-faster-qwen3-tts.git
#       The config/ subfolder is what you need.
#
#  3. Put your speaker reference .wav files in a folder (VoiceClone + Streaming only).
#     generate_voices.py scans them on startup and builds voices.json automatically.
#
#  4. Replace every ▼ path below with your actual paths.
#
#  5. Create the external network once if it doesn't exist yet:
#       docker network create dgx_net
#     Or replace dgx_net with any bridge network you already use.
#
#  6. In Portainer → Stack → Environment variables, set:
#       HF_TOKEN = your Hugging Face token  (only needed if models are gated)
# ─────────────────────────────────────────────────────────────────────────────

services:

  # ── Voice Clone ─────────────────────────────────────────────────────────────
  # OpenAI-compatible /v1/audio/speech. Clones a voice from a reference .wav file.
  faster-qwen3-tts-voiceclone:
    image: martinb78/faster-qwen3-tts-dgx-spark:v4
    container_name: faster-qwen3-tts-voiceclone
    restart: unless-stopped
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN}
    ports:
      - "8020:8000"
    volumes:
      - /your/path/to/models/Qwen3-TTS-12Hz-1.7B-Base:/models/Qwen3-TTS:ro   # ▼
      - /your/path/to/faster-qwen3-tts/config:/config:rw                      # ▼
      - /your/path/to/voices:/voices:ro                                        # ▼
    command: >
      /bin/bash -c "
      python3 /config/generate_voices.py &&
      python3 /config/run_server.py
      --model /models/Qwen3-TTS
      --voices /config/voices.json
      --port 8000
      --max-seq-len 2048
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    networks:
      - dgx_net

  # ── VoiceDesign ──────────────────────────────────────────────────────────────
  # Generates speech from a descriptive text prompt — no reference audio needed.
  # Example: "A calm, deep male voice with a slight British accent."
  faster-qwen3-tts-voicedesign:
    image: martinb78/faster-qwen3-tts-dgx-spark:v4
    container_name: faster-qwen3-tts-voicedesign
    restart: unless-stopped
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN}
    ports:
      - "8021:8000"
    volumes:
      - /your/path/to/models/Qwen3-TTS-12Hz-1.7B-VoiceDesign:/models/Qwen3-TTS-VoiceDesign:ro   # ▼
      - /your/path/to/faster-qwen3-tts/config:/config:rw                                         # ▼
    command: >
      /bin/bash -c "
      python3 /config/run_voicedesign_server.py
      --model /models/Qwen3-TTS-VoiceDesign
      --voices /config/voicedesign_voices.json
      --port 8000
      --max-seq-len 2048
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    networks:
      - dgx_net

  # ── CustomVoice ──────────────────────────────────────────────────────────────
  # Uses the CustomVoice model variant with built-in speaker selection.
  faster-qwen3-tts-customvoice:
    image: martinb78/faster-qwen3-tts-dgx-spark:v4
    container_name: faster-qwen3-tts-customvoice
    restart: unless-stopped
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN}
    ports:
      - "8022:8000"
    volumes:
      - /your/path/to/models/Qwen3-TTS-12Hz-1.7B-CustomVoice:/models/Qwen3-TTS-CustomVoice:ro   # ▼
      - /your/path/to/faster-qwen3-tts/config:/config:rw                                         # ▼
    command: >
      /bin/bash -c "
      python3 /config/run_customvoice_server.py
      --model /models/Qwen3-TTS-CustomVoice
      --voices /config/customvoice_voices.json
      --port 8000
      --max-seq-len 2048
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    networks:
      - dgx_net

  # ── Streaming Voice Clone ────────────────────────────────────────────────────
  # Same voice cloning as port 8020, but streams WAV chunks back while generating.
  # Lower perceived latency for real-time applications.
  faster-qwen3-tts-streaming:
    image: martinb78/qwen3-tts-streaming-dgx-spark:latest
    container_name: faster-qwen3-tts-streaming
    restart: unless-stopped
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - HF_TOKEN=${HF_TOKEN}
      - PYTHONUNBUFFERED=1
      - QWEN_TTS_MODEL=/models/Qwen3-TTS
      - QWEN_TTS_VOICES=/config/voices.json
      - QWEN_TTS_MAX_SEQ_LEN=4096
    ports:
      - "8023:8000"
    volumes:
      - /your/path/to/models/Qwen3-TTS-12Hz-1.7B-Base:/models/Qwen3-TTS:ro   # ▼ same Base model as VoiceClone
      - /your/path/to/faster-qwen3-tts/config:/config:rw                      # ▼
      - /your/path/to/voices:/voices:ro                                        # ▼
    command: >
      /bin/bash -c "
      python3 /config/generate_voices.py &&
      python3 /config/run_server.py
      --model /models/Qwen3-TTS
      --voices /config/voices.json
      --port 8000
      --max-seq-len 4096
      "
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    networks:
      - dgx_net

networks:
  dgx_net:
    external: true
    # Create once with: docker network create dgx_net
    # Or replace with any bridge network you already use.

Hope this makes it easier for others to get started without the build step. Let me know if you run into any issues.

Hi. Would love to get this running. But not working for me.
Tried your stack template and run_customvoice_server.py is not part of your github repo.
python3: can’t open file ‘/config/run_customvoice_server.py’: [Errno 2] No such file or directory

Same with run_voicedesign_server.py

python3: can’t open file ‘/config/run_voicedesign_server.py’: [Errno 2] No such file or directory

Also streaming is not workig. I do not get a valid port, voices.json is empty and this in the logs:

Success! Generated voices.json with 0 mapped voices.

Traceback (most recent call last):

File “/config/run_server.py”, line 110, in

openai_server.main()

File “/app/examples/openai_server.py”, line 325, in main

default_voice = next(iter(voices))

                ^^^^^^^^^^^^^^^^^^

StopIteration

Could you please give more hints? What have to be in the speakers folder which is also empty in your repo?

Many thanks, Christian

Hi Christian,
thanks for testing and for the detailed error report.

I let the AI analyse your post and here is the reply.

You are right: the repo/template was ahead of the public files for a moment. it was updated the GitHub repo/README and the missing files should now be there:

  • config/run_voicedesign_server.py
  • config/run_customvoice_server.py
  • config/docker-compose.yml
  • config/voicedesign_voices.json
  • config/customvoice_voices.json

Please pull the latest version:

git pull

The public Docker images are also available now:

docker pull martinb78/faster-qwen3-tts-dgx-spark:v4 docker pull martinb78/qwen3-tts-streaming-dgx-spark:latest

The full stack exposes four OpenAI-compatible TTS backends:

8020  ->  VoiceClone   (/v1/audio/speech, reference audio)
8021  ->  VoiceDesign  (text prompt describes the voice, no reference needed)
8022  ->  CustomVoice  (separate CustomVoice model variant)
8023  ->  Streaming    (same as 8020 but streams WAV chunks while generating)

About the empty speakers folder: that is expected.
For VoiceClone and Streaming you need to add your own reference audio files there.

here are some sources

https://aiartes.com/voiceai
https://sample-files.com/downloads/audio/wav/voice-sample.wav
https://freesound.org/people/Scott%20Simpson/
https://lanceblairvo.com/raw-voiceover-samples/
https://github.com/yaph/tts-samples/tree/main/mp3
https://github.com/jim-schwoebel/voice_datasets

Example:

config/speakers/
  EN_M_Test.wav
  EN_M_Test.reference.txt

The .reference.txt file should contain the exact text spoken in the WAV file. Best results are usually with clean 5-15 second reference clips.

voices.json is generated automatically from that folder. If the folder is empty, you get:

Success! Generated voices.json with 0 mapped voices.
StopIteration

That means the VoiceClone/Streaming server has no default voice to load. So either add at least one reference voice, or comment out the 8020 and 8023 services if you only want to test VoiceDesign/CustomVoice first.

VoiceDesign and CustomVoice do not need the speakers folder. They use:

config/voicedesign_voices.json
config/customvoice_voices.json

After adding at least one reference voice, start again:

cd config
docker compose up -d

Then check:

curl http://localhost:8020/health
curl http://localhost:8021/health
curl http://localhost:8022/health
curl http://localhost:8023/health

I also updated the README with the four-container layout and the public image names. Thanks again for catching the missing-file issue.

Great! Thx so much.

Thanks for your help, unfortunately i still couldnt get it running. Here are a few minor issues I’ve noticed that cause a bit of uncertainty in the process

a) git pull

*config/run_voicedesign_server.py
*config/run_customvoice_server.py

doesnt show up, and i can´t find them in the github as well.

b) Could you add (as an example, just for starting) as well

config/speakers/
EN_M_Test.wav
EN_M_Test.reference.txt

c) to we need /voices and /speakers as subdirectories? So adding both as standard, would be nice

d) Do wen need git clone https://github.com/**andimarafioti**/faster-qwen3-tts and your https://github.com/**mARTin-B78**/dgx-spark-faster-qwen3-tts.git (or is andimarafioti just an old artifact).

e) For Nvidia Toolkit
Maybe it also has to be registered it for docker (if it’s not done before)
sudo nvidia-ctk runtime configure --runtime=docker

Thanx, again.

Thanks for testing it out and providing such detailed feedback! Here are the answers to your points:

a) Missing files after git pull
You are completely right. The run_voicedesign_server.py and run_customvoice_server.py scripts were only committed to my local branch and had not been pushed to GitHub’s main branch. I’ve just pushed them, so a fresh git pull will now bring them in!

b) Adding an example speaker for testing
Great idea. I’ve just updated the repository with two real voice samples to get you started immediately:

  • EN_M_WilliamNeural.mp3
  • EN_F_NatashaNeural.mp3
    along with their matching .reference.txt transcripts. You can find them in the config/speakers directory.

c) Do we need /voices and /speakers as subdirectories?
You actually don’t need both! The system scans both paths to support two different workflows:

  • /config/speakers is the easy, default location. You can drop files in there, and since /config is mounted read/write, it’s great for quick testing or small collections.
  • /voices is meant to be a read-only external mount. If you have a huge external directory on your host with hundreds of voices, you can mount it to /voices without cluttering your server configuration folder.
    So you can pick whichever method works best for your setup!
    I’ve also pushed a blank /voices folder to the root of the repository to ensure it exists right from the start.

d) Do we need to git clone https://github.com/andimarafioti/faster-qwen3-tts too?
No, you only need to clone my repository (mARTin-B78/dgx-spark-faster-qwen3-tts). The Docker build process automatically clones the upstream andimarafioti code inside the container when it’s being built, so you don’t have to manage it manually.

e) Nvidia Toolkit configuration (sudo nvidia-ctk runtime configure --runtime=docker)
Absolutely. The NVIDIA Container Toolkit needs to be configured for Docker to access the GPU. Good catch! I’ve just updated the README’s Hardware Requirements section to explicitly mention this command to make the setup smoother for others.

Let me know if the updated git pull gets everything working for you!

Just for your

Just a quick update. With your latest changes, my dry runs* for two model types (VoiceDesign and CustomVoice) have been running fine … My Spark is running headless and I’m on the go… but I’m trying to get some audio out of my KVM… then I can do a bit more testing… So far, so good, thankx.

Glad to hear the dry runs are working smoothly!

I am pretty much looking for the perfect. STT / TTS system and I am not a Commandline interface guy…

So I build this and a GUI (still under construction) to feed multiple TTS LLMs to it and be able to test different Voices and different Methods to get the most out of it regarding to speed and quality, and flexibility.

And I realised there are different ways to generate Voices. Eighter from a wav sample file, or a Description prompt or even a combination of that.

Qwen3 Stack

Nvidia Parakeet Stack

I also implemented a API into that GUI so it can work as a Router for TTS including a custom chime sound if you want. And you need only one address for all the TTS processing.

But for now I can tell you this

For when you’re ready to test the audio output, here is the routing cheat sheet for configuring your clients or scripts. All modes use the standard OpenAI-compatible endpoint (/v1/audio/speech), you just need to point to the correct port on your DGX Spark:

**1. VoiceClone (Base Model)**

- **Port:** `8020`

- **Base URL:** `http://<YOUR_SPARK_IP>:8020/v1`

- **Speech Endpoint:** `POST http://<YOUR_SPARK_IP>:8020/v1/audio/speech`

- **List Voices Endpoint:** `GET http://<YOUR_SPARK_IP>:8020/speakers`

**2. VoiceDesign Model**

- **Port:** `8021`

- **Base URL:** `http://<YOUR_SPARK_IP>:8021/v1`

- **Speech Endpoint:** `POST http://<YOUR_SPARK_IP>:8021/v1/audio/speech`

**3. CustomVoice Model**

- **Port:** `8022`

- **Base URL:** `http://<YOUR_SPARK_IP>:8022/v1`

- **Speech Endpoint:** `POST http://<YOUR_SPARK_IP>:8022/v1/audio/speech`

*(Optional)* **4. Low-Latency Streaming (VoiceClone)**

- **Port:** `8023`

- **Base URL:** `http://<YOUR_SPARK_IP>:8023/v1`

- **Speech Endpoint:** `POST http://<YOUR_SPARK_IP>:8023/v1/audio/speech`

If you are using a client like OpenWebUI, just put http://<YOUR_SPARK_IP>:8020/v1 as the TTS API URL and any dummy string for the API Key.

If you are deciding which model endpoint to test first, here is a quick summary of what each one does so you can choose the best fit for your project:

1. VoiceClone (Base Model on Port 8020)

  • How it works: You provide a 5-15 second .wav/.mp3 audio clip of someone speaking, along with a text transcript of what they said. The model analyzes it and clones their voice.
  • Best for: When you want to duplicate a specific real-world person (e.g., your own voice, a celebrity, or a specific character from a movie).

2. VoiceDesign (Port 8021)

  • How it works: No audio files needed! You just write a text prompt describing the voice you want (e.g., “Warm, confident narrator with a slight British accent” or “Grumpy old dwarf”), and the model generates a voice that matches that description.
  • Best for: Roleplaying games (SillyTavern, FoundryVTT), audiobooks, or whenever you need to quickly spin up a unique character voice without having to hunt down a clean audio sample.

3. CustomVoice (Port 8022)

  • How it works: Uses a set of pre-trained, built-in synthetic speakers (like “Ryan” or “Ono_Anna”) that are baked directly into the Qwen3-TTS model. No audio samples or text prompts required.
  • Best for: When you just want a highly reliable, crystal-clear, standard TTS voice (like a Siri or Alexa voice) and don’t need a specific clone or a wildly unique character.

4. Low-Latency Streaming (Port 8023)

  • How it works: This uses the exact same model (Base) and the same reference audio files as the standard VoiceClone (Port 8020), but it uses a specialized streaming image (qwen3-tts-streaming-dgx-spark:latest). Instead of waiting for the entire sentence to finish generating before returning the audio, it sends the audio back in small “chunks” (WAV streams) as soon as they are ready.

  • Best for: Real-time conversational AI and Voice Assistants (like Home Assistant pipelines). Because it streams the audio, the “Time to First Audio” (TTFA) drops to just a few hundred milliseconds, meaning the AI starts speaking almost instantly, resulting in a much more natural, interruption-free conversation.

If you just want to test if the system is generating audio properly without messing with files, CustomVoice is usually the easiest one to test first!

If you do only need one LLM for your voice just comment out the others in the Portainer Stack or remove the code.

Let me know how the audio sounds once you get your KVM routing sorted!

I am also working on getting the nvidia STT and TTS to work

NVIDIA Parakeet/Magpie Speech Stack Routing

This stack uses a central Router to handle requests, but you can also talk to the TTS and ASR containers directly if you prefer.

1. Speech Router (The Gateway)

  • Port: 8090

  • What it does: This is the main entry point. You send all your requests here, and it automatically forwards TTS requests to the Magpie container and ASR requests to the Parakeet container.

  • Base URL: http://<YOUR_SPARK_IP>:8090/v1

2. Magpie TTS (Text-to-Speech)

  • Port: 8091

  • Model: nvidia/magpie_tts_multilingual_357m

  • What it does: The actual TTS engine. You can bypass the router and talk directly to this port if you only want to generate speech.

  • Base URL: http://<YOUR_SPARK_IP>:8091/v1

  • Speech Endpoint: POST http://<YOUR_SPARK_IP>:8091/v1/audio/speech

3. Parakeet ASR (Speech-to-Text)

  • Port: 8092

  • Model: nvidia/parakeet-tdt-0.6b-v3

  • What it does: The transcription engine. You send it audio files, and it returns the text transcript.

  • Base URL: http://<YOUR_SPARK_IP>:8092/v1

  • Transcription Endpoint: POST http://<YOUR_SPARK_IP>:8092/v1/audio/transcriptions

Instead of parakeet which is not good für understanding a lot of languages I made excellent experiences with GitHub - Mekopa/whisperx-blackwell: GPU-accelerated WhisperX on NVIDIA Blackwell (SM_121) - DGX Spark compatible · GitHub .

This is a faster whisperx incl. pyanote for speaker and timestamp. With the v3 turbo model it is quit fast an a GB10.

A few minor hiccups, but they’re surely easy to fix. Okay, I now have audio over KVM (GL.iNet Comet Pro)—I never actually thought I’d need this.

a) curl http://localhost:8021/health
Just a quick note: you should wait a bit before sending the command (so the server has time to process it).

b) My primary interest right now is VoiceDesign. The command

curl http://localhost:8021/v1/audio/speech
-H “Content-Type: application/json”
-d ‘{“model”:“tts-1”,“input”:“I am Anna, welcome to the show.”,‘voice’:“anna_en”}’
–output anna_en_v05.wav

unfortunately didn’t work (VoiceDesign.json is derived from the template). It only generates an empty .wav (or .mp3) file

-H “Content-Type: application/json”
-d ‘{“model”:“tts-1”,“input”:“Welcome to the show.”,‘voice’:“anna_en”}’
–output anna_en!$.wav
–output anna_en4462d677e975.wav
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 110 0 44 100 66 10677 16015 --:–:-- --:–:-- --:–:-- 27500
curl: (18) transfer closed with outstanding read data remaining

b1) Initial guess: Does tts-1 need to be replaced?
If so, what is its name?

b2) My attempt to find the model using curl http://localhost:8021/v1/models only returned the voices (anna_de, anna_en)

c) A Docker logs entry with ID 12234566 returned the following error:

File “/config/run_voicedesign_server.py”, line 115, in producer
for chunk, _sr, _timing in tts_model.generate_voice_design_streaming(**params):
File “/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py”, line 40, in generator_context
response = gen.send(None)
^^^^^^^^^^^^^^
File “/app/faster_qwen3_tts/model.py”, line 1281, in generate_voice_design_streaming
raise ValueError(“Loaded model does not support voice design generation”)
ValueError: Loaded model does not support voice design generation

?

Hi _cjg,
I have posted your question to the AI and here is the reply..
probably better than I could write it.

Great to hear you got audio working over KVM!

a) Health Check Timing
Good catch on the timing! I actually just pushed an update to the repository that completely fixes this. The API server will now start instantly, and the /health endpoint will respond immediately with {"status": "ok", "model_loaded": false} while the 6GB model loads in the background. I also added an automatic CUDA graph “warmup” step—so once the model is loaded, your very first curl request will be lightning fast instead of timing out during compilation!

b), b1), & c) VoiceDesign Empty File and ValueError
Your initial guess in b1 is spot on, and it perfectly explains the error you found in your Docker logs in c (ValueError: Loaded model does not support voice design generation).

The tts-1 model name in your curl command is perfectly fine (it’s kept there for OpenAI client compatibility). The issue is that the underlying model weights loaded into your container on port 8021 are currently the standard “VoiceClone” (Base) model, rather than the specific VoiceDesign model. Qwen3-TTS uses completely different model weights for each of these capabilities. When the API tries to route your VoiceDesign request to the Base model, it crashes and closes the transfer, resulting in the empty .wav file you saw in b.

How to fix it:
You just need to ensure the correct VoiceDesign model is downloaded and mounted to your container running on port 8021.

  1. Make sure you’ve downloaded the specific VoiceDesign model:
    huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --local-dir /path/to/Qwen3-TTS-12Hz-1.7B-VoiceDesign
    
  2. Check your docker-compose.yml (or your docker run command) for the VoiceDesign service (port 8021) and ensure the volume mount points specifically to that VoiceDesign folder. For example:
    volumes:
      - /path/to/Qwen3-TTS-12Hz-1.7B-VoiceDesign:/models/Qwen3-TTS-VoiceDesign:ro
    

b2) The /v1/models Endpoint
Returning the voices (anna_de, anna_en) here is actually the expected behavior for this specific API wrapper. It acts as a compatibility bridge for clients like OpenWebUI, populating the client’s “models” dropdown with your available voice personalities rather than exposing the raw underlying Qwen3 model names.


Once you map the correct model folder to port 8021, your curl command with the anna_en voice should work perfectly! Let me know if you run into any other hiccups.

thanks for the tip - will try that

Ok,
i did some random changes, now voice_design is running. Will test it tomorrow. Thanx.

Hi, I’ve tested VoiceClone and Streaming a bit now. With both the VoiceClone model and Streaming, the voices/languages aren’t consistent. The timbre changes quite significantly, so they aren’t perceived as a single voice. This is a deal-breaker for audio drama generation, meaning it’s basically unusable as is.

a) Could you implement the “temperature” and “top_p” parameters in the config so we can test different values here?
b) Do you have an additional parameter that can be used to stabilize the voice?
c) Do you know if this can also be mitigated by optimizing the text sample? Do you have any experience with this?

Thanx

Hey cjg, thanks for your feedback. You are right the voice was not constant. I told the AI to fix it. - The new Docker Images are live on Docker Hub:

  • martinb78/faster-qwen3-tts-dgx-spark:v5 — the pinned release
  • martinb78/faster-qwen3-tts-dgx-spark:latest — updated to v5

Users can pull with:

docker pull martinb78/faster-qwen3-tts-dgx-spark:v5

Here is the reply.

Short story

a) temperature, top_k, and top_p are now configurable per voice in voices.json — v5 ships with defaults of 0.8 / 50 / 0.9; lower temperature and top_p toward 0.7 / 0.85 for audio drama work.

b) The biggest stabiliser beyond sampling parameters is a clean 8–12 s reference recording that matches the target style and pace — no parameter can compensate for a weak voice reference.

c) Yes: split long texts at sentence boundaries (~15–25 words per call), use proper punctuation, and keep each call stylistically uniform — short, well-punctuated segments are the single most effective thing you can do for timbre consistency.

Long story

The timbre changes quite significantly, so they aren’t perceived as a single voice. This is a deal-breaker for audio drama generation.

I found the root cause and fixed it in v5 — and your three questions led right to the relevant knobs to expose. Let me answer each.


What was actually broken

The VoiceClone server was using an internal text-feeding mode designed for streaming LLM→TTS pipelines (where text arrives token by token from a language model). In that mode only one text token enters the model’s KV cache during prefill. The rest of the text is injected one step per codec frame during decoding — until that buffer runs out.

For a typical 54-word paragraph that buffer covers only 4 seconds of speech. A sentence of that length takes ~18 seconds to say, so 77% of the audio is generated with the model having no idea what word it is supposed to say next. It free-runs, drifting in timbre and sometimes flipping gender entirely.

v5 switches to non_streaming_mode=True — the full text goes into the prefill and stays accessible throughout generation. VoiceDesign and CustomVoice already used this mode and were therefore unaffected.


a) temperature and top_p in voices.json — done in v5

Both are now per-voice configurable. Add them to any voice entry in voices.json:

{
  "William": {
    "ref_audio": "/voices/william.wav",
    "ref_text": "...",
    "language": "English",
    "temperature": 0.7,
    "top_k": 40,
    "top_p": 0.85
  }
}

The v5 defaults are temperature=0.8, top_k=50, top_p=0.9 — more conservative than before (was temperature=0.9, no top_p). For audio drama where voice consistency is critical, start around temperature=0.65–0.75 and top_p=0.85. Lower temperature reduces variance; top_p cuts the long tail of unlikely tokens that cause sudden timbre jumps.


b) Additional stabilisation parameters

Beyond temperature/top_p, there are a few more levers:

repetition_penalty — already present internally (default 1.05). If you hear looping or stuttering, lower it toward 1.0. If the voice wanders too much, a slightly higher value (1.07–1.10) pushes the model away from token patterns it has already settled into. You can expose this the same way once you find a useful range.

Reference audio quality matters more than any sampling parameter:

  • 8–12 seconds of clean, neutral-paced speech works best. Too short and the speaker identity is weak; too long risks exceeding the sequence length budget.
  • The reference should be spoken at a pace and energy level matching what you want generated. An excited reference tends to produce excited output even when reading calm prose.
  • Silence appended to the reference (the default append_silence=0.5s) prevents the last phoneme from bleeding into the start of the generated speech — keep that on.

Language tag — always set "language": "English" (or whichever language you need) explicitly rather than "Auto". With Auto the model makes its own guess and can subtly shift prosody.


c) Does text optimisation help?

Yes — significantly. The model is a language model generating codec tokens, so the text structure affects the output as much as the sampling parameters do.

  • Shorter segments sound more consistent. For audio drama, splitting at natural sentence boundaries (roughly 15–25 words per call) keeps each generation short enough that the model never leaves the well-conditioned zone. Stitch the WAV chunks in post; the boundary artefacts are minimal.
  • Punctuation guides prosody. Commas give the model pause cues; em-dashes work well for dramatic beats. Run-on sentences with no punctuation tend to produce flat or wandering intonation.
  • Avoid abrupt style shifts within one call. A single API call that goes from narration to dialogue to narration again can trip the model. Keep each call stylistically uniform and let your pipeline manage the transitions.
  • The reference transcript matters. The ref_text should be the verbatim transcript of your reference audio — not a paraphrase. A mismatch between what the model “hears” in the codec tokens and what the text says can destabilise the conditioning.
  • Match the register. If your reference audio is a calm, neutral reading, the model clones that register well. If you then ask it to generate something highly emphatic, it may drift. Either record a reference that matches the target style, or expect some variance.

How to update to v5

docker pull martinb78/faster-qwen3-tts-dgx-spark:v5

Update the image tag in your docker-compose.yml and restart. No other changes needed — temperature, top_k, and top_p are optional in voices.json; existing configs without them fall back to the new defaults.


Update — 2026-05-30 (v6): Docker file structure and streaming image consolidated

The repo has been cleaned up:

One Docker Hub repo instead of two. The separate martinb78/qwen3-tts-streaming-dgx-spark image has been removed. The streaming service now lives as a tag in the same repo:

martinb78/faster-qwen3-tts-dgx-spark:latest    # VoiceClone, VoiceDesign, CustomVoice
martinb78/faster-qwen3-tts-dgx-spark:streaming # Streaming VoiceClone

Compose files moved to docker/:

Before After
docker-compose.yml (root) docker/docker-compose.simple.yml
config/docker-compose.yml docker/docker-compose.yml

If you copied the old config/docker-compose.yml into Portainer, update the streaming service image from martinb78/qwen3-tts-streaming-dgx-spark:latest to martinb78/faster-qwen3-tts-dgx-spark:streaming.

Hi, thanks so much. I’ll test the new implementations soon. Two questions:

a) I have an organizational issue with
– martinb78/faster-qwen3-tts-dgx-spark:latest because I might just end up using an older version. Could you offer a – martinb78/faster-qwen3-tts-dgx-spark:v6 … v7 … v8 in addition to the – martinb78/faster-qwen3-tts-dgx-spark:latest? :latest would then refer to the latest version, and we can set :streaming ourselves anyway. What do you think?

b) Just to clarify, I think the addition to voices.json is great. Is this actually read in with every text input/forwarding, or only once during container generation? I can and will test it myself. I was just curious because the voices.json is generated, and if it isn’t always read in, we can’t manually adjust the temp parameters, right?

These are just cosmetic details, but surely you or your AI can answer that easily?