Hello,
This guide walks through setting up vllm-omni for running the Qwen3-TTS model family across different NVIDIA hardware platforms. Source: Qwen3-TTS - vLLM-Omni
Create and activate a dedicated virtual environment using uv:
uv venv .vllm --python 3.12
source .vllm/bin/activate
Install System Dependencies
Audio processing requires ffmpeg and sox:
sudo apt-get update
sudo apt-get install ffmpeg sox -y
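To confirm the install worked before moving on, a small check along these lines may help. It is a minimal sketch that only tests whether the two binaries are on PATH; the helper name `missing_tools` is illustrative and not part of vLLM-Omni.

```python
import shutil


def missing_tools(tools=("ffmpeg", "sox")):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("ffmpeg and sox are available")
```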
Install vLLM (Platform-Specific)
For x86_64 Machines (CUDA 13.0)
uv pip install \
https://github.com/vllm-project/vllm/releases/download/v0.16.0/vllm-0.16.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl \
--extra-index-url https://download.pytorch.org/whl/cu130 \
--index-strategy unsafe-best-match
For ARM64 Platforms: DGX Spark & Jetson Thor (CUDA 13.0)
uv pip install \
https://github.com/vllm-project/vllm/releases/download/v0.16.0/vllm-0.16.0+cu130-cp38-abi3-manylinux_2_35_aarch64.whl \
--extra-index-url https://download.pytorch.org/whl/cu130 \
--index-strategy unsafe-best-match
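If the same setup script has to run on both kinds of machine, the wheel choice can be derived from the reported CPU architecture. The sketch below uses the two wheel URLs from this guide; the helper name `wheel_for` and the alias table ("arm64", "amd64") are assumptions added for illustration.

```python
import platform

BASE = "https://github.com/vllm-project/vllm/releases/download/v0.16.0"

# Wheel URLs copied from the install commands in this guide.
WHEELS = {
    "x86_64": f"{BASE}/vllm-0.16.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl",
    "aarch64": f"{BASE}/vllm-0.16.0+cu130-cp38-abi3-manylinux_2_35_aarch64.whl",
}

# Assumed aliases: some systems report these names instead.
ALIASES = {"arm64": "aarch64", "amd64": "x86_64"}


def wheel_for(machine=None):
    """Return the vLLM wheel URL for a machine string (default: this host)."""
    name = (machine or platform.machine()).lower()
    name = ALIASES.get(name, name)
    if name not in WHEELS:
        raise ValueError(f"No wheel listed in this guide for architecture: {name}")
    return WHEELS[name]


if __name__ == "__main__":
    print(wheel_for())
```

The printed URL can then be passed to the `uv pip install` command in place of the hard-coded one.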
Build vLLM-Omni from Source
Building from source is required for the latest features and custom modifications:
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
The fa3-fwd package does not provide aarch64 wheels. If you’re on DGX Spark or Jetson Thor:
- Edit vllm-omni/requirements/cuda.txt
- Remove or comment out:
fa3-fwd==0.0.2
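The manual edit above can also be scripted. The following is a minimal sketch, assuming the requirements file has one requirement per line; `comment_out_requirement` is a hypothetical helper, not something shipped with vLLM-Omni.

```python
import re
from pathlib import Path


def comment_out_requirement(text, package="fa3-fwd"):
    """Prefix every active requirement line for `package` with '# '."""
    # Match the package name only when it is not part of a longer name.
    pattern = re.compile(rf"^{re.escape(package)}(?![A-Za-z0-9._-])")
    lines = []
    for line in text.splitlines():
        if pattern.match(line.strip()):
            lines.append("# " + line)
        else:
            lines.append(line)
    trailing_newline = "\n" if text.endswith("\n") else ""
    return "\n".join(lines) + trailing_newline


if __name__ == "__main__":
    # Path relative to the vllm-omni repository root.
    path = Path("requirements/cuda.txt")
    if path.exists():
        path.write_text(comment_out_requirement(path.read_text()))
```

Lines that are already commented out are left unchanged, so the script is safe to run more than once.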
Then install the package in editable mode:
uv pip install -e .
Install Flash Attention
- Flash Attention 3 (fa3-fwd) is not compatible with Blackwell GPUs.
- Flash Attention 2 (flash-attn) remains the recommended backend for Blackwell and ARM64 platforms.
- If you see:
WARNING: No Flash Attention backend found, using pytorch SDPA implementation
→ Flash Attention was not installed correctly.
Install Flash Attention 2 from Source
# Clone the official repository
git clone --depth=1 https://github.com/Dao-AILab/flash-attention ./flash-attention
cd flash-attention
# Set build environment variables
export MAX_JOBS=16 # Limit parallel compilation jobs (adjust based on RAM/CPU)
export NVCC_THREADS=2 # Reduce NVCC thread count to avoid OOM during build
export FLASH_ATTENTION_FORCE_BUILD="TRUE"
# Install without build isolation for better compatibility with uv
uv pip install -v --no-build-isolation .
Build time varies: roughly 15 minutes on high-end hardware. On Jetson Thor and DGX Spark, set MAX_JOBS=4 instead of 16.
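Once the build finishes, a quick way to see whether the package landed in the active environment is to ask Python whether the module can be located. This sketch only checks that `flash_attn` (the import name of the flash-attn package) is discoverable; it does not verify that the compiled kernels load on your GPU.

```python
import importlib.util


def module_available(module="flash_attn"):
    """Return True if `module` can be located in the current environment."""
    return importlib.util.find_spec(module) is not None


if __name__ == "__main__":
    if module_available():
        print("flash_attn found")
    else:
        print("flash_attn not found; expect the SDPA fallback warning")
```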
From the vllm-omni repository root (the --stage-configs-path below is relative to it), start the inference server with the Qwen3-TTS model:
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
--stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
--omni \
--port 8091 \
--trust-remote-code \
--enforce-eager
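Model loading can take a while, so it may be convenient to wait for the server before sending requests. The sketch below polls the /health route that appears in the server's route list; the helper names, retry count, and delay are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request


def server_ready(base_url="http://localhost:8091", timeout=2.0):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(base_url="http://localhost:8091", attempts=60, delay=5.0):
    """Poll the health route until it succeeds or the attempts run out."""
    for _ in range(attempts):
        if server_ready(base_url):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    print("ready" if wait_until_ready() else "server did not come up")
```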
With the server still running, open a second terminal, activate the same environment, and navigate to the example directory:
cd vllm-omni/examples/online_serving/qwen3_tts
Basic TTS Generation
python openai_speech_client.py \
--text "If you must run natively, you can attempt to build the package from source." \
--voice vivian \
--language English
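The same request can be sent without the example client by posting JSON to the /v1/audio/speech route listed in the server log. This is a rough sketch: the JSON field names (`input`, `voice`, `language`) are assumed to mirror the OpenAI speech API and the client's flags, and the response is assumed to be raw audio bytes written to a .wav file. Check the example client's source if the server rejects the request.

```python
import json
import urllib.request


def build_speech_request(text, voice="vivian", language="English",
                         base_url="http://localhost:8091"):
    """Build a POST request for the speech route (field names are assumed)."""
    payload = {"input": text, "voice": voice, "language": language}
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="output.wav", **kwargs):
    """Send the request and write the response body to `out_path`."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path


if __name__ == "__main__":
    print(synthesize("Hello from a raw HTTP request."))
```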
Voice Cloning (Base Model)
python openai_speech_client.py \
--model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
--task-type Base \
--text "Hello, this is a cloned voice" \
--ref-audio /path/to/reference.wav \
--ref-text "Original transcript of the reference audio"
Expected server log for the basic TTS request:
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/audio/speech, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/audio/voices, Methods: GET
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/images/generations, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/images/edits, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/videos, Methods: POST
(APIServer pid=14365) INFO: Started server process [14365]
(APIServer pid=14365) INFO: Waiting for application startup.
(APIServer pid=14365) INFO: Application startup complete.
(APIServer pid=14365) INFO 02-21 16:25:04 [serving_speech.py:329] TTS speech request speech-93e6b197b84010fc: text='If you must run natively, you can attempt to build...', task_type=CustomVoice
(APIServer pid=14365) INFO 02-21 16:25:04 [async_omni.py:316] [AsyncOrchestrator] Entering scheduling loop: stages=2, final_stage=1
(Worker pid=15018) [Stage-1] INFO 02-21 16:25:07 [qwen3_tts_code2wav.py:183] Code2Wav codec: frames=25 q=16 uniq=356 range=[2,2021] head=[[1995, 1159, 355, 22, 1174, 1093, 625, 1814], [1028, 1800, 261, 826, 911, 1164, 1381, 1610]]
[Stage-1] WARNING 02-21 16:25:10 [output_processor.py:127] Error concatenating tensor for key sr; keeping last tensor
(APIServer pid=14365) INFO: 127.0.0.1:60310 - "POST /v1/audio/speech HTTP/1.1" 200 OK
This setup has been validated on DGX Spark with CUDA 13.0.