Running vLLM-Omni for Qwen3-TTS (voice design, voice clone) on DGX Spark

Hello,

This guide walks through setting up vLLM-Omni to run the Qwen3-TTS model family on NVIDIA x86_64 and ARM64 (DGX Spark, Jetson Thor) platforms. Source: Qwen3-TTS - vLLM-Omni

Create and activate a dedicated virtual environment using uv

uv venv .vllm --python 3.12
source .vllm/bin/activate

Install System Dependencies

Audio processing requires ffmpeg and sox:

sudo apt-get update
sudo apt-get install ffmpeg sox -y
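
To confirm that both tools ended up on PATH before continuing, a small check like the one below may help. It is a minimal sketch using only the Python standard library; the helper name missing_tools is illustrative and not part of vLLM-Omni.

```python
import shutil


def missing_tools(tools=("ffmpeg", "sox"), which=shutil.which):
    """Return the names of required command-line tools that are not on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("ffmpeg and sox are both available")
```

The which parameter is injectable only so the check can be exercised without touching the real PATH.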

Install vLLM (Platform-Specific)

For x86_64 Machines (CUDA 13.0)

uv pip install \
  https://github.com/vllm-project/vllm/releases/download/v0.16.0/vllm-0.16.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl \
  --extra-index-url https://download.pytorch.org/whl/cu130 \
  --index-strategy unsafe-best-match

For ARM64 Platforms: DGX Spark & Jetson Thor (CUDA 13.0)


uv pip install \
  https://github.com/vllm-project/vllm/releases/download/v0.16.0/vllm-0.16.0+cu130-cp38-abi3-manylinux_2_35_aarch64.whl \
  --extra-index-url https://download.pytorch.org/whl/cu130 \
  --index-strategy unsafe-best-match
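
If the same setup script has to run on both kinds of machine, the wheel can be picked from the CPU architecture. The sketch below is one way to do that; the two URLs are copied from the commands above, while the function name vllm_wheel_url and the arm64/amd64 aliases are assumptions added for illustration.

```python
import platform

# Wheel URLs copied from the install commands above (vLLM v0.16.0, CUDA 13.0).
BASE = "https://github.com/vllm-project/vllm/releases/download/v0.16.0"
WHEELS = {
    "x86_64": f"{BASE}/vllm-0.16.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl",
    "aarch64": f"{BASE}/vllm-0.16.0+cu130-cp38-abi3-manylinux_2_35_aarch64.whl",
}


def vllm_wheel_url(machine=None):
    """Return the wheel URL for a CPU architecture (default: this host)."""
    machine = (machine or platform.machine()).lower()
    # Assumption: some systems report "arm64" or "amd64" instead.
    machine = {"arm64": "aarch64", "amd64": "x86_64"}.get(machine, machine)
    if machine not in WHEELS:
        raise ValueError(f"No wheel listed in this guide for: {machine}")
    return WHEELS[machine]
```

The returned URL takes the place of the wheel path in the uv pip install command; the --extra-index-url and --index-strategy flags stay the same on both platforms.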

Build vLLM-Omni from Source

Required for latest features and custom modifications:

git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni

The fa3-fwd package does not provide aarch64 wheels. If you’re on DGX Spark or Jetson Thor:

  1. Edit vllm-omni/requirements/cuda.txt
  2. Remove or comment out: fa3-fwd==0.0.2
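
The edit described in the two steps above can also be scripted. The following is a minimal sketch, assuming it is run from the vllm-omni repository root and that the requirement sits on its own line; the helper names are hypothetical.

```python
from pathlib import Path


def comment_out_requirement(text, package="fa3-fwd"):
    """Comment out every active requirement line that starts with `package`."""
    out = []
    for line in text.splitlines():
        if line.strip().startswith(package):
            out.append("# " + line)  # already-commented lines start with "#"
        else:
            out.append(line)
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")


def patch_requirements(path="requirements/cuda.txt"):
    """Rewrite the requirements file in place (path relative to the repo root)."""
    req = Path(path)
    req.write_text(comment_out_requirement(req.read_text()))
```

Calling patch_requirements() once before the editable install has the same effect as the manual edit; running it twice is harmless because commented lines are left alone.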

Then install the package in editable mode:

uv pip install -e .

Install Flash Attention

  • Flash Attention 3 (fa3-fwd) is not compatible with Blackwell GPUs.
  • Flash Attention 2 (flash-attn) remains the recommended backend for Blackwell and ARM64 platforms.
  • If you see:
    WARNING: No Flash Attention backend found, using pytorch SDPA implementation
    → Flash Attention was not installed correctly.
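
Before starting the server, it can be worth checking whether the package is visible in the active environment at all. The sketch below only tests that the flash_attn module (the import name of the flash-attn package) can be located; it does not prove that the compiled CUDA extension loads, so treat a positive result as necessary rather than sufficient.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module `name` can be located."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if has_module("flash_attn"):
        print("flash_attn found")
    else:
        print("flash_attn not found; expect the SDPA fallback warning")
```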

Install Flash Attention 2 from Source

# Clone the official repository
git clone --depth=1 https://github.com/Dao-AILab/flash-attention ./flash-attention
cd flash-attention

# Set build environment variables
export MAX_JOBS=16          # Limit parallel compilation jobs (adjust based on RAM/CPU; use 4 on DGX Spark and Jetson Thor)
export NVCC_THREADS=2       # Reduce NVCC thread count to avoid OOM during build
export FLASH_ATTENTION_FORCE_BUILD="TRUE"

# Install without build isolation for better compatibility with uv
uv pip install -v --no-build-isolation .

Build time may vary: ~15 minutes on high-end hardware. Use MAX_JOBS=4 on Jetson Thor and DGX Spark.
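
The build variables can be assembled programmatically so that the lower job count is applied automatically on ARM64 hosts. This is a sketch under one stated assumption: an aarch64 machine is treated as a stand-in for DGX Spark or Jetson Thor. The function name flash_attn_build_env is illustrative.

```python
import os
import platform


def flash_attn_build_env(machine=None, base_env=None):
    """Return a copy of the environment with this guide's build variables set."""
    machine = (machine or platform.machine()).lower()
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: aarch64/arm64 means DGX Spark or Jetson Thor, where the
    # guide recommends MAX_JOBS=4; other hosts keep the guide's value of 16.
    env["MAX_JOBS"] = "4" if machine in ("aarch64", "arm64") else "16"
    env["NVCC_THREADS"] = "2"
    env["FLASH_ATTENTION_FORCE_BUILD"] = "TRUE"
    return env
```

The resulting dictionary can be passed as env= to subprocess.run when invoking the uv pip install command from the flash-attention directory.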

Start the inference server with the Qwen3-TTS model. Run this from the vllm-omni repository root, because the stage config path is relative to it:

vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
    --omni \
    --port 8091 \
    --trust-remote-code \
    --enforce-eager
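
Model loading takes a while, so a client script may want to wait until the server answers before sending requests. The sketch below polls the /health route, which appears in the server's startup log, on the port used above. The timeout and interval values, and the function names, are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(url="http://localhost:8091/health", timeout_s=600.0,
                       interval_s=2.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

wait_until_healthy() returns True as soon as the route responds and False if the deadline passes first; probe and sleep are parameters only so the loop can be tested without a running server.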

In a second terminal (with the virtual environment activated), navigate to the example directory:

cd vllm-omni/examples/online_serving/qwen3_tts

Basic TTS Generation

python openai_speech_client.py \
    --text "If you must run natively, you can attempt to build the package from source." \
    --voice vivian \
    --language English
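
The same request can be sent without the example client. The sketch below posts to the /v1/audio/speech route listed in the server log. The input and voice fields follow the OpenAI speech API; the language field is an assumption that mirrors the client's --language flag, and writing the response to a .wav file assumes the server returns WAV audio. Check the example client's source for the authoritative field names.

```python
import json
import urllib.request


def build_speech_request(text, voice="vivian", language="English",
                         model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
                         base_url="http://localhost:8091"):
    """Build a POST request for the speech route (payload fields partly assumed)."""
    payload = {
        "model": model,
        "input": text,         # OpenAI speech API field
        "voice": voice,        # OpenAI speech API field
        "language": language,  # assumption: mirrors the --language CLI flag
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="output.wav", **kwargs):
    """Send the request and write the returned audio bytes to `out_path`."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

With the server running, synthesize("Hello from DGX Spark") would write the audio to output.wav in the current directory.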

Voice Cloning (Base Model)

python openai_speech_client.py \
    --model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
    --task-type Base \
    --text "Hello, this is a cloned voice" \
    --ref-audio /path/to/reference.wav \
    --ref-text "Original transcript of the reference audio"

Expected server log output for the basic TTS request:

(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/audio/speech, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/audio/voices, Methods: GET
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/images/generations, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/images/edits, Methods: POST
(APIServer pid=14365) INFO 02-21 16:24:29 [launcher.py:47] Route: /v1/videos, Methods: POST
(APIServer pid=14365) INFO:     Started server process [14365]
(APIServer pid=14365) INFO:     Waiting for application startup.
(APIServer pid=14365) INFO:     Application startup complete.
(APIServer pid=14365) INFO 02-21 16:25:04 [serving_speech.py:329] TTS speech request speech-93e6b197b84010fc: text='If you must run natively, you can attempt to build...', task_type=CustomVoice
(APIServer pid=14365) INFO 02-21 16:25:04 [async_omni.py:316] [AsyncOrchestrator] Entering scheduling loop: stages=2, final_stage=1
(Worker pid=15018) [Stage-1] INFO 02-21 16:25:07 [qwen3_tts_code2wav.py:183] Code2Wav codec: frames=25 q=16 uniq=356 range=[2,2021] head=[[1995, 1159, 355, 22, 1174, 1093, 625, 1814], [1028, 1800, 261, 826, 911, 1164, 1381, 1610]]
[Stage-1] WARNING 02-21 16:25:10 [output_processor.py:127] Error concatenating tensor for key sr; keeping last tensor
(APIServer pid=14365) INFO:     127.0.0.1:60310 - "POST /v1/audio/speech HTTP/1.1" 200 OK

This setup has been validated on DGX Spark with CUDA 13.0.

Nice, thanks for sharing it @shahizat

Thanks for sharing, I will move this to GB10 Projects

Thanks!

I’m on an NVIDIA DGX Spark, and I wanted to report that the setup stops with errors. I commented out fa3-fwd==0.0.2, but the ‘Install Flash Attention’ step fails at the command uv pip install -v --no-build-isolation . with this error:

~/repos/vllm-omni/vllm-omni/flash-attention/build/temp.linux-aarch64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_fp16_sm80.o
      -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__
      -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__
      --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3
      -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__
      -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__
      --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math
      -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90
      -gencode arch=compute_100f,code=sm_100 -gencode
      arch=compute_120f,code=sm_120 -gencode arch=compute_110f,code=sm_110
      -gencode arch=compute_120,code=compute_120 --threads 2
      -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
      ninja: build stopped: subcommand failed.

      [stderr]
      ~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/_vendor/wheel/bdist_wheel.py:4:
      FutureWarning: The 'wheel' package is no longer the canonical location
      of the 'bdist_wheel' command, and will be removed in a future release.
      Please update to setuptools v70.1 or later which contains an integrated
      version of this command.
        warn(
      ~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/dist.py:765:
      SetuptoolsDeprecationWarning: License classifiers are deprecated.
      !!

      
      ********************************************************************************
              Please consider removing the following classifiers in favor of a
      SPDX license expression:

              License :: OSI Approved :: BSD License

              See
      https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license
      for details.
      
      ********************************************************************************

      !!
        self._finalize_license_expression()
      W0223 12:22:44.547000 1072119 torch/utils/cpp_extension.py:535] There are
      no aarch64-linux-gnu-g++ version bounds defined for CUDA version 13.0
      Traceback (most recent call last):
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py",
      line 2693, in _run_ninja_build
          subprocess.run(
        File "/usr/lib/python3.12/subprocess.py", line 571, in run
          raise CalledProcessError(retcode, process.args,
      subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '16']'
      returned non-zero exit status 255.

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "<string>", line 11, in <module>
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/build_meta.py",
      line 439, in build_wheel
          return _build(['bdist_wheel', '--dist-info-dir',
      str(metadata_directory)])
      
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/build_meta.py",
      line 427, in _build
          return self._build_with_temp_dir(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/build_meta.py",
      line 408, in _build_with_temp_dir
          self.run_setup()
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/build_meta.py",
      line 518, in run_setup
          super().run_setup(setup_script=setup_script)
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/build_meta.py",
      line 317, in run_setup
          exec(code, locals())
        File "<string>", line 596, in <module>
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/__init__.py",
      line 117, in setup
          return distutils.core.setup(**attrs)  # type: ignore[return-value]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/_distutils/core.py",
      line 186, in setup
          return run_commands(dist)
                 ^^^^^^^^^^^^^^^^^^
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/_distutils/core.py",
      line 202, in run_commands
          dist.run_commands()
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/_distutils/dist.py",
      line 1002, in run_commands
          self.run_command(cmd)
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/dist.py",
      line 1107, in run_command
          super().run_command(command)
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/_distutils/dist.py",
      line 1021, in run_command
          cmd_obj.run()
        File "<string>", line 543, in run
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/command/bdist_wheel.py",
      line 370, in run
          self.run_command("build")
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/_distutils/cmd.py",
      line 357, in run_command
          self.distribution.run_command(command)
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/dist.py",
      line 1107, in run_command
          super().run_command(command)
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/_distutils/dist.py",
      line 1021, in run_command
          cmd_obj.run()
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/_distutils/command/build.py",
      line 135, in run
          self.run_command(cmd_name)
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/_distutils/cmd.py",
      line 357, in run_command
          self.distribution.run_command(command)
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/dist.py",
      line 1107, in run_command
          super().run_command(command)
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/_distutils/dist.py",
      line 1021, in run_command
          cmd_obj.run()
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/command/build_ext.py",
      line 97, in run
          _build_ext.run(self)
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py",
      line 368, in run
          self.build_extensions()
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py",
      line 1143, in build_extensions
          build_ext.build_extensions(self)
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py",
      line 484, in build_extensions
          self._build_extensions_serial()
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py",
      line 510, in _build_extensions_serial
          self.build_extension(ext)
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/command/build_ext.py",
      line 262, in build_extension
          _build_ext.build_extension(self, ext)
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py",
      line 565, in build_extension
          objects = self.compiler.compile(
                    ^^^^^^^^^^^^^^^^^^^^^^
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py",
      line 900, in unix_wrap_ninja_compile
          _write_ninja_file_and_compile_objects(
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py",
      line 2297, in _write_ninja_file_and_compile_objects
          _run_ninja_build(
        File
      "~/repos/vllm-omni/.venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py",
      line 2710, in _run_ninja_build
          raise RuntimeError(message) from e
      RuntimeError: Error compiling objects for extension

      hint: This usually indicates a problem with the package or the build
      environment.
DEBUG Released lock at `~/repos/vllm-omni/.venv/.lock`
DEBUG Released lock at `~/.cache/uv/.lock`

As a side note, I don’t know if it’s relevant, but the build also fails on (or ignores) the first few steps of the 70+ step process. For example, this is the first one:

~/repos/vllm-omni/.venv/lib/python3.12/site-packages/torch/include/torch/csrc/python_headers.h:13:10: fatal error: Python.h: No such file or directory
DEBUG    13 | #include <Python.h>
DEBUG       |          ^~~~~~~~~~
DEBUG compilation terminated.

Since it’s torch, I feel it should matter, but the installer keeps going.

Thanks for this thread.

Just a few notes while I’m working through it.

Since vLLM 0.16.0 is a pre-release, I used the command below instead:

uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly # add variant subdirectory here if needed

I have the same issue: installing Flash Attention failed. Getting Flash Attention working on DGX Spark is insanely difficult. I’m trying to build it with:

uv pip install -e . --no-build-isolation

Not sure if it works. Hopefully this gives someone an idea of how to fix the problem.

Use these environment variables on DGX Spark; otherwise, the build may run out of memory (OOM).

Hello, try to run:

sudo apt install python3-dev python3.12-dev
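
To confirm that the headers are visible to the interpreter used for the build, one option is to ask Python where it expects Python.h and check that the file exists. This is a small sketch using the standard sysconfig module; the function name python_header is illustrative.

```python
import sysconfig
from pathlib import Path


def python_header(include_dir=None):
    """Return the expected path of Python.h for the running interpreter."""
    include_dir = include_dir or sysconfig.get_paths()["include"]
    return Path(include_dir) / "Python.h"


if __name__ == "__main__":
    header = python_header()
    status = "found" if header.is_file() else "MISSING"
    print(f"{header}: {status}")
```

Run it with the virtual environment's Python; if it reports the header as missing, the "Python.h: No such file or directory" error from the build is likely to recur.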

Hello, this might be an even better way: GitHub - andimarafioti/faster-qwen3-tts: Real-time text-to-speech with Qwen3-TTS

There is another solution from a different user here: