XTTS in a Docker container on the DGX Spark

Hey guys,
I would love to be able to talk to my AI, so I'm looking for a nice text-to-speech engine to run locally.
Something similar to Alexa, Siri, Google Gemini, ChatGPT… J.A.R.V.I.S.

I'm looking for a TTS solution with low latency that supports multiple languages and voices and can easily be integrated into different applications such as Open WebUI or Home Assistant.

The first recommendation I got from an AI was XTTS.

I've spent the last few days trying to get XTTS running in a Docker container on the DGX Spark, without success.

It runs slowly, and only for the first sentence, on the CPU, but not on the GPU at all.

docker-compose.yml.txt (753 Bytes)

Dockerfile.txt (4.8 KB)

With the help of an AI I put this together.

Has anyone successfully compiled FlashAttention or DeepSpeed kernels specifically for sm_120 on aarch64? Most current wheels only support up to sm_90.

  1. Is there a recommended TORCH_CUDA_ARCH_LIST specifically for Spark nodes? We are currently using 12.1a or 12.0, but we are seeing inconsistent JIT behavior.

  2. Is there an official NVIDIA-optimized fork of Coqui TTS that handles the precision requirements of Blackwell more gracefully than the unmaintained original repo?
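
While waiting for an official answer on question 1, one way to avoid guessing the arch string is to derive it from the device itself instead of hard-coding it. This is only a sketch (the helper names are mine, not part of any NVIDIA tooling); on a GB10 Spark it should report 12.1:

```python
import importlib.util


def arch_entry(major: int, minor: int, ptx: bool = False) -> str:
    """Format a compute capability as a TORCH_CUDA_ARCH_LIST entry,
    e.g. (12, 1) -> "12.1", or "12.1+PTX" when ptx=True."""
    entry = f"{major}.{minor}"
    return entry + "+PTX" if ptx else entry


def detected_arch_list() -> str:
    """Ask PyTorch for the compute capability of GPU 0, if any."""
    if importlib.util.find_spec("torch") is None:
        return ""
    import torch
    if not torch.cuda.is_available():
        return ""
    return arch_entry(*torch.cuda.get_device_capability(0))


if __name__ == "__main__":
    print(detected_arch_list() or "no CUDA device visible")
```

Running this inside the container (docker exec) also doubles as a check that the container sees the GPU at all.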

The Issue: there is a hardware/software mismatch when deploying XTTS-v2 on the new Blackwell architecture. Despite using the latest NGC containers, I face three catastrophic blockers:

1. Numerical Instability (The “Blackwell Noise” Artifact): When running inference, the model generates valid audio for ~2 seconds before devolving into full-scale white noise/static.

  • Hypothesis: The way the Blackwell Tensor Cores (sm_120) handle FP16/BF16 precision causes a numerical underflow in torchaudio.functional.spectrogram calculations during silent frames, leading to NaNs that destabilize the HiFi-GAN vocoder.

  • Attempted Fix: Manually patching torchaudio to include an epsilon clamp (1e-5) in the normalization denominator.

2. Modern Transformer Import Deadlock: The legacy XTTS code relies on transformers.generation.utils which has moved in version 4.48+.

  • Error: ImportError: cannot import name 'BeamSearchScorer' from 'transformers'.

  • Context: The GenerationMixin class is no longer automatically mixed into GPT2-based models in the 2026 stack, causing AttributeError: 'GPT2InferenceModel' object has no attribute 'generate'.

3. Dependency Version Gate: The coqui-ai/Trainer dependency has a hard-coded check for Python < 3.12, which causes metadata-generation-failed during build on Ubuntu 24.04/Python 3.12.3.
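
For illustration, the epsilon clamp from issue 1 amounts to something like this. This is a NumPy stand-in for the idea, not the actual torchaudio internals, and safe_normalize is a made-up name for the concept:

```python
import numpy as np


def safe_normalize(spec: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each spectrogram frame by its peak magnitude,
    clamping the denominator to eps so that all-silent frames
    cannot divide by zero and emit NaNs downstream."""
    denom = np.maximum(np.abs(spec).max(axis=-1, keepdims=True), eps)
    return spec / denom
```

With the clamp, a fully silent frame normalizes to zeros instead of NaN, which is exactly the failure mode suspected in the vocoder.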

Sorry, this does not work either, as it is processed on the CPU and not the GPU.
So far I have failed to get any TTS system running on the Spark in a Docker container.

I have spent weeks trying different approaches. Here is one attempt:

Hey everyone, I wanted to share my "Frankenstein" build (created with the help of AI): a currently working version of a Dockerized XTTS-v2 setup specifically targeting **NVIDIA Blackwell (sm_121)** architectures. Since the original Coqui TTS repository is no longer actively maintained for modern dependencies, this setup uses several custom hot-fix patches to allow it to run on Python 3.12+, Transformers 5.0, and PyTorch 2.6+.

What’s inside?

  • Blackwell Optimization: built against flashinfer and torchcodec for maximum inference speed on DGX Spark.
  • Dependency Shims: patches setup.py and requirements.txt on the fly to bypass outdated Python version locks (<3.12) and strict pandas requirements.
  • Architectural Fixes:
    • Shims GPT2InferenceModel to work with the modern GenerationMixin.
    • Fixes torchaudio.load issues by switching to soundfile for better memory alignment on H100/Blackwell.
    • Resolves the PyTorch 2.6 “weights only” pickle error.

0. Architecture Overview

This setup bridges the gap between the legacy Coqui code and the high-throughput requirements of Blackwell GPUs.

1. Directory Setup

Prepare your workspace by replacing YOUR_SPARKY_NAME with your actual system username.

mkdir -p /home/YOUR_SPARKY_NAME/Docker/xtts2/Build
mkdir -p /home/YOUR_SPARKY_NAME/Docker/xtts2/models
mkdir -p /home/YOUR_SPARKY_NAME/Docker/xtts2/speakers
cd /home/YOUR_SPARKY_NAME/Docker/xtts2/Build

2. Configuration Files

Create the following four files inside your /Build folder:

A. Dockerfile

This uses a high-speed uv installer and handles the Blackwell source builds for flashinfer.


# syntax=docker/dockerfile:1
# Base Image: Specialized for DGX Spark / Blackwell
FROM scitrera/dgx-spark-vllm:0.14.1-t5

ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
# Blackwell native arch for the GB10 is sm_121, so the matching
# TORCH_CUDA_ARCH_LIST entry is "12.1" ("13.0" is a CUDA toolkit
# version, not a compute capability)
ENV TORCH_CUDA_ARCH_LIST="12.1"
ENV CUDA_HOME="/usr/local/cuda"

WORKDIR /app

# 1. System Basics & High-Speed Installer (uv + WheelNext)
# Ensure git and build tools are present
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    libsndfile1-dev \
    portaudio19-dev \
    ffmpeg \
    cmake \
    ninja-build \
    git \
    curl \
    pkg-config \
    && rm -rf /var/lib/apt/lists/*

# Install uv with WheelNext plugin
RUN curl -LsSf https://astral.sh/uv/install.sh | \
    INSTALLER_DOWNLOAD_URL=https://wheelnext.astral.sh/v0.0.2 sh

# Install Rust Toolchain for sudachipy
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

ENV PATH="/root/.cargo/bin:${PATH}"

# 2. Hardware-Aware Dependency Check & Install
# We verify if flashinfer and torchcodec are present (from base image). 
# If not, we build them.
# Note: uv pip install --system is used to install into the global environment.

ENV BUILD_AGAINST_ALL_FFMPEG_FROM_S3=1
ENV I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1
ENV PYTHONWARNINGS=ignore::SyntaxWarning

# Install build-time dependencies explicitly
RUN uv pip install --system pybind11 Cython setuptools-rust

RUN echo "Checking pre-installed Blackwell dependencies..." && \
    if python3 -c "import flashinfer" 2>/dev/null; then \
    echo "FlashInfer already installed."; \
    else \
    echo "FlashInfer not found. Starting Source Build for Blackwell..."; \
    uv pip install --system "flashinfer>=0.6.2" || \
    ( \
    echo "Binary install failed. Cloning and building flashinfer..."; \
    git clone --recursive https://github.com/flashinfer-ai/flashinfer.git /tmp/flashinfer && \
    cd /tmp/flashinfer && \
    uv pip install --system --no-build-isolation . \
    ); \
    fi && \
    if python3 -c "import torchcodec" 2>/dev/null; then \
    echo "TorchCodec already installed."; \
    else \
    echo "TorchCodec not found. Starting Source Build..."; \
    uv pip install --system torchcodec || \
    ( \
    echo "Binary install failed. Cloning and building torchcodec..."; \
    git clone --recursive https://github.com/pytorch/torchcodec.git /tmp/torchcodec && \
    cd /tmp/torchcodec && \
    export pybind11_DIR=$(python3 -m pybind11 --cmakedir) && \
    python3 -m pip install --no-cache-dir --no-build-isolation . \
    ); \
    fi

# 3. Coqui TTS Source Build (Bypass Metadata-Lock)
RUN git clone https://github.com/coqui-ai/TTS.git /app/TTS
WORKDIR /app/TTS

# Unlock dependencies in setup.py to allow Python 3.12+ and modern Transformers
# Unlock dependencies in setup.py - Robust Patching using Python
COPY patch_setup.py /app/TTS/patch_setup.py
RUN python3 /app/TTS/patch_setup.py

# Pre-install modern pandas to avoid source build of legacy version
RUN uv pip install --system pandas

# Install via uv --system
RUN uv pip install --system --no-build-isolation -e . && \
    uv pip install --system xtts-api-server

# 4. Apply The Blackwell Patch
COPY patch_xtts.py /app/patch_xtts.py
RUN python3 /app/patch_xtts.py

# 5. Blackwell Performance Tuning
ENV PYTORCH_ALLOC_CONF="expandable_segments:True"
ENV VLLM_ATTENTION_BACKEND="FLASHINFER"

# 6. Entrypoint
EXPOSE 8020
CMD ["python3", "-m", "xtts_api_server", "--listen", "-p", "8020", "-d", "cuda"]

Technical Deep Dive: Why the Patches?

If you’re wondering what the scripts are actually doing under the hood, here is the breakdown:

| Script | Purpose | Why it's needed |
|---|---|---|
| patch_setup.py | Version unlock | Coqui's original metadata hard-locks Python to <3.12. This script neutralizes those checks to allow installation on modern distros. |
| patch_xtts.py | Transformers 5.0 shim | Transformers 5.0 moved many generation utilities into sub-packages. This shim injects the correct imports so the XTTS streamer doesn't crash on import. |
| patch_audio_loading | Memory alignment | Blackwell GPUs (sm_121) can have strict alignment requirements. Replacing torchaudio.load with soundfile prevents potential bus errors during audio processing. |
| fix_torch_load | Pickle security fix | PyTorch 2.6+ defaults to weights_only=True. Since Coqui models use legacy pickles, this patch sets it to False so the weights actually load. |

B. patch_setup.py

This script runs during the build to unlock Python 3.12 support.


import sys
import os
import re

print('--- SETUP.PY BEFORE PATCH ---')
if os.path.exists('setup.py'):
    initial_content = open('setup.py').read()
    # Print lines with pandas
    print([line for line in initial_content.split('\n') if 'pandas' in line])
    
    c = initial_content
    # Attempt to replace conditions
    c = c.replace('sys.version_info < (3, 9)', 'False')
    c = c.replace('sys.version_info >= (3, 12)', 'False')
    
    # Aggressively neutralize the specific error
    c = c.replace('raise RuntimeError("TTS requires python >= 3.9 and < 3.12', 'pass # raise RuntimeError("TTS requires python >= 3.9 and < 3.12')
    c = c.replace("raise RuntimeError('TTS requires python >= 3.9 and < 3.12", "pass # raise RuntimeError('TTS requires python >= 3.9 and < 3.12")
    
    c = c.replace('python_requires=">=3.9, <3.12"', 'python_requires=">=3.9"')
    c = c.replace("python_requires='>=3.9, <3.12'", "python_requires='>=3.9'")

    # Relax pandas constraint for Py3.12 using Regex to catch variations like "pandas<2.0.0" or "pandas<=1.5.3"
    # Matches "pandas" followed by any constraint chars until quote or end of line (simplified)
    # pandas>=1.4,<2.0 -> pandas
    c = re.sub(r'pandas[<>=!0-9,.]+', 'pandas', c)
    
    open('setup.py', 'w').write(c)
    
    print('--- SETUP.PY AFTER PATCH ---')
    print([line for line in open('setup.py').read().split('\n') if 'pandas' in line])
    print('Patched setup.py version checks')
else:
    print("WARNING: setup.py not found!")

# Also patch requirements.txt if it exists
if os.path.exists('requirements.txt'):
    print('--- REQUIREMENTS.TXT BEFORE PATCH ---')
    r = open('requirements.txt').read()
    print([line for line in r.split('\n') if 'pandas' in line])
    
    # Regex replace for requirements.txt
    r = re.sub(r'pandas[<>=!0-9,.]+', 'pandas', r)
    
    open('requirements.txt', 'w').write(r)
    print('Patched requirements.txt')
    print('--- REQUIREMENTS.TXT AFTER PATCH ---')
    print([line for line in open('requirements.txt').read().split('\n') if 'pandas' in line])
else:
    print("requirements.txt not found, skipping.")
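
As a quick sanity check, here is the pandas-relaxation regex from the script applied to a sample requirements line, so you can see exactly what it strips:

```python
import re

sample = "pandas>=1.4,<2.0\nnumpy==1.26.4\n"
# Strip any version constraint that directly follows "pandas";
# other packages are left untouched.
patched = re.sub(r"pandas[<>=!0-9,.]+", "pandas", sample)
print(patched.splitlines())  # -> ['pandas', 'numpy==1.26.4']
```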

C. patch_xtts.py

This is the secret sauce: it injects code into the installed libraries to fix compatibility with Transformers 5.0 and Blackwell audio alignment.


import os
import sys
import re
import site
from transformers.modeling_utils import PreTrainedModel

# --- Helper Function ---
def patch_file(file_path, pattern, replacement, count=0):
    """
    Applies a regex patch to a file.
    """
    if not os.path.exists(file_path):
        print(f"Skipping {file_path}: File not found.")
        return False

    try:
        with open(file_path, 'r') as f:
            content = f.read()

        # NOTE: The regex substitutions below are written so that
        # re-running the script does not duplicate changes.

        new_content, n = re.subn(pattern, replacement, content, count=count, flags=re.MULTILINE)
        
        if n > 0:
            if new_content != content:
                with open(file_path, 'w') as f:
                    f.write(new_content)
                print(f"SUCCESS: Patched {file_path} ({n} occurrences)")
                return True
            else:
                 print(f"INFO: {file_path} matches but content unchanged (Already patched?)")
                 return True
        else:
            print(f"WARNING: Pattern not found in {file_path}")
            # print(f"Pattern: {pattern}") # Debug
            return False
            
    except Exception as e:
        print(f"ERROR: Failed to patch {file_path}: {e}")
        return False

# --- Patches ---

def shim_transformers():
    # Transformers 5.0 Compatibility Shim for Streaming Generator
    stream_file = '/app/TTS/TTS/tts/layers/xtts/stream_generator.py'
    if os.path.exists(stream_file):
        header = (
            "from transformers.generation.utils import GenerationMixin\n"
            "from transformers.generation.beam_search import BeamSearchScorer\n"
            "from transformers.generation.logits_process import LogitsProcessorList\n"
            "from transformers.generation.stopping_criteria import StoppingCriteriaList\n"
            "from transformers.generation.configuration_utils import GenerationConfig\n"
            "from transformers.modeling_utils import PreTrainedModel\n"
            "DisjunctiveConstraint = PhrasalConstraint = Constraint = ConstraintList = None\n"
        )
        
        # Simple read/write for this one as it's a full header injection
        with open(stream_file, 'r') as f:
            lines = f.readlines()
        
        # Check if already shimmed
        if len(lines) > 6 and "DisjunctiveConstraint =" in lines[6]:
             print(f"INFO: {stream_file} already shimmed.")
             return

        with open(stream_file, 'w') as f:
            f.write(header)
            skip = False
            for line in lines:
                # Remove old imports that might fail
                if "from transformers import (" in line:
                    skip = True
                    continue
                if skip and ")" in line:
                    skip = False
                    continue
                if not skip:
                    f.write(line)
        print(f"SUCCESS: Shimmed Transformers 5.0 imports in {stream_file}")

def fix_torchaudio_math():
    # Fix math domain error in Torchaudio (previously disabled, keeping for structure)
    pass

def patch_audio_loading():
    # Blackwell/H100 Audio Loading Fix (Alignment)
    # Replaces torchaudio.load with soundfile
    file_path = '/app/TTS/TTS/tts/models/xtts.py' 
    if os.path.exists(file_path):
        patch_file(
            file_path,
            r'audio, sample_rate = torchaudio\.load\(audio_path\)',
            'import soundfile as sf\n        audio_sf, sample_rate = sf.read(audio_path)\n        audio = torch.from_numpy(audio_sf).float().unsqueeze(0)\n        # audio, sample_rate = torchaudio.load(audio_path)'
        )
    else:
        print(f"WARNING: {file_path} not found for audio loading patch")

def shim_gpt2_inference():
    # Shim GPT2InferenceModel to inherit from GenerationMixin
    target = '/app/TTS/TTS/tts/layers/xtts/gpt.py'
    patch_file(
        target,
        r'class GPT2InferenceModel\(nn\.Module\):',
        'class GPT2InferenceModel(nn.Module, GenerationMixin):'
    )

def block_xtts_update():
    # Block xtts-api-server from trying to "upgrade" TTS
    try:
        import xtts_api_server
        site_packages = os.path.dirname(xtts_api_server.__file__)
        target_file = os.path.join(site_packages, "modeldownloader.py")
        
        # Regex to disable the function body
        # Matches: def upgrade_tts_package(): (newline) (indent) code...
        # We find the definition and inject an early return.
        
        patch_file(
            target_file,
            r'def upgrade_tts_package\(\):',
            'def upgrade_tts_package():\n    print("PATCH: Auto-update disabled"); return'
        )
        
    except ImportError:
        print("WARNING: Could not import xtts_api_server to find path.")
    except Exception as e:
        print(f"ERROR: block_xtts_update failed: {e}")

def fix_torch_load():
    # Fix PyTorch 2.6 Weights Only Pickle Error
    target_file = "/app/TTS/TTS/utils/io.py"
    
    # Robust Regex for torch.load call
    # Matches: torch.load(f, map_location=map_location, **kwargs)
    # allowing for some whitespace variation
    pattern = r'torch\.load\(\s*f,\s*map_location=map_location,\s*\*\*kwargs\s*\)'
    replacement = 'torch.load(f, map_location=map_location, weights_only=False, **kwargs)'
    
    patch_file(target_file, pattern, replacement)

if __name__ == "__main__":
    print(f"Applying Blackwell patches... (Python {sys.version})")
    
    shim_transformers()
    patch_audio_loading()
    shim_gpt2_inference()
    block_xtts_update()
    fix_torch_load()
    
    print("All patches applied!")

D. docker-compose.yml

Configured with ipc: host and privileged: true to ensure NVLink performance on DGX systems.


version: "3.8"

services:
  xtts-api:
    image: coqui-xtts-for-dgx-spark
    container_name: xtts-voice-8020
    # Blackwell GPUs require high-speed IPC for NVLink
    ipc: host

    # Port Mapping: Host:Container
    # Note: 8020 is the default internal port for xtts-api-server
    ports:
      - "8020:8020"

    privileged: true

    environment:
      - PYTORCH_ALLOC_CONF=expandable_segments:True
      # vLLM/Transformers 5.0 Blackwell specific optimizations
      - VLLM_ATTENTION_BACKEND=FLASHINFER
      # GPU Selection
      - CUDA_VISIBLE_DEVICES=all
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
      # Coqui & Torch Settings
      - COQUI_TOS_AGREED=1
      - TORCH_ALLOW_ANY_PICKLE=1
      - TORCH_CUDNN_V8_API_ENABLED=1
      - XTTS_USE_DEEPSPEED=true

    # Legacy runtime support (optional, can be removed if using deploy block)
    runtime: nvidia

    restart: unless-stopped

    ulimits:
      memlock: -1
      stack: 67108864

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [ gpu ]

    # Keep container alive if running custom scripts
    tty: true
    stdin_open: true

    # Volume mappings for models and speakers
    # ENSURE THESE PATHS EXIST ON THE HOST - YOU HAVE TO ADJUST THEM TO YOUR NEEDS
    volumes:
      - /home/YOUR_SPARKY_NAME/Docker/xtts2/models:/app/TTS/models
      - /home/YOUR_SPARKY_NAME/Docker/xtts2/speakers:/app/TTS/speakers

3. Installation & Usage

  1. Add Speaker Samples: place .wav voice samples in your /speakers folder. You will have to download or record some WAV voice samples yourself (e.g. from public voice-model or voice-clip libraries).

  2. Build the Image:

docker build -t coqui-xtts-for-dgx-spark .

  3. Launch:

docker-compose up -d

  4. Verify: monitor the logs until you see the Uvicorn startup message:

docker logs -f xtts-voice-8020

Wait until you see
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8020 (Press CTRL+C to quit)

4. Testing the API

Once the server is up, test it with a simple curl command:

curl -X POST "http://localhost:8020/tts_to_audio/" \
    -H "Content-Type: application/json" \
    -d '{
        "text": "System operational. All systems green on NVIDIA Blackwell.",
        "speaker_wav": "YOUR_SAMPLE.wav",
        "language": "en"
    }' \
    --output test_blackwell.wav
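
If you prefer testing from Python (e.g. as a starting point for wiring it into Open WebUI or Home Assistant later), the same request looks roughly like this. The endpoint and JSON field names match the curl call above; the helper names and everything else are just a sketch:

```python
import json
from urllib import request


def build_tts_payload(text: str, speaker_wav: str, language: str = "en") -> dict:
    """Assemble the JSON body used by the /tts_to_audio/ endpoint."""
    return {"text": text, "speaker_wav": speaker_wav, "language": language}


def synthesize(url: str, payload: dict, out_path: str) -> None:
    """POST the payload and write the returned WAV bytes to disk."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())


if __name__ == "__main__":
    payload = build_tts_payload(
        "System operational. All systems green on NVIDIA Blackwell.",
        "YOUR_SAMPLE.wav",
    )
    print(json.dumps(payload))
    # With the container running:
    # synthesize("http://localhost:8020/tts_to_audio/", payload, "test_blackwell.wav")
```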

Next Steps:

This is far from perfect, but it is a solid version 1.0 that actually runs on the latest hardware. If anyone finds more optimizations for the VLLM_ATTENTION_BACKEND, please let me know! It took me several days to get it working.

It's really slow, though. I guess it still runs on the CPU and not on the GPU.
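
Before blaming the model, it is worth confirming inside the container whether PyTorch can see the GPU at all; a quick check along these lines (run via docker exec) settles the CPU-vs-GPU question. The function name is mine, it just wraps standard torch calls:

```python
import importlib.util


def cuda_status() -> str:
    """Report whether PyTorch is installed and whether CUDA is usable."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch
    if not torch.cuda.is_available():
        return "CUDA unavailable: inference will silently fall back to CPU"
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    return f"CUDA OK: {name} (sm_{major}{minor})"


if __name__ == "__main__":
    print(cuda_status())
```

If this does not print "CUDA OK", no amount of model-side patching will move inference off the CPU.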

OK, after about 192 failed versions, this is the conclusion so far (summarized with AI):

I’m currently struggling with a persistent linking issue while trying to deploy a Text-to-Speech stack (Coqui TTS/XTTS) on an NVIDIA DGX Spark node. Despite several workarounds, I’ve hit a wall with library resolution on the Blackwell architecture.

System Environment:

  • Hardware: DGX Spark (Grace ARM64 CPU + Blackwell GB10/SM121 GPU).
  • Base Image: vllm-dgx-spark-gb10 (optimized for Blackwell).
  • Python: 3.12.3
  • CUDA: 13.0 / 13.1

The Issue:

I am receiving a persistent OSError when importing torchaudio, even after installing the recommended cu130 wheels from the PyTorch index. The system cannot resolve the path to libtorch_cuda.so, even though the file is physically present in the site-packages.

Error Traceback:

OSError: libtorch_cuda.so: cannot open shared object file: No such file or directory
...
OSError: Could not load this library: /opt/vllm-env/lib/python3.12/site-packages/torchaudio/lib/libtorchaudio.so

What I’ve Attempted:

  1. Force Reinstall: Performed a clean uninstall of torch and torchaudio and re-installed using --index-url https://download.pytorch.org/whl/cu130.
  2. Environment Variables: Explicitly set LD_LIBRARY_PATH to include the torch/lib directory within the virtual environment.
  3. Linker Injection: Attempted to register the library paths via ldconfig and created physical symlinks in /usr/lib pointing to the .so files in the vllm-env.
  4. Hardware Stability Patches: Applied Epsilon-stability patches to torchaudio.functional to mitigate Blackwell-specific audio noise.

The Problem:

It appears that the torchaudio wheels available via the standard PyTorch index (even the cu130 variants) do not correctly map to the shared library structure of the specialized Blackwell environments on Grace-ARM64. Since NVIDIA has removed torchaudio from the standard PyTorch containers, there seems to be a “missing link” for audio-related tasks on this hardware.
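
One more workaround that sometimes helps with exactly this symptom: preloading the libtorch shared objects with RTLD_GLOBAL (or simply making sure import torch runs before import torchaudio), so the dynamic linker has the symbols resolved before libtorchaudio.so is opened. A sketch, with the helper name mine:

```python
import ctypes
import importlib.util
import os


def preload_torch_libs(torch_lib_dir=None):
    """Open libtorch shared objects with RTLD_GLOBAL so later loads
    (e.g. torchaudio's libtorchaudio.so) can resolve their symbols.
    Returns the names of libraries actually loaded."""
    if torch_lib_dir is None:
        spec = importlib.util.find_spec("torch")
        if spec is None or spec.origin is None:
            return []
        torch_lib_dir = os.path.join(os.path.dirname(spec.origin), "lib")
    loaded = []
    for name in ("libtorch_cuda.so", "libtorch.so"):
        path = os.path.join(torch_lib_dir, name)
        if os.path.exists(path):
            ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
            loaded.append(name)
    return loaded


if __name__ == "__main__":
    try:
        print("preloaded:", preload_torch_libs())
    except OSError as e:
        print("preload failed:", e)
```

Call this (or just import torch) before importing torchaudio. If the plain import-ordering fix already works, the ctypes dance is unnecessary.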

Questions / The Ask:

  1. Is there an officially sanctioned NVIDIA build of torchaudio specifically for SM121 (Blackwell) on ARM64?
  2. How should shared library resolution be handled on the DGX Spark when working within a managed vLLM environment to prevent ctypes from losing track of libtorch_cuda.so?
  3. Are there plans to include audio-compute support back into the Blackwell NGC containers?

Any insights or specific library path configurations for the Grace/Blackwell combo would be greatly appreciated.