UPDATE: Unfortunately this does not work either, as inference ends up running on the CPU instead of the GPU.
So far I have failed to get any TTS system running on the Spark in a Docker container, despite weeks of attempts.
Here is one of those attempts.
Hey everyone, I wanted to share my AI-assisted “Frankenstein” build: a currently working version of a Dockerized XTTSv2 setup specifically optimized for **NVIDIA Blackwell (sm_130/sm_121)** architectures. Since the original Coqui TTS repository is no longer actively maintained for modern dependencies, this setup applies several custom “hot-fix” patches so it can run on Python 3.12+, Transformers 5.0, and PyTorch 2.6+.
What’s inside?
- **Blackwell Optimization:** Built against `flashinfer` and `torchcodec` for maximum inference speed on DGX Spark.
- **Dependency Shims:** Patches `setup.py` and `requirements.txt` on the fly to bypass outdated Python version locks (<3.12) and strict pandas requirements.
- **Architectural Fixes:**
  - Shims `GPT2InferenceModel` to work with the modern `GenerationMixin`.
  - Fixes `torchaudio.load` issues by switching to `soundfile` for better memory alignment on H100/Blackwell.
  - Resolves the PyTorch 2.6 “weights only” pickle error.
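For context on that last point: PyTorch 2.6 changed the default of `torch.load` to `weights_only=True`, which rejects the legacy pickles Coqui checkpoints use, and the patch below simply pins the default back. The idea can be sketched without torch installed (the `load` function here is a stand-in for `torch.load`, not the real API):

```python
import functools

# Stand-in for torch.load: PyTorch 2.6+ defaults weights_only to True,
# which rejects the legacy pickle format that Coqui checkpoints use.
def load(path, weights_only=True):
    return {"path": path, "weights_only": weights_only}

# The patch in this guide effectively pins weights_only=False
patched_load = functools.partial(load, weights_only=False)
```

Calling `patched_load("model.pth")` now behaves like the pre-2.6 default.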
0. Architecture Overview
This setup bridges the gap between the legacy Coqui code and the high-throughput requirements of Blackwell GPUs.
1. Directory Setup
Prepare your workspace by replacing YOUR_SPARKY_NAME with your actual system username.
mkdir -p /home/YOUR_SPARKY_NAME/Docker/xtts2/Build
mkdir -p /home/YOUR_SPARKY_NAME/Docker/xtts2/models
mkdir -p /home/YOUR_SPARKY_NAME/Docker/xtts2/speakers
cd /home/YOUR_SPARKY_NAME/Docker/xtts2/Build
2. Configuration Files
Create the following four files inside your /Build folder:
A. Dockerfile
This uses a high-speed uv installer and handles the Blackwell source builds for flashinfer.
Click to view Dockerfile
# syntax=docker/dockerfile:1
# Base Image: Specialized for DGX Spark / Blackwell
FROM scitrera/dgx-spark-vllm:0.14.1-t5
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
# Blackwell Native Arch sm_121 / sm_121a
# Using 13.0 for broader Blackwell support
ENV TORCH_CUDA_ARCH_LIST="13.0"
ENV CUDA_HOME="/usr/local/cuda"
WORKDIR /app
# 1. System Basics & High-Speed Installer (uv + WheelNext)
# Ensure git and build tools are present
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
libsndfile1-dev \
portaudio19-dev \
ffmpeg \
cmake \
ninja-build \
git \
curl \
pkg-config \
&& rm -rf /var/lib/apt/lists/*
# Install uv with WheelNext plugin
RUN curl -LsSf https://astral.sh/uv/install.sh | \
INSTALLER_DOWNLOAD_URL=https://wheelnext.astral.sh/v0.0.2 sh
# Install Rust Toolchain for sudachipy
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
# 2. Hardware-Aware Dependency Check & Install
# We verify if flashinfer and torchcodec are present (from base image).
# If not, we build them.
# Note: uv pip install --system is used to install into the global environment.
ENV BUILD_AGAINST_ALL_FFMPEG_FROM_S3=1
ENV I_CONFIRM_THIS_IS_NOT_A_LICENSE_VIOLATION=1
ENV PYTHONWARNINGS=ignore::SyntaxWarning
# Install build-time dependencies explicitly
RUN uv pip install --system pybind11 Cython setuptools-rust
RUN echo "Checking pre-installed Blackwell dependencies..." && \
if python3 -c "import flashinfer" 2>/dev/null; then \
echo "FlashInfer already installed."; \
else \
echo "FlashInfer not found. Starting Source Build for Blackwell..."; \
uv pip install --system "flashinfer>=0.6.2" || \
( \
echo "Binary install failed. Cloning and building flashinfer..."; \
git clone --recursive https://github.com/flashinfer-ai/flashinfer.git /tmp/flashinfer && \
cd /tmp/flashinfer && \
uv pip install --system --no-build-isolation . \
); \
fi && \
if python3 -c "import torchcodec" 2>/dev/null; then \
echo "TorchCodec already installed."; \
else \
echo "TorchCodec not found. Starting Source Build..."; \
uv pip install --system torchcodec || \
( \
echo "Binary install failed. Cloning and building torchcodec..."; \
git clone --recursive https://github.com/pytorch/torchcodec.git /tmp/torchcodec && \
cd /tmp/torchcodec && \
export pybind11_DIR=$(python3 -m pybind11 --cmakedir) && \
python3 -m pip install --no-cache-dir --no-build-isolation . \
); \
fi
# 3. Coqui TTS Source Build (Bypass Metadata-Lock)
RUN git clone https://github.com/coqui-ai/TTS.git /app/TTS
WORKDIR /app/TTS
# Unlock dependencies in setup.py (Python 3.12+ / modern Transformers) via robust Python patching
COPY patch_setup.py /app/TTS/patch_setup.py
RUN python3 /app/TTS/patch_setup.py
# Pre-install modern pandas to avoid source build of legacy version
RUN uv pip install --system pandas
# Install via uv --system
RUN uv pip install --system --no-build-isolation -e . && \
uv pip install --system xtts-api-server
# 4. Apply The Blackwell Patch
COPY patch_xtts.py /app/patch_xtts.py
RUN python3 /app/patch_xtts.py
# 5. Blackwell Performance Tuning
ENV PYTORCH_ALLOC_CONF="expandable_segments:True"
ENV VLLM_ATTENTION_BACKEND="FLASHINFER"
# 6. Entrypoint
EXPOSE 8020
CMD ["python3", "-m", "xtts_api_server", "--listen", "-p", "8020", "-d", "cuda"]
Technical Deep Dive: Why the Patches?
If you’re wondering what the scripts are actually doing under the hood, here is the breakdown:
| Script | Purpose | Why it’s needed |
|---|---|---|
| patch_setup.py | Version unlock | Coqui’s original metadata hard-locks Python to <3.12. This script neutralizes those checks to allow installation on modern distros. |
| patch_xtts.py | Transformers 5.0 shim | Transformers 5.0 moved many generation utilities into sub-packages. This shim injects the correct headers so the XTTS streamer doesn’t crash on import. |
| patch_audio_loading | Memory alignment | Blackwell GPUs (sm_121) can have strict alignment requirements. Replacing torchaudio.load with soundfile prevents potential bus errors during audio processing. |
| fix_torch_load | Pickle security fix | PyTorch 2.6+ defaults to weights_only=True. Since Coqui models use legacy pickles, this patch sets it to False so your weights actually load. |
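The GPT2InferenceModel shim relies on Python’s multiple inheritance: adding `GenerationMixin` as a second base class gives the model the `generate()` machinery that newer Transformers versions no longer bundle into every model. A toy illustration of the pattern (the class names here are stand-ins, not the real Transformers API):

```python
# Toy sketch of the mixin pattern behind shim_gpt2_inference:
# the mixin supplies generate(), the model supplies forward().
class FakeGenerationMixin:
    def generate(self):
        return f"generated via {self.forward()}"

class Model:  # stands in for the nn.Module-style base class
    def forward(self):
        return "logits"

# Equivalent of rewriting:
#   class GPT2InferenceModel(nn.Module) -> class GPT2InferenceModel(nn.Module, GenerationMixin)
class InferenceModel(Model, FakeGenerationMixin):
    pass

print(InferenceModel().generate())  # -> generated via logits
```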
B. patch_setup.py
This script runs during the build to unlock Python 3.12 support.
Click to view patch_setup.py
import sys
import os
import re

print('--- SETUP.PY BEFORE PATCH ---')
if os.path.exists('setup.py'):
    initial_content = open('setup.py').read()
    # Print lines with pandas
    print([line for line in initial_content.split('\n') if 'pandas' in line])
    c = initial_content
    # Attempt to replace conditions
    c = c.replace('sys.version_info < (3, 9)', 'False')
    c = c.replace('sys.version_info >= (3, 12)', 'False')
    # Aggressively neutralize the specific error
    c = c.replace('raise RuntimeError("TTS requires python >= 3.9 and < 3.12', 'pass # raise RuntimeError("TTS requires python >= 3.9 and < 3.12')
    c = c.replace("raise RuntimeError('TTS requires python >= 3.9 and < 3.12", "pass # raise RuntimeError('TTS requires python >= 3.9 and < 3.12")
    c = c.replace('python_requires=">=3.9, <3.12"', 'python_requires=">=3.9"')
    c = c.replace("python_requires='>=3.9, <3.12'", "python_requires='>=3.9'")
    # Relax the pandas constraint for Py3.12 using a regex that catches
    # variations like "pandas<2.0.0", "pandas<=1.5.3" or "pandas>=1.4,<2.0"
    c = re.sub(r'pandas[<>=!0-9,.]+', 'pandas', c)
    open('setup.py', 'w').write(c)
    print('--- SETUP.PY AFTER PATCH ---')
    print([line for line in open('setup.py').read().split('\n') if 'pandas' in line])
    print('Patched setup.py version checks')
else:
    print("WARNING: setup.py not found!")

# Also patch requirements.txt if it exists
if os.path.exists('requirements.txt'):
    print('--- REQUIREMENTS.TXT BEFORE PATCH ---')
    r = open('requirements.txt').read()
    print([line for line in r.split('\n') if 'pandas' in line])
    # Regex replace for requirements.txt
    r = re.sub(r'pandas[<>=!0-9,.]+', 'pandas', r)
    open('requirements.txt', 'w').write(r)
    print('Patched requirements.txt')
    print('--- REQUIREMENTS.TXT AFTER PATCH ---')
    print([line for line in open('requirements.txt').read().split('\n') if 'pandas' in line])
else:
    print("requirements.txt not found, skipping.")
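The pandas regex used above is easy to sanity-check in isolation; it strips any trailing version constraint, including combined ones:

```python
import re

# Same pattern as in patch_setup.py: drop any version pins after "pandas"
pattern = r"pandas[<>=!0-9,.]+"

for spec in ("pandas>=1.4,<2.0", "pandas<2.0.0", "pandas<=1.5.3"):
    print(re.sub(pattern, "pandas", spec))  # each prints just "pandas"
```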
C. patch_xtts.py
The “magic sauce”: it injects code into the installed libraries to fix compatibility with Transformers 5.0 and Blackwell audio alignment.
Click to view patch_xtts.py
import os
import sys
import re
import site
from transformers.modeling_utils import PreTrainedModel

# --- Helper Function ---
def patch_file(file_path, pattern, replacement, count=0):
    """
    Applies a regex patch to a file.
    """
    if not os.path.exists(file_path):
        print(f"Skipping {file_path}: File not found.")
        return False
    try:
        with open(file_path, 'r') as f:
            content = f.read()
        # Naive check whether the replacement content is already largely
        # present, to avoid redundant writes.
        if replacement[:20] in content and len(replacement) > 20 and "lambda" not in replacement:
            # print(f"INFO: {file_path} seems already patched.")
            pass
        new_content, n = re.subn(pattern, replacement, content, count=count, flags=re.MULTILINE)
        if n > 0:
            if new_content != content:
                with open(file_path, 'w') as f:
                    f.write(new_content)
                print(f"SUCCESS: Patched {file_path} ({n} occurrences)")
                return True
            else:
                print(f"INFO: {file_path} matches but content unchanged (already patched?)")
                return True
        else:
            print(f"WARNING: Pattern not found in {file_path}")
            # print(f"Pattern: {pattern}")  # Debug
            return False
    except Exception as e:
        print(f"ERROR: Failed to patch {file_path}: {e}")
        return False
# --- Patches ---
def shim_transformers():
    # Transformers 5.0 compatibility shim for the streaming generator
    stream_file = '/app/TTS/TTS/tts/layers/xtts/stream_generator.py'
    if os.path.exists(stream_file):
        header = (
            "from transformers.generation.utils import GenerationMixin\n"
            "from transformers.generation.beam_search import BeamSearchScorer\n"
            "from transformers.generation.logits_process import LogitsProcessorList\n"
            "from transformers.generation.stopping_criteria import StoppingCriteriaList\n"
            "from transformers.generation.configuration_utils import GenerationConfig\n"
            "from transformers.modeling_utils import PreTrainedModel\n"
            "DisjunctiveConstraint = PhrasalConstraint = Constraint = ConstraintList = None\n"
        )
        # Simple read/write for this one, as it's a full header injection
        with open(stream_file, 'r') as f:
            lines = f.readlines()
        # Check if already shimmed
        if len(lines) > 6 and "DisjunctiveConstraint =" in lines[6]:
            print(f"INFO: {stream_file} already shimmed.")
            return
        with open(stream_file, 'w') as f:
            f.write(header)
            skip = False
            for line in lines:
                # Remove old imports that might fail
                if "from transformers import (" in line:
                    skip = True
                    continue
                if skip and ")" in line:
                    skip = False
                    continue
                if not skip:
                    f.write(line)
        print(f"SUCCESS: Shimmed Transformers 5.0 imports in {stream_file}")
def fix_torchaudio_math():
    # Fix math domain error in torchaudio (previously disabled, kept for structure)
    pass

def patch_audio_loading():
    # Blackwell/H100 audio loading fix (alignment):
    # replaces torchaudio.load with soundfile
    file_path = '/app/TTS/TTS/tts/models/xtts.py'
    if os.path.exists(file_path):
        patch_file(
            file_path,
            r'audio, sample_rate = torchaudio\.load\(audio_path\)',
            'import soundfile as sf\n audio_sf, sample_rate = sf.read(audio_path)\n audio = torch.from_numpy(audio_sf).float().unsqueeze(0)\n # audio, sample_rate = torchaudio.load(audio_path)'
        )
    else:
        print(f"WARNING: {file_path} not found for audio loading patch")

def shim_gpt2_inference():
    # Shim GPT2InferenceModel to inherit from GenerationMixin
    target = '/app/TTS/TTS/tts/layers/xtts/gpt.py'
    patch_file(
        target,
        r'class GPT2InferenceModel\(nn\.Module\):',
        'class GPT2InferenceModel(nn.Module, GenerationMixin):'
    )

def block_xtts_update():
    # Block xtts-api-server from trying to "upgrade" TTS
    try:
        import xtts_api_server
        site_packages = os.path.dirname(xtts_api_server.__file__)
        target_file = os.path.join(site_packages, "modeldownloader.py")
        # Find the function definition and inject an early return.
        patch_file(
            target_file,
            r'def upgrade_tts_package\(\):',
            'def upgrade_tts_package():\n print("PATCH: Auto-update disabled"); return'
        )
    except ImportError:
        print("WARNING: Could not import xtts_api_server to find path.")
    except Exception as e:
        print(f"ERROR: block_xtts_update failed: {e}")

def fix_torch_load():
    # Fix the PyTorch 2.6 weights-only pickle error.
    target_file = "/app/TTS/TTS/utils/io.py"
    # Robust regex for the torch.load call, allowing some whitespace variation:
    #   torch.load(f, map_location=map_location, **kwargs)
    pattern = r'torch\.load\(\s*f,\s*map_location=map_location,\s*\*\*kwargs\s*\)'
    replacement = 'torch.load(f, map_location=map_location, weights_only=False, **kwargs)'
    patch_file(target_file, pattern, replacement)

if __name__ == "__main__":
    print(f"Applying Blackwell patches... (Python {sys.version})")
    shim_transformers()
    patch_audio_loading()
    shim_gpt2_inference()
    block_xtts_update()
    fix_torch_load()
    print("All patches applied!")
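The `fix_torch_load` regex can likewise be verified standalone against the exact call it targets:

```python
import re

# Same pattern/replacement pair as fix_torch_load() above
pattern = r"torch\.load\(\s*f,\s*map_location=map_location,\s*\*\*kwargs\s*\)"
replacement = "torch.load(f, map_location=map_location, weights_only=False, **kwargs)"

line = "    return torch.load(f, map_location=map_location, **kwargs)"
patched, n = re.subn(pattern, replacement, line)
print(n, patched.strip())  # 1 match, weights_only=False injected
```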
D. docker-compose.yml
Configured with ipc: host and privileged: true to ensure NVLink performance on DGX systems.
Click to view docker-compose.yml
version: "3.8"
services:
  xtts-api:
    image: coqui-xtts-for-dgx-spark
    container_name: xtts-voice-8020
    # Blackwell GPUs require high-speed IPC for NVLink
    ipc: host
    # Port mapping: Host:Container
    # Note: 8020 is the default internal port for xtts-api-server
    ports:
      - "8020:8020"
    privileged: true
    environment:
      - PYTORCH_ALLOC_CONF=expandable_segments:True
      # vLLM/Transformers 5.0 Blackwell-specific optimizations
      - VLLM_ATTENTION_BACKEND=FLASHINFER
      # GPU selection
      - CUDA_VISIBLE_DEVICES=all
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
      # Coqui & Torch settings
      - COQUI_TOS_AGREED=1
      - TORCH_ALLOW_ANY_PICKLE=1
      - TORCH_CUDNN_V8_API_ENABLED=1
      - XTTS_USE_DEEPSPEED=true
    # Legacy runtime support (optional, can be removed if using the deploy block)
    runtime: nvidia
    restart: unless-stopped
    ulimits:
      memlock: -1
      stack: 67108864
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    # Keep the container alive if running custom scripts
    tty: true
    stdin_open: true
    # Volume mappings for models and speakers
    # ENSURE THESE PATHS EXIST ON THE HOST - ADJUST THEM TO YOUR SETUP
    volumes:
      - /home/YOUR_SPARKY_NAME/Docker/xtts2/models:/app/TTS/models
      - /home/YOUR_SPARKY_NAME/Docker/xtts2/speakers:/app/TTS/speakers
3. Installation & Usage
- Add speaker samples: place .wav voice samples in your /speakers folder. You will need to download some WAV voice samples first, for example from:
  - Top Voice Models: Over 27,900+ Unique AI RVC Models
  - Explore Voice Clips for ElevenLabs Text to Speech AI
- Build the image:
docker build -t coqui-xtts-for-dgx-spark .
- Launch:
docker-compose up -d
- Verify: monitor the logs until you see the Uvicorn startup message:
docker logs -f xtts-voice-8020
Wait until you see
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8020 (Press CTRL+C to quit)
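If you want to script the startup check instead of watching the logs by hand, a small helper that matches the Uvicorn line works (this helper is my own sketch, not part of xtts-api-server):

```python
import re

def server_ready(log_text: str) -> bool:
    """Return True once the Uvicorn startup line appears in the logs."""
    return bool(re.search(r"Uvicorn running on http://[\d.]+:\d+", log_text))

logs = ("INFO: Application startup complete.\n"
        "INFO: Uvicorn running on http://0.0.0.0:8020 (Press CTRL+C to quit)\n")
print(server_ready(logs))  # -> True
```

Feed it the output of `docker logs xtts-voice-8020` in a polling loop.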
4. Testing the API
Once the server is up, test it with a simple curl command:
curl -X POST "http://localhost:8020/tts_to_audio/" \
-H "Content-Type: application/json" \
-d '{
"text": "System operational. All systems green on NVIDIA Blackwell.",
"speaker_wav": "YOUR_SAMPLE.wav",
"language": "en"
}' \
--output test_blackwell.wav
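The same request can be issued from Python with only the standard library; the endpoint path and JSON fields below are taken from the curl example (adjust host, port, and sample name to your setup):

```python
import json
import urllib.request

def build_tts_request(text: str, speaker_wav: str, language: str = "en",
                      url: str = "http://localhost:8020/tts_to_audio/"):
    """Build the POST request matching the curl example above."""
    payload = json.dumps({
        "text": text,
        "speaker_wav": speaker_wav,
        "language": language,
    }).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )

req = build_tts_request("System operational.", "YOUR_SAMPLE.wav")
# urllib.request.urlopen(req).read() would return the WAV bytes
```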
Next Steps:
This is far from perfect, but it is a solid version 1.0 that actually runs on the latest hardware. It took me several days to get working. If anyone finds further optimizations for the VLLM_ATTENTION_BACKEND, please let me know!