[DGX Spark] VibeVoice TTS + Streaming Voice Pipeline - Setup Guide

Sharing a working setup for real-time voice chat on DGX Spark using Microsoft’s VibeVoice-Realtime-0.5B TTS. I couldn’t find existing documentation for this combination, so I’m documenting what worked.

Environment

  • DGX Spark (GB10, CUDA 13.0, 128GB unified memory)
  • Ubuntu 24.04 (DGX Spark Version 7.2.3)
  • Python 3.11

Problem: PyTorch CUDA Not Available

A common issue on Spark: the installed PyTorch may be a CPU-only build (note the +cpu suffix in the version string), so CUDA is not available:

$ python -c "import torch; print(torch.cuda.is_available())"
False
$ python -c "import torch; print(torch.__version__)"
2.9.0+cpu

Solution

Install PyTorch with CUDA 13 support from the PyTorch wheel index (not the default PyPI index):

pip uninstall torch torchaudio torchvision -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

Verify:

$ python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}')"
CUDA: True, Device: NVIDIA GB10

VibeVoice Installation

cd ~/ggml-org
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
pip install -e .

Performance Results

Test command:

python demo/realtime_model_inference_from_file.py \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --txt_path demo/text_examples/1p_vibevoice.txt

Results:

Generation time: 26.00 seconds
Audio duration: 53.73 seconds
RTF (Real Time Factor): 0.48x

An RTF of 0.48 means the model generates audio about 2x faster than real time on the GB10 (53.73 seconds of audio in 26.00 seconds).
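For anyone comparing their own runs, the arithmetic behind those numbers can be sketched as below. This is just a worked example using the figures above; the function name `real_time_factor` is my own and not part of VibeVoice.

```python
def real_time_factor(generation_s: float, audio_s: float) -> float:
    """RTF = time spent generating / duration of the audio produced.

    Values below 1.0 mean generation is faster than playback.
    """
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return generation_s / audio_s


# Figures from the test run above
rtf = real_time_factor(generation_s=26.00, audio_s=53.73)
speedup = 1 / rtf  # how many seconds of audio per second of compute

print(f"RTF: {rtf:.2f}x, speed-up: {speedup:.2f}x real time")
# prints "RTF: 0.48x, speed-up: 2.07x real time"
```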

Full Voice Pipeline

I built a streaming pipeline with:

  • STT: whisper.cpp (large-v3-turbo, port 8025)
  • LLM: Ollama llama3.2:3b (port 11434, streaming)
  • TTS: VibeVoice-Realtime-0.5B (port 8027, streaming)

Key optimization: sentence-level streaming between the LLM and TTS. The pipeline buffers LLM tokens until it reaches a sentence boundary, then sends that sentence to TTS immediately while the LLM keeps generating. This achieves ~766ms to first audio.
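The buffering idea can be sketched roughly as follows. This is a minimal illustration and not the code from the linked repo: the function name `sentences_from_tokens` and the boundary rule (`.`, `!` or `?` followed by whitespace) are assumptions, and the token list stands in for a streaming LLM response.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: whitespace that follows '.', '!' or '?'.
# The lookbehind keeps the punctuation attached to the sentence.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


# Stand-in for tokens arriving from a streaming LLM response
demo_tokens = ["Hel", "lo the", "re. How", " are", " you? I", " am fine"]
demo_sentences = list(sentences_from_tokens(demo_tokens))
print(demo_sentences)
# prints ['Hello there.', 'How are you?', 'I am fine']
```

In a real pipeline each yielded sentence would be handed to the TTS request right away instead of being collected into a list. A simple rule like this splits early on abbreviations such as "Dr.", which costs a slightly odd pause but does not lose text.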

Full code available: [GitHub link]

Notes

  • The 0.5B Realtime model has 7 preset voices only (no voice cloning)
  • For voice cloning, use the 1.5B model (higher latency but fits easily in 128GB)
  • Flash Attention is not required; the model falls back to SDPA, which works fine

Hope this helps others getting started with voice AI on Spark.
