[DGX Spark] VibeVoice TTS + Streaming Voice Pipeline - Setup Guide

logosflux · January 4, 2026, 4:45pm

Sharing a working setup for real-time voice chat on DGX Spark using Microsoft’s VibeVoice-Realtime-0.5B TTS. Couldn’t find existing documentation for this combination, so documenting what worked.

Environment

DGX Spark (GB10, CUDA 13.0, 128GB unified memory)
Ubuntu 24.04 (DGX Spark Version 7.2.3)
Python 3.11

Problem: PyTorch CUDA Not Available

A common issue on Spark: the installed PyTorch may be a CPU-only build, so CUDA is not available:

$ python -c "import torch; print(torch.cuda.is_available())"
False
$ python -c "import torch; print(torch.__version__)"
2.9.0+cpu

Solution

Install PyTorch with CUDA 13 support from the PyTorch wheel index (not the default PyPI index):

pip uninstall torch torchaudio torchvision -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

Verify:

$ python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}')"
CUDA: True, Device: NVIDIA GB10
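
The two checks above can be combined into one small script. This is a minimal sketch, not part of the original setup: the helper names `is_cpu_only_build` and `report` are made up for illustration, and it assumes a CPU-only wheel carries the `+cpu` suffix seen in the output earlier (some builds have no suffix at all, so `torch.cuda.is_available()` remains the real test).

```python
def is_cpu_only_build(version: str) -> bool:
    """Return True if a PyTorch version string has the '+cpu' local tag."""
    # "2.9.0+cpu" -> tag "cpu"; "2.9.0+cu130" -> tag "cu130"; "2.9.0" -> tag ""
    _, _, tag = version.partition("+")
    return tag == "cpu"


def report() -> None:
    """Print a short diagnosis of the installed PyTorch build."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return

    print(f"torch version: {torch.__version__}")
    print(f"built against CUDA: {torch.version.cuda}")  # None on CPU-only builds
    if is_cpu_only_build(torch.__version__):
        print("CPU-only wheel detected; reinstall from the cu130 index.")
    elif torch.cuda.is_available():
        print(f"CUDA OK, device: {torch.cuda.get_device_name(0)}")
    else:
        print("CUDA build installed, but no usable GPU was found.")


if __name__ == "__main__":
    report()
```

Run it inside the same virtual environment you plan to use for VibeVoice, since a different environment may hold a different wheel.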

VibeVoice Installation

cd ~/ggml-org
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
pip install -e .

Performance Results

Test command:

python demo/realtime_model_inference_from_file.py \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --txt_path demo/text_examples/1p_vibevoice.txt

Results:

Generation time: 26.00 seconds
Audio duration: 53.73 seconds
RTF (Real Time Factor): 0.48x

The model generates audio about 2x faster than real time on the GB10 (53.73 s of audio in 26.00 s).
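
For reference, the RTF figure is generation time divided by audio duration, so values below 1.0 mean faster than real time. A small sketch of the arithmetic using the numbers reported above (the function name `real_time_factor` is only illustrative, it is not from the VibeVoice demo script):

```python
def real_time_factor(generation_s: float, audio_s: float) -> float:
    """RTF = time spent generating / duration of the audio produced."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return generation_s / audio_s


rtf = real_time_factor(26.00, 53.73)
print(f"RTF: {rtf:.2f}x")          # RTF: 0.48x
print(f"Speedup: {1 / rtf:.2f}x")  # Speedup: 2.07x
```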

Full Voice Pipeline

I built a streaming pipeline with:

STT: whisper.cpp (large-v3-turbo, port 8025)
LLM: Ollama llama3.2:3b (port 11434, streaming)
TTS: VibeVoice-Realtime-0.5B (port 8027, streaming)

Key optimization: sentence-level streaming between the LLM and TTS. Tokens are buffered until a sentence boundary, then the completed sentence is sent to TTS immediately while the LLM keeps generating. This achieves ~766ms to first audio.
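
The sentence-level buffering idea can be sketched roughly like this. It is an illustration under assumptions, not the code from the linked repository: `stream_sentences` is a hypothetical helper, the boundary rule (`.`, `!` or `?` followed by whitespace) is a simplification that will mis-split abbreviations such as "Dr. Smith", and the token source is a plain list standing in for a streaming LLM response.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at '.', '!' or '?' followed by whitespace (simplified rule).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer incoming tokens and yield each sentence as soon as it is complete."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence; the last part
        # is still being generated, so it stays in the buffer.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever is left when the LLM stops generating.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    fake_llm_tokens = ["Hello", " there", ".", " How", " are", " you", "?", " Fine"]
    for sentence in stream_sentences(fake_llm_tokens):
        # In a real pipeline this is where the sentence would go to the TTS server.
        print(repr(sentence))
```

With this rule a sentence is only released once the first token of the next sentence arrives, which costs one token of delay but avoids splitting on a period in the middle of something like "3.2".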

Full code available: [GitHub link]

Notes

The 0.5B Realtime model has 7 preset voices only (no voice cloning)
For voice cloning, use the 1.5B model (higher latency but fits easily in 128GB)
Flash Attention not required - falls back to SDPA which works fine

Hope this helps others getting started with voice AI on Spark.
