Sharing a working setup for real-time voice chat on DGX Spark using Microsoft’s VibeVoice-Realtime-0.5B TTS. Couldn’t find existing documentation for this combination, so documenting what worked.
Environment
- DGX Spark (GB10, CUDA 13.0, 128GB unified memory)
- Ubuntu 24.04 (DGX Spark Version 7.2.3)
- Python 3.11
Problem: PyTorch CUDA Not Available
A common issue on Spark: a plain pip install of PyTorch pulls the CPU-only build, so CUDA is unavailable:
$ python -c "import torch; print(torch.cuda.is_available())"
False
$ python -c "import torch; print(torch.__version__)"
2.9.0+cpu
Solution
Install PyTorch with CUDA 13 support from the official PyTorch cu130 wheel index:
pip uninstall torch torchaudio torchvision -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
Verify:
$ python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}')"
CUDA: True, Device: NVIDIA GB10
VibeVoice Installation
cd ~/ggml-org
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
pip install -e .
Performance Results
Test command:
python demo/realtime_model_inference_from_file.py \
--model_path microsoft/VibeVoice-Realtime-0.5B \
--txt_path demo/text_examples/1p_vibevoice.txt
Results:
Generation time: 26.00 seconds
Audio duration: 53.73 seconds
RTF (Real Time Factor): 0.48x
The model generates audio roughly 2x faster than real time on the GB10.
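For reference, RTF is just generation time divided by audio duration (values below 1.0 mean faster than real time). A quick check with the numbers above:

```python
def rtf(generation_s: float, audio_s: float) -> float:
    """Real Time Factor: seconds of compute per second of audio produced."""
    return generation_s / audio_s

# Numbers from the test run above
print(f"RTF: {rtf(26.00, 53.73):.2f}x")          # 0.48x
print(f"Speedup vs. real time: {53.73 / 26.00:.1f}x")  # ~2.1x
```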
Full Voice Pipeline
I built a streaming pipeline with:
- STT: whisper.cpp (large-v3-turbo, port 8025)
- LLM: Ollama llama3.2:3b (port 11434, streaming)
- TTS: VibeVoice-Realtime-0.5B (port 8027, streaming)
Key optimization: sentence-level streaming between the LLM and TTS. Tokens are buffered until a sentence boundary, then the completed sentence is streamed to TTS immediately while the LLM keeps generating. This achieves ~766 ms to first audio.
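The buffering step can be sketched as a small generator. This is a minimal illustration of the idea, not the code from my repo; the regex and function name are my own, and real sentence splitting (abbreviations, decimals, quotes) needs more care:

```python
import re

# Sentence ends at ., !, or ? followed by whitespace or end of buffer.
# Deliberately naive: "Dr. Smith" or "3.14" would be split incorrectly.
SENTENCE_END = re.compile(r"([.!?])(\s|$)")

def sentence_stream(token_iter):
    """Accumulate LLM tokens; yield each sentence as soon as its
    boundary appears, so TTS can start while the LLM is still generating."""
    buf = ""
    for tok in token_iter:
        buf += tok
        m = SENTENCE_END.search(buf)
        while m:
            # Keep the punctuation, drop the trailing whitespace.
            sentence, buf = buf[: m.end(1)], buf[m.end():].lstrip()
            yield sentence.strip()
            m = SENTENCE_END.search(buf)
    if buf.strip():            # flush any trailing partial sentence
        yield buf.strip()

# Example: tokens arrive in arbitrary chunks, sentences come out whole.
tokens = ["Hello ", "world. How", " are you? I'm fine"]
print(list(sentence_stream(tokens)))
# ['Hello world.', 'How are you?', "I'm fine"]
```

In the real pipeline, each yielded sentence is POSTed to the TTS server while the loop continues consuming LLM tokens.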
Full code available: [GitHub link]
Notes
- The 0.5B Realtime model has only 7 preset voices (no voice cloning)
- For voice cloning, use the 1.5B model (higher latency but fits easily in 128GB)
- Flash Attention is not required - the model falls back to SDPA, which works fine
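On the SDPA fallback: PyTorch 2.x ships scaled dot-product attention built in, and `F.scaled_dot_product_attention` selects the best available backend at runtime, so no separate flash-attn package is needed. A quick standalone check (shapes here are arbitrary, just for illustration):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) - arbitrary demo shapes
q = k = v = torch.randn(1, 4, 8, 16)

# Dispatches to flash / memory-efficient / math backends automatically,
# depending on what the current device and dtype support.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 4, 8, 16])
```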
Hope this helps others getting started with voice AI on Spark.