Best Mix of Models/Services on a Single Spark?

Thank you all for the work and discussion from the community; it's been very helpful in improving the usability of my Spark. At the moment I do a lot of spinning containers up and down, reconnecting to my Spark to run demos for people I'm working with, etc.

I’d like to use the Spark in conjunction with a Pi 5 or other lightweight hardware to work on systems that can scale when run in the cloud and pointed at standard APIs, or be used in small-scale cases where privacy and data ownership are a concern.

In your experience working with the Spark so far, how would you try to achieve:

  • API endpoint to enable a chat interface (like Open-WebUI or LibreChat), preferably with vision capability and some prompt-injection security
  • API endpoint for various applications that use text generation (Open Deep Research, Speakr, opencode, etc.; this could probably be the same endpoint as chat)
  • Document intelligence endpoint (something like docling) for chat and RAG
  • ASR with segmentation, probably WhisperX (it didn’t appear to have an ARM/CUDA build, and I haven’t had a chance to make that happen, but it’s the best one I’ve found so far)
  • Preferably with API keys and a way to track utilization
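For the chat and text-generation items above, most of the listed frontends speak the OpenAI chat-completions schema, so whatever serves the models only needs to expose that one shape. A minimal sketch of the request payload a vision-capable chat would send (the endpoint address, key, and model name here are placeholders, not anything from this thread):

```python
import json

API_BASE = "http://spark.local:4000/v1"  # hypothetical gateway address
API_KEY = "sk-local-example"             # per-user key, so usage can be tracked

def vision_chat_payload(prompt: str, image_url: str, model: str = "local-vlm") -> dict:
    """Build a chat request mixing text and an image, per the OpenAI vision message schema."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

payload = vision_chat_payload("What is in this image?", "https://example.com/cat.png")
print(json.dumps(payload, indent=2))
```

Anything that accepts this payload at `{API_BASE}/chat/completions` with a bearer key would cover both the chat and generic text-generation endpoints at once.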

Thanks!

vllm-playground looks really promising, but I haven’t played with it yet. They just integrated vllm-omni support, I believe.


I use:

  • llama-swap to switch inference engines/models on the fly
  • LiteLLM Proxy as a single OpenAI-compatible endpoint/gateway: it routes calls to models on the Spark, my other servers, and cloud models, with fallback. It also supports Claude Code out of the box and can act as a proxy, and it keeps utilization stats and tracks costs (where applicable).
  • OpenWebUI for chat/RAG/tool calling
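To make the LiteLLM part concrete, here is a rough sketch of what its proxy config might look like for this setup; the model names, addresses, and keys are all hypothetical placeholders, not my actual config:

```yaml
# LiteLLM proxy config.yaml -- hypothetical names/addresses
model_list:
  - model_name: spark-chat              # name clients request
    litellm_params:
      model: openai/local-vlm           # OpenAI-compatible backend (llama-swap) on the Spark
      api_base: http://spark.local:8080/v1
      api_key: "none"
  - model_name: cloud-fallback
    litellm_params:
      model: gpt-4o-mini                # cloud model used as fallback

general_settings:
  master_key: sk-example-master         # enables per-key auth and usage tracking
```

With something like this, OpenWebUI and the other apps all point at the one LiteLLM endpoint, and per-key usage shows up in its tracking.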