Building Local + Hybrid LLMs on DGX Spark That Outperform Top Cloud Models

From “Can It Run?” to Sovereign Hybrid RAG on DGX Spark – Qwen3‑80B + vLLM + LiteLLM (Live Stack)


1. Why I’m Sharing This Stack

My DGX Spark is now running a full hybrid RAG stack with Qwen3‑80B at roughly 45 tok/s, under 150 ms end‑to‑end latency, and zero cloud on the critical path—and it already feels stronger than cloud assistants on privacy, cost, and customization. Most DGX Spark threads focus on “how do I get the model to load without OOM?”; I’m now at the point where the stack is stable, and I’m trying to turn it into a practical, privacy‑first assistant. This is my first serious attempt at a vLLM‑based hybrid RAG stack on DGX Spark, so I’m very open to suggestions, corrections, and better patterns from everyone here (including NVIDIA folks).

Right now I’m running Qwen3‑Next‑80B‑A3B at around 45 tok/s, with 115–120 GB of unified memory in use and roughly 95 % GPU utilization; Qwen3‑Embedding‑0.6B and Qwen3‑Reranker‑0.6B add under 1.5 GB VRAM overhead. My goal is to see how far I can go with a local + hybrid approach before I really need to lean on cloud LLMs. I don’t have any experience with LangChain or LangGraph yet—right now I’m just using LiteLLM as a simple router with per‑model settings—so any starter patterns or anti‑patterns for Mixture‑of‑Models orchestration would be very welcome.

1.1 Cloud vs Local Summary

| Dimension | Cloud Providers | My Local + Hybrid Stack |
|---|---|---|
| Cost | Recurring per-token spend (expensive for deep research or code) | 0.00 USD for ~75 % of queries after implementation |
| Latency | 400–2000 ms (network + queue) | <150–300 ms end-to-end on typical queries |
| Privacy / sovereignty | Prompts and documents leave the machine; compliance is hard | Zero egress (fallbacks only on explicit failure) |
| Customization | Fixed models, black-box routing | Full control of temperature, routing depth, RAG limits, corpus composition |
| Private data | Upload limits, no persistent memory | Unlimited Qdrant corpus (my PDFs, DOCX, codebases, internal notes) |
| RAG and corpus quality | Generic web only | Targeted academic/technical expansion, multi-hop citation graph |
| Open-book | Closed-book benchmarks | Permanently open-book with CUDA docs, Blackwell papers, key arXiv papers in Qdrant |
| Implementation-grade answers | Often miss exact implementation details | Retrieves exact passages and generates implementation-grade answers |

2. What I Currently Have Running

Everything sits behind pipelines:9099/v1, fronted by Open WebUI and orchestrated by pipe-fixed.py plus a few helper modules.

2.1 Live Containers and Roles

CONTAINER ID   IMAGE                                 STATUS          PORTS
f220e0ce3b2c   ghcr.io/berriai/litellm:main-latest   Up 3 h          0.0.0.0:4000->4000/tcp
41f392f6ff70   pipeline-deps:latest                  Up 33 m         0.0.0.0:9099->9099/tcp
71d157f02224   ghcr.io/open-webui/open-webui:cuda    Up 3 h (healthy)0.0.0.0:8080->8080/tcp
71d157f02224   ghcr.io/open-webui/open-webui:cuda    Up 3 h (healthy) 0.0.0.0:8080->8080/tcp
ea2f7071c12a   qdrant/qdrant:v1.11.0                 Up 3 h          0.0.0.0:6333->6333/tcp
cdd82e1d2238   searxng/searxng:latest                Up 21 h         0.0.0.0:8888->8080/tcp
5466fb74b97f   caddy:2-alpine                        Up 21 h
d801572ab84d   valkey/valkey:8-alpine                Up 21 h         6379/tcp
0423861c8044   nvcr.io/nvidia/vllm:25.12.post1-py3   Up 21 h         0.0.0.0:8025->8000/tcp
bde1af729d3a   nvcr.io/nvidia/vllm:25.12.post1-py3   Up 21 h         0.0.0.0:8020->8000/tcp
5f4c61cb0d7a   nvcr.io/nvidia/vllm:25.12.post1-py3   Up 2 d          0.0.0.0:8005->8000/tcp

What each one does (today):[1]

  • openwebui – User‑facing interface (chat, uploads, model selection).
  • pipeline-deps (pipelines) – Exposes pipelines:9099/v1 and runs my Hybrid‑RAG inlet and routing logic (pipe-fixed.py + helpers).
  • litellm – Central router for all models (local and hosted), handles temperature, top_p, max_tokens, and fallbacks.
  • qdrant – Local vector store for static documents, notes, and code.
  • searxng – Meta search engine used by web.py for live web search (JSON).
  • embedding – vLLM service running Qwen3‑Embedding‑0.6B for dense embeddings.
  • reranker – vLLM service running Qwen3‑Reranker‑0.6B for cross‑encoder reranking.
  • vllm-qwen-80 – vLLM service running Qwen3‑Next‑80B‑A3B for main generation.
  • valkey – Valkey/Redis used for caching and coordination.
  • caddy – Simple reverse proxy and TLS frontend.

Resources: memory sits around 115–120 GB out of 128 GB unified; GPU utilization is about 95 %. Embedding + reranker add under 1.5 GB VRAM on top of the 80B model.[1]


3. Routing and RAG Profiles

Open WebUI sends chat requests to pipelines:9099/v1, which calls pipe-fixed.py.[1]

  • pipe-fixed.py extracts the user message, applies MODEL_CLASSIFICATION_RULES, and picks a RAG profile from MODE_CONFIG.[1]
  • These limits are applied dynamically based on the selected model name in Open WebUI.
  • The profile controls how deep the system goes into web, Qdrant, and academic retrieval and how many chunks are reranked.[1]
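
For context, pipe-fixed.py follows the standard Open WebUI Pipelines shape: a class with a pipe() method that receives the user message and the selected model id. Here is a stripped-down skeleton; everything inside pipe() is compressed into comments, and the placeholder return value is mine, not the real code:

```python
# Stripped-down skeleton of an Open WebUI pipeline; the real pipe-fixed.py
# delegates to web.py, qdrant.py, academic.py, and reranker.py.
from typing import Generator, Iterator, List, Union


class Pipeline:
    def __init__(self):
        self.name = "Hybrid RAG Router"

    async def on_startup(self):
        # Warm up HTTP clients for SearXNG, Qdrant, embedding, reranker, and LiteLLM here.
        pass

    def pipe(
        self, user_message: str, model_id: str, messages: List[dict], body: dict
    ) -> Union[str, Generator, Iterator]:
        profile = "Fast"  # real code: MODEL_CLASSIFICATION_RULES + MODE_CONFIG applied to model_id
        context = []      # real code: parallel web / Qdrant / academic retrieval, then reranking
        # Final step in the real code: call LiteLLM with the profile's temperature / max_tokens.
        return f"[{profile}] {user_message} ({len(context)} context chunks)"
```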

3.1 MODE_CONFIG (Simplified)

| Profile | weblimit | qdrantlimit | academiclimit | reranktop |
|---|---|---|---|---|
| Fast | 5 | 10 | 5 | 10 |
| Auto (Standard) | 10 | 15 | 8 | 15 |
| Expert | 100 | 25 | 15 | 25 |
| Heavy / Moe | 250 | 30 | 20 | 30 |
| Think | 75 | 20 | 12 | 20 |
| Code-Fast | 5 | 10 | 5 | 10 |
| Code-Expert | 100 | 25 | 15 | 25 |
| Code-Heavy | 250 | 30 | 20 | 30 |

Fast and Auto act as “light RAG”; Expert, Heavy, Moe, and Code‑Heavy go to full 250 web + 30 Qdrant + 20 academic depth.[1]

(Internally this is driven by simple rules such as “if model name matches Code-.* → use Code‑Heavy profile”, but I’ve kept the code out of the post for brevity.)
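
For the curious, here is a toy version of that dispatch. The values are copied from the table above, but the exact structure of MODEL_CLASSIFICATION_RULES and MODE_CONFIG in my module differs, so treat this as illustrative only:

```python
import re

# Illustrative subset of MODE_CONFIG (values from the table above).
MODE_CONFIG = {
    "Fast":       {"weblimit": 5,   "qdrantlimit": 10, "academiclimit": 5,  "reranktop": 10},
    "Expert":     {"weblimit": 100, "qdrantlimit": 25, "academiclimit": 15, "reranktop": 25},
    "Code-Heavy": {"weblimit": 250, "qdrantlimit": 30, "academiclimit": 20, "reranktop": 30},
}

# Illustrative classification rules: first regex that matches the model name wins.
MODEL_CLASSIFICATION_RULES = [
    (re.compile(r"code-heavy", re.IGNORECASE), "Code-Heavy"),
    (re.compile(r"expert|heavy|moe", re.IGNORECASE), "Expert"),
]

def pick_profile(model_name: str) -> dict:
    """Return the RAG profile for a model name, defaulting to the light profile."""
    for pattern, profile in MODEL_CLASSIFICATION_RULES:
        if pattern.search(model_name):
            return MODE_CONFIG[profile]
    return MODE_CONFIG["Fast"]

print(pick_profile("Code-Heavy-Superqwen"))  # -> 250 web / 30 Qdrant / 20 academic / 30 rerank
```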

3.2 Parallel Retrieval

For each query, pipe-fixed.py launches three asynchronous retrievals:[1]

  • web.py → SearXNG at http://searxng:8888 (format=json, category “it” for code, “general” otherwise).
  • qdrant.py → local Qdrant, using embed_text() via Qwen3‑Embedding‑0.6B at http://embedding:8025.
  • academic.py → arXiv, PubMed, CORE, Semantic Scholar, OpenAlex (API keys, rate‑limits, multi‑hop expansion).

All results are merged into a single list of chunks for reranking.[1]
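
The fan-out itself is a plain asyncio.gather over the three helpers. The sketch below is heavily simplified: error handling, timeouts, the code-vs-general SearXNG category switch, and the academic multi-hop logic are stripped out, the academic call is stubbed, and the URLs, collection name, and model id are taken from this post or assumed rather than guaranteed to match my exact compose file:

```python
import asyncio
import httpx

EMBED_URL = "http://embedding:8025/v1/embeddings"   # vLLM's OpenAI-compatible embeddings route
QDRANT_URL = "http://qdrant:6333/collections/corpus/points/search"  # "corpus" is a placeholder

async def embed_text(client: httpx.AsyncClient, text: str) -> list[float]:
    r = await client.post(EMBED_URL, json={"model": "Qwen/Qwen3-Embedding-0.6B", "input": text})
    return r.json()["data"][0]["embedding"]

async def search_web(client: httpx.AsyncClient, query: str, limit: int) -> list[dict]:
    # SearXNG JSON API; the real web.py switches category to "it" for code queries.
    r = await client.get("http://searxng:8888/search",
                         params={"q": query, "format": "json", "categories": "general"})
    return r.json().get("results", [])[:limit]

async def search_qdrant(client: httpx.AsyncClient, query: str, limit: int) -> list[dict]:
    vector = await embed_text(client, query)
    r = await client.post(QDRANT_URL, json={"vector": vector, "limit": limit, "with_payload": True})
    return r.json().get("result", [])

async def search_academic(client: httpx.AsyncClient, query: str, limit: int) -> list[dict]:
    return []  # academic.py fans out to arXiv, PubMed, CORE, etc.; stubbed here

async def retrieve_all(query: str, profile: dict) -> list[dict]:
    async with httpx.AsyncClient(timeout=30) as client:
        web, local, academic = await asyncio.gather(
            search_web(client, query, profile["weblimit"]),
            search_qdrant(client, query, profile["qdrantlimit"]),
            search_academic(client, query, profile["academiclimit"]),
        )
    return web + local + academic  # merged chunk list handed to the reranker
```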

3.3 Reranking and Context Injection

  • reranker.py sends the query + merged chunks to Qwen3‑Reranker‑0.6B at http://reranker:8020.[1]
  • The reranker returns a list sorted by relevance score; I keep the top‑k (10–30, controlled by reranktop) and inject those chunks into the LLM prompt context.[1]
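
The reranker call is tiny. This sketch assumes the vLLM container exposes its Cohere/Jina-style rerank route and that the model was served under its Hugging Face id; the exact endpoint path and response shape vary between vLLM versions, so check yours before copying:

```python
import httpx

RERANK_URL = "http://reranker:8020/v1/rerank"  # route name depends on the vLLM version

def rerank(query: str, chunks: list[str], top_k: int) -> list[str]:
    """Score merged chunks against the query and keep the top_k most relevant."""
    r = httpx.post(
        RERANK_URL,
        json={"model": "Qwen/Qwen3-Reranker-0.6B", "query": query, "documents": chunks},
        timeout=60,
    )
    r.raise_for_status()
    results = r.json()["results"]  # e.g. [{"index": 3, "relevance_score": 0.92}, ...]
    ranked = sorted(results, key=lambda x: x["relevance_score"], reverse=True)
    return [chunks[item["index"]] for item in ranked[:top_k]]  # top_k = reranktop (10-30)
```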

3.4 Generation and Fallbacks

  • LiteLLM at http://litellm:4000/v1 routes the request to the chosen model with the appropriate temperature and max_tokens.[2][1]
  • The primary path is Qwen3‑Next‑80B‑A3B on vLLM (port 8005), which produces grounded answers with citations.[1]
  • Responses go back to Open WebUI, including source references.[1]

Fallbacks to cloud models are reserved for exceptional cases—for example, timeouts, max_tokens limits, or explicit “use‑cloud” profiles—and are designed to keep cloud usage to a small fraction of total queries.
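
Because the LiteLLM proxy speaks the OpenAI protocol, the generation step is just an OpenAI client pointed at http://litellm:4000/v1. The sketch below shows the shape of my fallback-on-failure logic; the model aliases ("Superqwen-Expert", "sonar-pro") and the API key are placeholders rather than my actual litellm_config.yaml entries:

```python
from openai import OpenAI

client = OpenAI(base_url="http://litellm:4000/v1", api_key="sk-placeholder")  # LiteLLM proxy key

def generate(question: str, context_chunks: list[str], profile: dict) -> str:
    """Ask the primary local model; fall back to a hosted model only on explicit failure."""
    prompt = ("Answer using only the sources below and cite them.\n\n"
              + "\n\n".join(context_chunks)
              + f"\n\nQuestion: {question}")
    for model in ("Superqwen-Expert", "sonar-pro"):  # local first, cloud only as a fallback
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=profile.get("temperature", 0.45),
                max_tokens=profile.get("max_tokens", 6144),
                timeout=120,
            )
            return resp.choices[0].message.content
        except Exception:
            continue  # timeout or hard failure -> try the next model in the chain
    raise RuntimeError("all models in the fallback chain failed")
```

(LiteLLM can also express fallbacks in its own config; doing it in the pipeline is just one of the two options.)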


4. Model Routing, Temperatures, and Hosted MoM Options

I treat MODEL_CLASSIFICATION_RULES, MODE_CONFIG, and litellm_config.yaml as a simple dispatcher for models and RAG profiles.

4.1 Local SuperQwen Variants

From litellm_config.yaml, my local SuperQwen profiles (served via http://vllm-qwen-80:8000/v1) look roughly like this:

  • Auto / Fast / Expert / Heavy / Moe → all backed by openai/Superqwen, with temperatures in the 0.3–0.6 range and max_tokens from 512 up to 8192.
  • Code‑Fast / Code‑Expert / Code‑Heavy → same backend with lower temperatures (0.2–0.3) and larger max_tokens for code‑heavy tasks.

Current temperature by task:[2][1]

| Profile family | Typical temperature | Max tokens (approx.) | Thinking |
|---|---|---|---|
| Code-* | 0.20–0.30 | 2 048–8 192 | On for Expert/Heavy |
| Research / Expert | 0.45 | 6 144 | On |
| Heavy / Think / Moe | 0.50 | up to 8 192 | On |
| Fast | 0.30 | 512–2 048 | Off |
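
In the pipeline these just become request parameters. Below is a condensed Python view of the table, plus the way Qwen3 documents toggling thinking on OpenAI-compatible servers (a chat-template kwarg passed via extra_body); whether that toggle applies depends on which Qwen3 variant you serve, so verify it against your model:

```python
# Per-profile sampling settings, mirroring the table above (ranges collapsed to single values).
GENERATION_PROFILES = {
    "code":     {"temperature": 0.25, "max_tokens": 8192, "thinking": True},   # Code-Expert/Heavy
    "research": {"temperature": 0.45, "max_tokens": 6144, "thinking": True},   # Research / Expert
    "heavy":    {"temperature": 0.50, "max_tokens": 8192, "thinking": True},   # Heavy / Think / Moe
    "fast":     {"temperature": 0.30, "max_tokens": 2048, "thinking": False},  # Fast
}

def request_kwargs(profile_name: str) -> dict:
    """Translate a profile into kwargs for an OpenAI-compatible chat.completions call."""
    p = GENERATION_PROFILES[profile_name]
    kwargs = {"temperature": p["temperature"], "max_tokens": p["max_tokens"]}
    if p["thinking"]:
        # Qwen3's documented chat-template switch; only meaningful for models that support it.
        kwargs["extra_body"] = {"chat_template_kwargs": {"enable_thinking": True}}
    return kwargs
```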

4.2 Hosted Models I’m Experimenting With

Through LiteLLM I also have a universe of hosted models configured, mainly for future MoM experiments:[2]

  • xAI / Grok: Grok‑4, Grok‑4 Fast Reasoning, Grok‑4.1 Fast Reasoning, Code‑Grok4, Code‑Grok4‑Fast.
  • Perplexity: Perplexity Sonar, Sonar Pro, Sonar Reasoning Pro, Sonar Deep Research (return_citations: true).
  • NVIDIA hosted: nvidia_nim/meta/llama3-70b-instruct, nvidia_nim/qwen/qwen3-next-80b-a3b-thinking, GLM‑4.7‑style code models, Nemotron‑3‑Nano‑30B‑A3B (NVFP4).
  • Hugging Face router: Qwen2.5‑72B‑Instruct via router.huggingface.co.
  • DashScope (Qwen family): Qwen‑Turbo, Qwen‑Plus, Qwen‑Max, Qwen‑VL, and Qwen3 models such as Qwen3‑32B, Qwen3‑30B‑A3B, Qwen3‑235B‑A22B with enable_thinking flags.[2]

Right now I don’t have LangChain or LangGraph on top of this; it’s just LiteLLM plus routing rules. I’m especially interested in providers that expose temperature, top_p, max_tokens, and “thinking” flags cleanly so they can slot into a Mixture‑of‑Models graph later.[2][1]

I’m also very interested in memory usage and real‑world performance of hybrid MoM setups with Nemotron‑3‑Nano‑30B‑A3B (NVFP4), GLM‑4.x, and strong GPT‑class OSS models under a 120 GB unified‑memory budget. My hunch is that a single Spark can be very competitive locally, and that a 2–6 node Spark cluster would be exceptional for MoM, but I’d really like to see real numbers from people who have tried it.[1]


5. Why I Chose a Qwen3‑Aligned RAG Stack

Instead of the default Open WebUI RAG (MiniLM/SBERT + cosine), I switched to a fully Qwen3‑aligned stack: Qwen3‑Embedding‑0.6B + Qwen3‑Reranker‑0.6B + Qwen3‑Next‑80B‑A3B.[1]

5.1 RAG Stack Comparison

| Feature | Open WebUI Default RAG | My Qwen3 Stack (Embedding + Reranker + LLM) |
|---|---|---|
| Embedding model | Generic (all-MiniLM, Sentence-BERT) | Qwen3-Embedding-0.6B – multilingual, domain-strong, family-aligned |
| Retrieval method | Dense cosine similarity only | Dense + cross-encoder reranking (Qwen3-Reranker-0.6B) |
| Relevance lift | Baseline | Roughly 15–30 % better on technical/multilingual queries in practice |
| Model coherence | Mixed models → inconsistency | Same Qwen3 family → embedding/reranker/LLM synergy |
| Latency / memory | Higher overhead if using cloud embeddings | <1.5 GB VRAM total for embed + reranker, sub-100 ms local RAG |
| Customization | Limited | Full vLLM tuning, LoRA fine-tuning possible |
| Accuracy on my data | Good for general text | Superior after domain-specific corpus + tuning |

My working assumption is that keeping everything in the same model family reduces semantic drift between embedding, reranking, and generation; I’d love to see counter‑examples or benchmarks (e.g., BGE‑M3 vs Qwen3‑Reranker on DGX Spark) from others.


6. Advantages Table (Cloud vs DGX Spark Hybrid)

These rows are meant as a qualitative comparison, not precise pricing.

6.1 Cloud LLMs vs My DGX Spark Hybrid

| Feature | OpenAI GPT-4-class | Anthropic Claude-class | Google Gemini-class | Perplexity Sonar Pro | xAI Grok-4 Heavy | My DGX Spark Hybrid |
|---|---|---|---|---|---|---|
| Cost | Per-token pricing | Per-token pricing | Per-token pricing | Per-token or Pro tier | Roughly fixed monthly | 0.00 USD per query after hardware |
| Latency | About 800–2000 ms | About 600–1500 ms | About 500–1200 ms | About 400–900 ms | About 300–800 ms | <500 ms end-to-end |
| Data privacy | Prompts go to OpenAI | Prompts go to Anthropic | Prompts go to Google | Prompts go to Perplexity | Prompts go to xAI | Zero egress by default |
| Private docs | File caps, limited memory | Limited | Varies, often small caps | No persistent private KB | None | Unlimited Qdrant corpus |
| Academic live search | No | No | Limited | No | No | arXiv, PubMed, CORE, Semantic Scholar, etc. via academic.py |
| Reranking | None / proprietary | None / proprietary | None / proprietary | Proprietary | None | Qwen3-Reranker-0.6B two-stage RAG |
| Temperature / routing | Mostly fixed | Limited | Limited | Black-box | Limited | Full per-profile control |
| Rate limits | Hard limits | Hard limits | Hard / RPM caps | Hard limits | Hard limits | None beyond my hardware |

Cloud limitations in one sentence each (from my perspective):

  • OpenAI / Anthropic / Google: I pay for every token, my data leaves the box, private files are constrained, and I don’t get full per‑task temperature and routing control.
  • Perplexity: dynamic RAG and academic routing are excellent, but I’m locked into their models, pricing, and black‑box decisions.
  • Grok‑4 Heavy: a flat monthly fee for “uncensored” reasoning that I can get close to, or match, on my own hardware for near‑zero marginal cost.

Perplexity’s “secret sauce” is dynamic RAG + intelligent model routing + academic search; my goal with this stack is to do the same thing locally, but with my own reranker, my own corpus, and my own fallback logic.


7. Corpus Building and Fine‑Tuning Plans

7.1 Network‑Effect Corpus Building

For domains like machine learning and AI, I’m trying a network‑effect corpus strategy:

  1. Use academic.py to fetch around 20 key LLM/ML papers (arXiv, Semantic Scholar).
  2. Expand by authors, co‑authors, references, and citations 5–10 hops out.
  3. Chunk and embed everything with Qwen3‑Embedding‑0.6B and store it in Qdrant.

This already feels more like a domain “bible” than a generic web crawl.
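
The ingestion side is mechanical once the papers are fetched: chunk, embed, upsert. Here is a simplified sketch with qdrant-client; the collection name, chunk sizes, and the 1024-dimension vector size for Qwen3-Embedding-0.6B are assumptions to verify against your own embedding server, and the real pipeline uses smarter, structure-aware chunking:

```python
import uuid

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

embedder = OpenAI(base_url="http://embedding:8025/v1", api_key="none")  # vLLM embedding server
qdrant = QdrantClient(url="http://qdrant:6333")

COLLECTION = "ml_bible"  # placeholder collection name
DIM = 1024               # Qwen3-Embedding-0.6B output size; verify against your server

def chunk(text: str, size: int = 1200, overlap: int = 200) -> list[str]:
    """Naive character chunking; good enough to illustrate the flow."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def ingest(doc_id: str, text: str) -> None:
    if not qdrant.collection_exists(COLLECTION):
        qdrant.create_collection(
            collection_name=COLLECTION,
            vectors_config=VectorParams(size=DIM, distance=Distance.COSINE),
        )
    chunks = chunk(text)
    embeddings = embedder.embeddings.create(model="Qwen/Qwen3-Embedding-0.6B", input=chunks)
    points = [
        PointStruct(
            id=str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_id}-{i}")),  # stable per-chunk id
            vector=item.embedding,
            payload={"doc": doc_id, "chunk": i, "text": c},
        )
        for i, (c, item) in enumerate(zip(chunks, embeddings.data))
    ]
    qdrant.upsert(collection_name=COLLECTION, points=points)
```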

7.2 Fine‑Tuning Qwen3‑Embedding‑0.6B (Planned)

I haven’t fine‑tuned yet; my plan is:

  • Build triplets (query, positive_chunk, negative_chunk) from Qdrant and historical queries.
  • Add domain prompts such as “Represent this GPU programming document for semantic search…”.
  • Use LoRA with SWIFT/DeepSpeed or SentenceTransformers for contrastive training.
  • Evaluate with MRR@10, NDCG, and RAGAS, then deploy by updating start-embedder-reranker.yaml.

Any practical advice on hyperparameters, batch sizes, or DGX‑specific pitfalls would be very welcome.
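
For reference, my current mental model of the SentenceTransformers path is below. The triplets are hard-coded placeholders (the real ones would be mined from Qdrant and query logs), the script does a plain contrastive fine-tune rather than LoRA (LoRA would come via SWIFT or PEFT on top), and I have not yet verified how well the 0.6B embedder trains this way on the Spark:

```python
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

# Placeholder triplets: (query, positive chunk, hard negative chunk).
triplets = [
    ("how does NVFP4 quantization work on Blackwell?",
     "NVFP4 is a 4-bit floating-point format used on Blackwell-class GPUs ...",
     "The arXiv API returns results as an Atom feed ..."),
]

# Assumes the Hugging Face checkpoint loads directly as a SentenceTransformer.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

train_examples = [InputExample(texts=[q, pos, neg]) for q, pos, neg in triplets]
loader = DataLoader(train_examples, shuffle=True, batch_size=8)
loss = losses.MultipleNegativesRankingLoss(model)  # contrastive objective over (q, pos, neg)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=1,
    warmup_steps=100,
    output_path="qwen3-embed-0.6b-domain",  # then point start-embedder-reranker.yaml at this
)
```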


8. Prompt Flow Diagram

This diagram (image in the original post) matches the current routing and RAG flow.

Key advantage: the explicit reranking step eliminates noisy chunks that dense retrieval alone would include.

9. Mixture of Models (MoM) and Open‑Book RAG – Where I Want to Go Next

I’m not using LangChain or LangGraph yet; right now everything is LiteLLM plus my own routing logic. My next goal is to turn this into a true Mixture of Models graph, where local and hosted models work together instead of competing.[2][1]

9.1 Planned MoM Flow (LangChain + LangGraph)

The flow I’d like to build (feedback welcome):[1]

  • A complex research query arrives → LiteLLM routes first to a local MoE‑style Qwen3‑80B profile for an initial reasoning pass.
  • If the graph detects high uncertainty or a very demanding task, it spawns parallel branches:
    • Branch A: a “Think” model such as Qwen3‑Next‑80B‑thinking via NVIDIA NIM for deeper chain‑of‑thought.
    • Branch B: Perplexity Sonar Deep Research for fresh web citations.
    • Branch C: Code‑oriented models such as Code‑Grok4, GLM‑4.x, or Qwen3‑235B when there is heavy code or math.
  • Outputs from these branches are merged, reranked, and synthesized back into a single answer by the strongest local model, using my own RAG context.

The idea is that >80 % of the compute still happens locally, and cloud models act as optional specialists used only on the hardest 5-10% of queries and typically for only a few thousand tokens. With fallback rules based on timeout, max_tokens, and usage‑based routing, the cost should stay low even with those specialists in the loop.[1]

I’m still evaluating whether I really need full LangGraph for this or whether smarter LiteLLM‑only routing would be enough, so any real‑world experience with stateful MoM graphs on Spark hardware would be hugely helpful.
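
To make the question concrete, this is the shape of graph I have in mind, written against LangGraph's StateGraph API as I understand it. Every node is stubbed, the uncertainty check is a toy heuristic, and none of this has run on the Spark yet; it is only here so people can point at specific parts and tell me what they would change:

```python
from typing import TypedDict

from langgraph.graph import END, StateGraph

class MoMState(TypedDict):
    question: str
    draft: str
    branch_outputs: list[str]
    answer: str

def local_pass(state: MoMState) -> dict:
    # First reasoning pass on the local Qwen3-80B profile via LiteLLM (stubbed).
    return {"draft": f"local draft for: {state['question']}"}

def needs_specialists(state: MoMState) -> str:
    # Toy uncertainty heuristic; in practice: self-rated confidence, retrieval coverage, etc.
    return "escalate" if len(state["question"]) > 200 else "finish"

def cloud_specialists(state: MoMState) -> dict:
    # Parallel calls to thinking / deep-research / code specialists would go here (stubbed).
    return {"branch_outputs": ["thinking branch", "deep research branch"]}

def synthesize(state: MoMState) -> dict:
    # Strongest local model merges the branches with the local RAG context (stubbed).
    merged = " | ".join([state["draft"], *state.get("branch_outputs", [])])
    return {"answer": merged}

graph = StateGraph(MoMState)
graph.add_node("local_pass", local_pass)
graph.add_node("cloud_specialists", cloud_specialists)
graph.add_node("synthesize", synthesize)
graph.set_entry_point("local_pass")
graph.add_conditional_edges("local_pass", needs_specialists,
                            {"escalate": "cloud_specialists", "finish": "synthesize"})
graph.add_edge("cloud_specialists", "synthesize")
graph.add_edge("synthesize", END)
app = graph.compile()

print(app.invoke({"question": "Short test question", "draft": "", "branch_outputs": [], "answer": ""}))
```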

9.2 Open‑Book vs Closed‑Book

Most leaderboards test models in a closed‑book setting: no internet, no private docs, no external corpus. My target setup is deliberately open‑book plus live research:[1]

  • Drop entire domain “bibles” into Qdrant: CUDA docs, PyTorch and Triton source, internal specs, and relevant arXiv papers.
  • On each query, retrieve the exact passages that matter and let the LLM reason with the book open.

For example, once the corpus is in place, I want to be able to ask:

“How do I implement a custom FlashAttention kernel for Blackwell that supports NVFP4?”

Closed‑book models, even very strong ones, may hallucinate or provide outdated guidance. My goal is for the local stack to pull kernel signatures, relevant sections of the Blackwell whitepaper, correct Triton examples, and the latest FlashAttention‑3 paper, then produce implementation‑ready code.[1]

This is why I see local RAG and MoM as the next step beyond cloud‑only usage: once the corpus is right, general benchmarks become less important than actual performance on my own data and tasks.


10. Alternative Stacks for Other Workloads

A full hybrid RAG + reranker + 80B stack is not the ideal fit for every workload. These are other configurations I think make sense:

| Workload | Recommended stack | Why it might be better than my full hybrid |
|---|---|---|
| Pure fast chat / code | Ollama + DeepSeek-R1 or Qwen2.5-32B | Lower latency, simpler, no reranker overhead. |
| Vision / multimodal | Local Qwen-VL-Max or LLaVA-Next | Native image and multimodal understanding. |
| Heavy fine-tuning | NeMo on 1–2 Sparks | Optimized for distributed training and experimentation. |
| Edge / low-power | Jetson Orin + small GGUF models | Battery-friendly, deploy at the edge. |
| Zero-GPU setups | llama.cpp on Mac M-series | No NVIDIA hardware required. |
| Multi-node scaling | 2–6 Spark cluster + tensor-parallel vLLM | Near-linear throughput; 235B-class models become realistic. |

My hope is that DGX Spark can be the “sovereign core” for heavy RAG + MoM, while lighter stacks handle edge and low‑power scenarios.


11. The Levers I’m Trying to Tune

I’m still new to LangGraph/LangChain and to designing serious Mixture‑of‑Models workflows, so any feedback—from high‑level architecture to small config tweaks—would be genuinely appreciated.

I’d especially love input on:

  • MoM orchestration patterns (starter LangGraph flows or alternatives).
  • Embedding fine‑tuning pitfalls on Spark (for Qwen3‑Embedding‑0.6B or similar).
  • Real‑world multi‑Spark scaling numbers (2–6 nodes, tensor parallel, 235B‑class models).
  • Anything in this stack that looks obviously wrong, fragile, or over‑engineered.

12. Playbooks

I’m planning to publish a five‑part playbook series on this vLLM stack as it stabilizes. Web‑search has been stable for about two weeks, and I’ll be publishing that playbook this week.

| # | Playbook Title | Core Content |
|---|---|---|
| 1 | Open WebUI + SearXNG (Private Web Search) | Enable private web search via SearXNG, enforce JSON output, surface citations directly in the UI. |
| 2 | LiteLLM Smart Routing & Dynamic Temperature | Map model names to RAG profiles, set per-task temperatures, graceful fallbacks to cloud models only on failure. |
| 3 | Static RAG with Qdrant (Local Document Indexing) | Ingest 50+ PDFs/DOCX, auto-chunk, embed with Qwen3-Embedding-0.6B, store in Qdrant, expose via hybrid retrieval. |
| 4 | Dynamic RAG – Academic Sources (arXiv, PubMed, CORE, etc.) | Build academic.py with API keys, rate-limiting, multi-hop expansion to create a deep, high-quality academic corpus. |
| 5 | Qwen3 Embedding + Reranker (Local RAG Engine) | Deploy Qwen3-Embedding-0.6B and Qwen3-Reranker-0.6B on vLLM, benchmark vs BGE-M3, optionally fine-tune on your own corpus (LoRA, SWIFT, DeepSpeed). |

Good Luck developing your own Production LLM Stack.

Mark


Thanks for sharing this; I will move it to GB10 projects.

This seems to be a nice setup.

As for the agentic framework, my suggestion would be to go with PydanticAI instead of LangChain/LangGraph. It’s much easier to debug and is not overly bloated and over-engineered.


Here is the first playbook in the series. Hopefully everything is correct; it has been 3–4 weeks since I coded this.

Great write-up. I am in a very similar boat. My end goal is to completely get away from ChatGPT or any cloud models and have everything local.

I haven’t had much luck with LiteLLM integrated with Open WebUI; it seems to bug out quite often. I am curious, though, how your smart routing and dynamic temperatures work with it. Do you have more details on that?
