Building Local + Hybrid LLMs on DGX Spark That Outperform Top Cloud Models

From “Can It Run?” to Sovereign Hybrid RAG on DGX Spark – Qwen3‑80B + vLLM + LiteLLM (Live Stack)


1. Why I’m Sharing This Stack

My DGX Spark is now running a full hybrid RAG stack with Qwen3‑80B at roughly 45 tok/s, under 150 ms end‑to‑end latency, and zero cloud on the critical path—and it already feels stronger than cloud assistants on privacy, cost, and customization. Most DGX Spark threads focus on “how do I get the model to load without OOM?”; I’m now at the point where the stack is stable, and I’m trying to turn it into a practical, privacy‑first assistant. This is my first serious attempt at a vLLM‑based hybrid RAG stack on DGX Spark, so I’m very open to suggestions, corrections, and better patterns from everyone here (including NVIDIA folks).

Right now I’m running Qwen3‑Next‑80B‑A3B at around 45 tok/s, with 115–120 GB of unified memory in use and roughly 95 % GPU utilization; Qwen3‑Embedding‑0.6B and Qwen3‑Reranker‑0.6B add under 1.5 GB VRAM overhead. My goal is to see how far I can go with a local + hybrid approach before I really need to lean on cloud LLMs. I don’t have any experience with LangChain or LangGraph yet—right now I’m just using LiteLLM as a simple router with per‑model settings—so any starter patterns or anti‑patterns for Mixture‑of‑Models orchestration would be very welcome.

1.1 Cloud vs Local Summary

| Dimension | Cloud Providers | My Local + Hybrid Stack |
|---|---|---|
| Cost | Recurring per-token spend (expensive for deep research or code) | 0.00 USD for ~75 % of queries after implementation |
| Latency | 400–2000 ms (network + queue) | 150–300 ms end-to-end on typical queries |
| Privacy / sovereignty | Prompts and documents leave the machine; compliance is hard | Zero egress (fallbacks only on explicit failure) |
| Customization | Fixed models, black-box routing | Full control of temperature, routing depth, RAG limits, corpus composition |
| Private data | Upload limits, no persistent memory | Unlimited Qdrant corpus (my PDFs, DOCX, codebases, internal notes) |
| RAG and corpus quality | Generic web only | Targeted academic/technical expansion, multi-hop citation graph |
| Open-book | Closed-book benchmarks | Permanently open-book with CUDA docs, Blackwell papers, key arXiv papers in Qdrant |
| Implementation-grade answers | Often miss exact implementation details | Retrieves exact passages and generates implementation-grade answers |

2. What I Currently Have Running

Everything sits behind pipelines:9099/v1, fronted by Open WebUI and orchestrated by pipe-fixed.py plus a few helper modules.

2.1 Live Containers and Roles

CONTAINER ID   IMAGE                                 STATUS          PORTS
f220e0ce3b2c   ghcr.io/berriai/litellm:main-latest   Up 3 h          0.0.0.0:4000->4000/tcp
41f392f6ff70   pipeline-deps:latest                  Up 33 m         0.0.0.0:9099->9099/tcp
71d157f02224   ghcr.io/open-webui/open-webui:cuda    Up 3 h (healthy) 0.0.0.0:8080->8080/tcp
ea2f7071c12a   qdrant/qdrant:v1.11.0                 Up 3 h          0.0.0.0:6333->6333/tcp
cdd82e1d2238   searxng/searxng:latest                Up 21 h         0.0.0.0:8888->8080/tcp
5466fb74b97f   caddy:2-alpine                        Up 21 h
d801572ab84d   valkey/valkey:8-alpine                Up 21 h         6379/tcp
0423861c8044   nvcr.io/nvidia/vllm:25.12.post1-py3   Up 21 h         0.0.0.0:8025->8000/tcp
bde1af729d3a   nvcr.io/nvidia/vllm:25.12.post1-py3   Up 21 h         0.0.0.0:8020->8000/tcp
5f4c61cb0d7a   nvcr.io/nvidia/vllm:25.12.post1-py3   Up 2 d          0.0.0.0:8005->8000/tcp

What each one does (today):[1]

  • openwebui – User‑facing interface (chat, uploads, model selection).
  • pipeline-deps (pipelines) – Exposes pipelines:9099/v1 and runs my Hybrid‑RAG inlet and routing logic (pipe-fixed.py + helpers).
  • litellm – Central router for all models (local and hosted), handles temperature, top_p, max_tokens, and fallbacks.
  • qdrant – Local vector store for static documents, notes, and code.
  • searxng – Meta search engine used by web.py for live web search (JSON).
  • embedding – vLLM service running Qwen3‑Embedding‑0.6B for dense embeddings.
  • reranker – vLLM service running Qwen3‑Reranker‑0.6B for cross‑encoder reranking.
  • vllm-qwen-80 – vLLM service running Qwen3‑Next‑80B‑A3B for main generation.
  • valkey – Valkey/Redis used for caching and coordination.
  • caddy – Simple reverse proxy and TLS frontend.

Resources: memory sits around 115–120 GB out of 128 GB unified; GPU utilization is about 95 %. Embedding + reranker add under 1.5 GB VRAM on top of the 80B model.[1]


3. Routing and RAG Profiles

Open WebUI sends chat requests to pipelines:9099/v1, which calls pipe-fixed.py.[1]

  • pipe-fixed.py extracts the user message, applies MODEL_CLASSIFICATION_RULES, and picks a RAG profile from MODE_CONFIG.[1]
  • These limits are applied dynamically based on the selected model name in Open WebUI.
  • The profile controls how deep the system goes into web, Qdrant, and academic retrieval and how many chunks are reranked.[1]

3.1 MODE_CONFIG (Simplified)

| Profile | weblimit | qdrantlimit | academiclimit | reranktop |
|---|---|---|---|---|
| Fast | 5 | 10 | 5 | 10 |
| Auto (Standard) | 10 | 15 | 8 | 15 |
| Expert | 100 | 25 | 15 | 25 |
| Heavy / Moe | 250 | 30 | 20 | 30 |
| Think | 75 | 20 | 12 | 20 |
| Code-Fast | 5 | 10 | 5 | 10 |
| Code-Expert | 100 | 25 | 15 | 25 |
| Code-Heavy | 250 | 30 | 20 | 30 |

Fast and Auto act as “light RAG”; Expert, Heavy, Moe, and Code‑Heavy go to full 250 web + 30 Qdrant + 20 academic depth.[1]

(Internally this is driven by simple rules such as “if model name matches Code-.* → use Code‑Heavy profile”, but I’ve kept the code out of the post for brevity.)
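For readers who want to replicate this, here is a minimal sketch of what the profile lookup could look like. The names mirror the post (MODE_CONFIG, MODEL_CLASSIFICATION_RULES), but the structure is illustrative, not my actual pipe-fixed.py:

import re

# Illustrative reconstruction of the table above; a sketch, not the real code.
MODE_CONFIG = {
    "Fast":       {"weblimit": 5,   "qdrantlimit": 10, "academiclimit": 5,  "reranktop": 10},
    "Expert":     {"weblimit": 100, "qdrantlimit": 25, "academiclimit": 15, "reranktop": 25},
    "Code-Fast":  {"weblimit": 5,   "qdrantlimit": 10, "academiclimit": 5,  "reranktop": 10},
    "Code-Heavy": {"weblimit": 250, "qdrantlimit": 30, "academiclimit": 20, "reranktop": 30},
}

# First matching pattern wins; order from most to least specific.
MODEL_CLASSIFICATION_RULES = [
    (re.compile(r"Code-Heavy", re.I), "Code-Heavy"),
    (re.compile(r"Code-",      re.I), "Code-Fast"),
    (re.compile(r"Expert",     re.I), "Expert"),
]

def pick_profile(model_name: str) -> dict:
    for pattern, profile in MODEL_CLASSIFICATION_RULES:
        if pattern.search(model_name):
            return MODE_CONFIG[profile]
    return MODE_CONFIG["Fast"]  # default to the light-RAG profile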

3.2 Parallel Retrieval

For each query, pipe-fixed.py launches three asynchronous retrievals:[1]

  • web.py → SearXNG at http://searxng:8888 (format=json, category “it” for code, “general” otherwise).
  • qdrant.py → local Qdrant, using embed_text() via Qwen3‑Embedding‑0.6B at http://embedding:8025.
  • academic.py → arXiv, PubMed, CORE, Semantic Scholar, OpenAlex (API keys, rate‑limits, multi‑hop expansion).

All results are merged into a single list of chunks for reranking.[1]
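As a rough sketch of that fan-out (the helper function names are assumptions based on the module names above, not the actual code):

import asyncio

# Hypothetical async wrappers around web.py, qdrant.py, and academic.py.
async def retrieve_all(query: str, profile: dict) -> list[dict]:
    results = await asyncio.gather(
        search_web(query, limit=profile["weblimit"]),           # SearXNG JSON
        search_qdrant(query, limit=profile["qdrantlimit"]),     # dense retrieval
        search_academic(query, limit=profile["academiclimit"]), # arXiv/PubMed/...
        return_exceptions=True,  # one failed source should not kill the query
    )
    chunks: list[dict] = []
    for r in results:
        if not isinstance(r, Exception):
            chunks.extend(r)
    return chunks  # single merged list, handed to the reranker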

3.3 Reranking and Context Injection

  • reranker.py sends the query + merged chunks to Qwen3‑Reranker‑0.6B at http://reranker:8020.[1]
  • The reranker returns a list sorted by relevance score; I keep the top‑k (10–30, controlled by reranktop) and inject those chunks into the LLM prompt context.[1]
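For reference, a minimal sketch of that call, assuming the reranker is served through vLLM's Cohere-style /v1/rerank endpoint (my reranker.py may wrap this differently):

import requests

def rerank(query: str, chunks: list[str], top_k: int) -> list[str]:
    # vLLM exposes a Cohere-compatible rerank API for cross-encoder models.
    resp = requests.post(
        "http://reranker:8020/v1/rerank",
        json={
            "model": "Qwen/Qwen3-Reranker-0.6B",
            "query": query,
            "documents": chunks,
            "top_n": top_k,
        },
        timeout=30,
    )
    resp.raise_for_status()
    # Results come back sorted by relevance_score, each with its original index.
    return [chunks[r["index"]] for r in resp.json()["results"]]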

3.4 Generation and Fallbacks

  • LiteLLM at http://litellm:4000/v1 routes the request to the chosen model with the appropriate temperature and max_tokens.[2][1]
  • The primary path is Qwen3‑Next‑80B‑A3B on vLLM (port 8005), which produces grounded answers with citations.[1]
  • Responses go back to Open WebUI, including source references.[1]

Fallbacks to cloud models are reserved for exceptional cases—for example, timeouts, max_tokens limits, or explicit “use‑cloud” profiles—and are designed to keep cloud usage to a small fraction of total queries.
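Since LiteLLM speaks the OpenAI protocol, the generation call itself is trivial; a sketch (model name, key, and the two input variables are placeholders):

from openai import OpenAI

# LiteLLM is OpenAI-compatible, so the standard client works unchanged.
client = OpenAI(base_url="http://litellm:4000/v1", api_key="simple-api-key")

top_chunks = ["chunk one ...", "chunk two ..."]  # output of the reranker step
user_query = "How do I tune ef_construct in Qdrant?"

context = "\n\n".join(top_chunks)
response = client.chat.completions.create(
    model="Expert",  # LiteLLM resolves this to the right backend + temperature
    messages=[
        {"role": "system", "content": "Answer from the context below and cite sources.\n\n" + context},
        {"role": "user", "content": user_query},
    ],
)
print(response.choices[0].message.content)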


4. Model Routing, Temperatures, and Hosted MoM Options

I treat MODEL_CLASSIFICATION_RULES, MODE_CONFIG, and litellm_config.yaml as a simple dispatcher for models and RAG profiles.

4.1 Local SuperQwen Variants

From litellm_config.yaml, my local SuperQwen profiles (served via http://vllm-qwen-80:8000/v1) look roughly like this:

  • Auto / Fast / Expert / Heavy / Moe → all backed by openai/Superqwen, with temperatures in the 0.3–0.6 range and max_tokens from 512 up to 8192.
  • Code‑Fast / Code‑Expert / Code‑Heavy → same backend with lower temperatures (0.2–0.3) and larger max_tokens for code‑heavy tasks.

Current temperature by task:[2][1]

| Profile family | Typical temperature | Max tokens (approx) | Thinking |
|---|---|---|---|
| Code-* | 0.20–0.30 | 2048–8192 | On for Expert/Heavy |
| Research / Expert | 0.45 | 6144 | On |
| Heavy / Think / Moe | 0.50 | up to 8192 | On |
| Fast | 0.30 | 512–2048 | Off |

4.2 Hosted Models I’m Experimenting With

Through LiteLLM I also have a universe of hosted models configured, mainly for future MoM experiments:[2]

  • xAI / Grok: Grok‑4, Grok‑4 Fast Reasoning, Grok‑4.1 Fast Reasoning, Code‑Grok4, Code‑Grok4‑Fast.
  • Perplexity: Perplexity Sonar, Sonar Pro, Sonar Reasoning Pro, Sonar Deep Research (return_citations: true).
  • NVIDIA hosted: nvidia_nim/meta/llama3-70b-instruct, nvidia_nim/qwen/qwen3-next-80b-a3b-thinking, GLM‑4.7‑style code models, Nemotron‑3‑Nano‑30B‑A3B (NVFP4).
  • Hugging Face router: Qwen2.5‑72B‑Instruct via router.huggingface.co.
  • DashScope (Qwen family): Qwen‑Turbo, Qwen‑Plus, Qwen‑Max, Qwen‑VL, and Qwen3 models such as Qwen3‑32B, Qwen3‑30B‑A3B, Qwen3‑235B‑A22B with enable_thinking flags.[2]

Right now I don’t have LangChain or LangGraph on top of this; it’s just LiteLLM plus routing rules. I’m especially interested in providers that expose temperature, top_p, max_tokens, and “thinking” flags cleanly so they can slot into a Mixture‑of‑Models graph later.[2][1]

I’m also very interested in memory usage and real‑world performance of hybrid MoM setups with Nemotron‑3‑Nano‑30B‑A3B (NVFP4), GLM‑4.x, and strong GPT‑class OSS models under a 120 GB unified‑memory budget. My hunch is that a single Spark can be very competitive locally, and that a 2–6 node Spark cluster would be exceptional for MoM, but I’d really like to see real numbers from people who have tried it.[1]


5. Why I Chose a Qwen3‑Aligned RAG Stack

Instead of the default Open WebUI RAG (MiniLM/SBERT + cosine), I switched to a fully Qwen3‑aligned stack: Qwen3‑Embedding‑0.6B + Qwen3‑Reranker‑0.6B + Qwen3‑Next‑80B‑A3B.[1]

5.1 RAG Stack Comparison

| Feature | Open WebUI Default RAG | My Qwen3 Stack (Embedding + Reranker + LLM) |
|---|---|---|
| Embedding model | Generic (all-MiniLM, Sentence-BERT) | Qwen3-Embedding-0.6B – multilingual, domain-strong, family-aligned |
| Retrieval method | Dense cosine similarity only | Dense + cross-encoder reranking (Qwen3-Reranker-0.6B) |
| Relevance lift | Baseline | Roughly 15–30 % better on technical/multilingual queries in practice |
| Model coherence | Mixed models → inconsistency | Same Qwen3 family → embedding/reranker/LLM synergy |
| Latency / memory | Higher overhead if using cloud embeddings | <1.5 GB VRAM total for embed + reranker, sub-100 ms local RAG |
| Customization | Limited | Full vLLM tuning, LoRA fine-tuning possible |
| Accuracy on my data | Good for general text | Superior after domain-specific corpus + tuning |

My working assumption is that keeping everything in the same model family reduces semantic drift between embedding, reranking, and generation; I’d love to see counter‑examples or benchmarks (e.g., BGE‑M3 vs Qwen3‑Reranker on DGX Spark) from others.


6. Advantages Table (Cloud vs DGX Spark Hybrid)

These rows are meant as a qualitative comparison, not precise pricing.

6.1 Cloud LLMs vs My DGX Spark Hybrid

| Feature | OpenAI GPT-4-class | Anthropic Claude-class | Google Gemini-class | Perplexity Sonar Pro | xAI Grok-4 Heavy | My DGX Spark Hybrid |
|---|---|---|---|---|---|---|
| Cost | Per-token pricing | Per-token pricing | Per-token pricing | Per-token or Pro tier | Roughly fixed monthly | 0.00 USD per query after hardware |
| Latency | About 800–2000 ms | About 600–1500 ms | About 500–1200 ms | About 400–900 ms | About 300–800 ms | <500 ms end-to-end |
| Data privacy | Prompts go to OpenAI | Prompts go to Anthropic | Prompts go to Google | Prompts go to Perplexity | Prompts go to xAI | Zero egress by default |
| Private docs | File caps, limited memory | Limited | Varies, often small caps | No persistent private KB | None | Unlimited Qdrant corpus |
| Academic live search | No | No | Limited | No | No | arXiv, PubMed, CORE, Semantic Scholar, etc. via academic.py |
| Reranking | None / proprietary | None / proprietary | None / proprietary | Proprietary | None | Qwen3-Reranker-0.6B two-stage RAG |
| Temperature / routing | Mostly fixed | Limited | Limited | Black-box | Limited | Full per-profile control |
| Rate limits | Hard limits | Hard limits | Hard / RPM caps | Hard limits | Hard limits | None beyond my hardware |

Cloud limitations in one sentence each (from my perspective):

  • OpenAI / Anthropic / Google: I pay for every token, my data leaves the box, private files are constrained, and I don’t get full per‑task temperature and routing control.
  • Perplexity: dynamic RAG and academic routing are excellent, but I’m locked into their models, pricing, and black‑box decisions.
  • Grok‑4 Heavy: a flat monthly fee for “uncensored” reasoning that I can get close to, or match, on my own hardware for near‑zero marginal cost.

Perplexity’s “secret sauce” is dynamic RAG + intelligent model routing + academic search; my goal with this stack is to do the same thing locally, but with my own reranker, my own corpus, and my own fallback logic.


7. Corpus Building and Fine‑Tuning Plans

7.1 Network‑Effect Corpus Building

For domains like machine learning and AI, I’m trying a network‑effect corpus strategy:

  1. Use academic.py to fetch around 20 key LLM/ML papers (arXiv, Semantic Scholar).
  2. Expand by authors, co‑authors, references, and citations 5–10 hops out.
  3. Chunk and embed everything with Qwen3‑Embedding‑0.6B and store it in Qdrant.

This already feels more like a domain “bible” than a generic web crawl.
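A sketch of the expansion step (step 2), using the public Semantic Scholar Graph API; the endpoint and fields are from their docs, but API keys and rate-limit handling are left out for brevity:

import requests

S2_API = "https://api.semanticscholar.org/graph/v1"

def expand_references(paper_id: str, hops: int = 2, per_hop: int = 20) -> set[str]:
    """Breadth-first walk over the citation graph, a few hops out."""
    frontier, seen = {paper_id}, {paper_id}
    for _ in range(hops):
        next_frontier: set[str] = set()
        for pid in frontier:
            r = requests.get(
                f"{S2_API}/paper/{pid}/references",
                params={"fields": "paperId,title", "limit": per_hop},
                timeout=30,
            )
            r.raise_for_status()
            for item in r.json().get("data", []):
                ref = (item.get("citedPaper") or {}).get("paperId")
                if ref and ref not in seen:
                    seen.add(ref)
                    next_frontier.add(ref)
        frontier = next_frontier
    return seen  # paper IDs to fetch, chunk, embed, and store in Qdrant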

7.2 Fine‑Tuning Qwen3‑Embedding‑0.6B (Planned)

I haven’t fine‑tuned yet; my plan is:

  • Build triplets (query, positive_chunk, negative_chunk) from Qdrant and historical queries.
  • Add domain prompts such as “Represent this GPU programming document for semantic search…”.
  • Use LoRA with SWIFT/DeepSpeed or SentenceTransformers for contrastive training.
  • Evaluate with MRR@10, NDCG, and RAGAS, then deploy by updating start-embedder-reranker.yaml.
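A rough sketch of the contrastive step, assuming SentenceTransformers' triplet API works with the Qwen3 embedding checkpoint (hyperparameters are placeholders, not Spark-tested):

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Triplets would be mined from Qdrant hits + historical queries; these are dummies.
positive_chunk = "CUDA streams let independent kernels and copies overlap..."
negative_chunk = "Batch normalization stabilizes training by normalizing activations..."

train_examples = [
    InputExample(texts=[
        "Represent this GPU programming document for semantic search: overlapping kernels with CUDA streams",
        positive_chunk,
        negative_chunk,
    ]),
    # ... thousands more mined triplets
]

loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.TripletLoss(model)  # MultipleNegativesRankingLoss is the (query, positive) alternative

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("qwen3-embedding-0.6b-domain")  # then point start-embedder-reranker.yaml at this path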

Any practical advice on hyperparameters, batch sizes, or DGX‑specific pitfalls would be very welcome.


8. Prompt Flow Diagram

The prompt flow diagram (not reproduced here) matches the routing and RAG flow described in Section 3: Open WebUI → pipelines → parallel retrieval → rerank → LiteLLM → vLLM.

Key advantage: the explicit reranking step eliminates noisy chunks that dense retrieval alone would include.

9. Mixture of Models (MoM) and Open‑Book RAG – Where I Want to Go Next

I’m not using LangChain or LangGraph yet; right now everything is LiteLLM plus my own routing logic. My next goal is to turn this into a true Mixture of Models graph, where local and hosted models work together instead of competing.[2][1]

9.1 Planned MoM Flow (LangChain + LangGraph)

The flow I’d like to build (feedback welcome):[1]

  • A complex research query arrives → LiteLLM routes first to a local MoE‑style Qwen3‑80B profile for an initial reasoning pass.
  • If the graph detects high uncertainty or a very demanding task, it spawns parallel branches:
    • Branch A: a “Think” model such as Qwen3‑Next‑80B‑thinking via NVIDIA NIM for deeper chain‑of‑thought.
    • Branch B: Perplexity Sonar Deep Research for fresh web citations.
    • Branch C: Code‑oriented models such as Code‑Grok4, GLM‑4.x, or Qwen3‑235B when there is heavy code or math.
  • Outputs from these branches are merged, reranked, and synthesized back into a single answer by the strongest local model, using my own RAG context.

The idea is that >80 % of the compute still happens locally, and cloud models act as optional specialists used only on the hardest 5-10% of queries and typically for only a few thousand tokens. With fallback rules based on timeout, max_tokens, and usage‑based routing, the cost should stay low even with those specialists in the loop.[1]

I’m still evaluating whether I really need full LangGraph for this or whether smarter LiteLLM‑only routing would be enough, so any real‑world experience with stateful MoM graphs on Spark hardware would be hugely helpful.
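To make the LiteLLM-only variant concrete, here is a plain-asyncio sketch of the branching logic; ask() and is_uncertain() are hypothetical helpers (a LiteLLM call and an uncertainty heuristic), and the model aliases are illustrative:

import asyncio

async def mom_answer(query: str, rag_context: str) -> str:
    # First pass always stays local (Qwen3-80B via LiteLLM).
    draft = await ask("Heavy", query, rag_context)
    if not is_uncertain(draft):  # e.g. hedging phrases, short answers, low logprobs
        return draft
    # Hard query: spawn the specialist branches in parallel.
    branches = await asyncio.gather(
        ask("qwen3-next-80b-a3b-thinking", query, rag_context),  # Branch A: deep CoT via NIM
        ask("sonar-deep-research", query, rag_context),          # Branch B: fresh web citations
        return_exceptions=True,
    )
    evidence = "\n\n".join(str(b) for b in branches if not isinstance(b, Exception))
    # Synthesis happens locally again, grounded in my own RAG context.
    return await ask("Heavy", f"Synthesize one answer from:\n{evidence}\n\nQuestion: {query}", rag_context)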

9.2 Open‑Book vs Closed‑Book

Most leaderboards test models in a closed‑book setting: no internet, no private docs, no external corpus. My target setup is deliberately open‑book plus live research:[1]

  • Drop entire domain “bibles” into Qdrant: CUDA docs, PyTorch and Triton source, internal specs, and relevant arXiv papers.
  • On each query, retrieve the exact passages that matter and let the LLM reason with the book open.

For example, once the corpus is in place, I want to be able to ask:

“How do I implement a custom FlashAttention kernel for Blackwell that supports NVFP4?”

Closed‑book models, even very strong ones, may hallucinate or provide outdated guidance. My goal is for the local stack to pull kernel signatures, relevant sections of the Blackwell whitepaper, correct Triton examples, and the latest FlashAttention‑3 paper, then produce implementation‑ready code.[1]

This is why I see local RAG and MoM as the next step beyond cloud‑only usage: once the corpus is right, general benchmarks become less important than actual performance on my own data and tasks.


10. Alternative Stacks for Other Workloads

A full hybrid RAG + reranker + 80B stack is not the ideal fit for every workload. These are other configurations I think make sense:

| Workload | Recommended stack | Why it might be better than my full hybrid |
|---|---|---|
| Pure fast chat / code | Ollama + DeepSeek-R1 or Qwen2.5-32B | Lower latency, simpler, no reranker overhead |
| Vision / multimodal | Local Qwen-VL-Max or LLaVA-Next | Native image and multimodal understanding |
| Heavy fine-tuning | NeMo on 1–2 Sparks | Optimized for distributed training and experimentation |
| Edge / low-power | Jetson Orin + small GGUF models | Battery-friendly, deploy at the edge |
| Zero-GPU setups | llama.cpp on Mac M-series | No NVIDIA hardware required |
| Multi-node scaling | 2–6 Spark cluster + tensor-parallel vLLM | Near-linear throughput; 235B-class models become realistic |

My hope is that DGX Spark can be the “sovereign core” for heavy RAG + MoM, while lighter stacks handle edge and low‑power scenarios.


11. The Levers I’m Trying to Tune

I’m still new to LangGraph/LangChain and to designing serious Mixture‑of‑Models workflows, so any feedback—from high‑level architecture to small config tweaks—would be genuinely appreciated.

I’d especially love input on:

  • MoM orchestration patterns (starter LangGraph flows or alternatives).
  • Embedding fine‑tuning pitfalls on Spark (for Qwen3‑Embedding‑0.6B or similar).
  • Real‑world multi‑Spark scaling numbers (2–6 nodes, tensor parallel, 235B‑class models).
  • Anything in this stack that looks obviously wrong, fragile, or over‑engineered.

12. Playbooks

I’m planning to publish a five‑part playbook series on this vLLM stack as it stabilizes. Web‑search has been stable for about two weeks, and I’ll be publishing that playbook this week.

| # | Playbook Title | Core Content |
|---|---|---|
| 1 | Open WebUI + SearXNG (Private Web Search) | Enable private web search via SearXNG, enforce JSON output, surface citations directly in the UI. |
| 2 | LiteLLM Smart Routing & Dynamic Temperature | Map model names to RAG profiles, set per-task temperatures, graceful fallbacks to cloud models only on failure. |
| 3 | Static RAG with Qdrant (Local Document Indexing) | Ingest 50+ PDFs/DOCX, auto-chunk, embed with Qwen3-Embedding-0.6B, store in Qdrant, expose via hybrid retrieval. |
| 4 | Dynamic RAG – Academic Sources (arXiv, PubMed, CORE, etc.) | Build academic.py with API keys, rate-limiting, and multi-hop expansion to create a deep, high-quality academic corpus. |
| 5 | Qwen3 Embedding + Reranker (Local RAG Engine) | Deploy Qwen3-Embedding-0.6B and Qwen3-Reranker-0.6B on vLLM, benchmark vs BGE-M3, optionally fine-tune on your own corpus (LoRA, SWIFT, DeepSpeed). |

Good luck developing your own production LLM stack.

Mark

20 Likes

Thanks for sharing this, I will move this to GB10 projects

2 Likes

This seems to be a nice setup.

As for the agentic framework, my suggestion would be to go with PydanticAI instead of LangChain/LangGraph. It’s much easier to debug and is not overly bloated and over-engineered.

2 Likes

Here is the first playbook in the series. Hopefully everything is correct; it has been 3–4 weeks since I coded this.

2 Likes

Great write-up. I am in a very similar boat. My end goal is to get away from ChatGPT and all cloud models completely and have everything local.

I haven’t had much luck with LiteLLM integrated with Open WebUI; it seems to bug out quite often. I am curious, though, how your smart routing and dynamic temperatures work with it. Do you have more details on that?

1 Like

Playbook 2 – Turning Spark Arena Models into a Unified Multi‑Mode Assistant with LiteLLM + Open WebUI


Since my original post, things have evolved a lot. Spark Arena and the GB10 community have gone from “here are some cool models” to a full recipe + leaderboard ecosystem that makes it trivial to stand up Nemotron‑3‑Nano NVFP4, GLM‑4.7‑Flash‑AWQ, Qwen3‑Coder‑Next‑FP8, Qwen‑Instruct‑80B and more on a single or dual DGX Spark. The net effect is a quantum leap in developer velocity: instead of burning hours on flags and container wiring, we can treat these as reliable building blocks and focus on UX, routing, and product behavior.

Huge thanks to Raphael Amorim (@raphael.amorim), @eugr, and everyone working on spark‑vllm‑docker, llama‑benchy, and Spark Arena itself — Playbook 2 is entirely about what you can build on top of their work, not yet another way to launch a model.

What follows is how, on a single Spark (usually with one primary Spark Arena model up at a time), you can use LiteLLM + Open WebUI to turn those deployments into a clean, multi-mode assistant stack with natural model names like NeMo Expert, GLM Expert, Qwen Coder Fast, Phi Fast, Grok-4, etc. LiteLLM resolves everything; Open WebUI just shows these names in the dropdown.


1. Architecture: Open WebUI → LiteLLM → Spark Arena + hosted

The architecture is intentionally simple:

  • Open WebUI

    • Talks only to LiteLLM via an OpenAI‑compatible endpoint.

    • In start_core_services.yml:

      environment:
        OPENAI_API_BASE_URL: http://litellm:4000/v1
        OPENAI_API_KEY: simple-api-key
      
  • LiteLLM proxy/router

    • Runs as a small service next to Open WebUI:

      services:
        litellm:
          image: ghcr.io/berriai/litellm:main-latest
          container_name: litellm
          restart: always
          command: --config /app/config.yaml --num-workers 4
          ports:
            - "4000:4000"
          volumes:
            - ./litellm_config.yaml:/app/config.yaml:ro
          env_file:
            - /home/mark/vllm/model-configs/env.env
          networks: [llm-net]
      
    • Reads a single litellm_config.yaml that defines all local and hosted models plus routing rules.

  • Spark Arena vLLM backends + hosted APIs

    • One primary big chat model via Spark Arena (Nemotron‑3‑Nano or GLM‑4.7 or Qwen‑Instruct) at a time, plus Qwen‑Coder and a small router like Phi‑4‑mini.
    • Hosted backups: Grok‑3/4, Qwen‑Max/Qwen3 cloud, GLM‑4.7 cloud, NVIDIA Llama‑3‑70B, etc.

Open WebUI only sees model names from LiteLLM – e.g., NeMo Expert, GLM Expert, Qwen Coder Fast, Phi Fast, Grok-4 – and LiteLLM resolves them to whichever backend is actually running.


2. LiteLLM model_list: real model names, real modes

Your recipes from Spark Arena already define the modes you need, using concrete model names.

Local Nemotron‑Nano profiles

# ── LOCAL NEMOTRON NANO (optimized for speed) ──
- model_name: "Base" # Native (thinking ON)
  litellm_params:
    model: openai/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
    api_base: http://host.docker.internal:8000/v1
    api_key: os.environ/NEMOTRON_API_KEY
    temperature: 0.4
    top_p: 0.9
    max_tokens: 8192
    extra_body: {}
    chat_template_kwargs:
      enable_thinking: true
- model_name: "Fast"
  litellm_params:
    model: openai/microsoft/Phi-4-mini-instruct
    api_base: http://phi-judge:8000/v1
    api_key: simple-api-key
    temperature: 0.2
    top_p: 0.9
    max_tokens: 8192
    timeout: 30
    initial_messages:
      - role: system
        content: |
          You are a fast, local judge / evaluator / classifier.
    extra_body: {}
- model_name: "Expert" # Balanced reasoning
  litellm_params:
    model: openai/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
    api_base: http://host.docker.internal:8000/v1
    api_key: os.environ/NEMOTRON_API_KEY
    temperature: 0.6
    top_p: 0.98
    max_tokens: 16384
    timeout: 300
    extra_body: {}
    chat_template_kwargs:
      enable_thinking: true
      effort: "medium"
- model_name: "Heavy"
  litellm_params:
    model: openai/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
    api_base: http://host.docker.internal:8000/v1
    api_key: os.environ/NEMOTRON_API_KEY
    temperature: 1.0
    top_p: 1.0
    max_tokens: 32768
    timeout: 600
    stream_timeout: 240
    extra_body: {}
    chat_template_kwargs:
      enable_thinking: true
- model_name: "Code"
  litellm_params:
    model: openai/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
    api_base: http://host.docker.internal:8000/v1
    api_key: os.environ/NEMOTRON_API_KEY
    temperature: 0.2
    top_p: 0.95
    max_tokens: 16384
    timeout: 300
    stream_timeout: 180
    extra_body: {}
    chat_template_kwargs:
      mode: code
      enable_thinking: true
      clear_thinking: false

GPT‑OSS‑120B profiles

# ===================GPT OSS 120B ==========================================
- model_name: "Fast"
  description: Ultra-fast classifier / router / moderator – short & strict
  litellm_params:
    model: openai/gpt-oss-120b
    api_base: http://host.docker.internal:8010/v1
    api_key: sk-1234
    temperature: 0.2
    top_p: 0.9
    max_tokens: 2048
    timeout: 12
    initial_messages:
      - role: system
        content: |
          You are a fast, local classifier/router/moderator.
- model_name: "Expert"
  description: Balanced reasoning – good default quality/speed trade-off
  litellm_params:
    model: openai/gpt-oss-120b
    api_base: http://host.docker.internal:8010/v1
    api_key: sk-1234
    temperature: 0.55
    top_p: 0.95
    max_tokens: 32768
    timeout: 400
    extra_body: {}
    chat_template_kwargs:
      enable_thinking: true
      effort: medium
- model_name: "Heavy"
  description: Maximum reasoning depth & longest allowed outputs
  litellm_params:
    model: openai/gpt-oss-120b
    api_base: http://host.docker.internal:8010/v1
    api_key: sk-1234
    temperature: 0.75
    top_p: 0.98
    max_tokens: 65536
    timeout: 1200
    stream_timeout: 600
    extra_body: {}
    chat_template_kwargs:
      enable_thinking: true
      effort: maximum
- model_name: "Code"
  description: Programming & technical tasks – prefers clean output
  litellm_params:
    model: openai/gpt-oss-120b
    api_base: http://host.docker.internal:8010/v1
    api_key: sk-1234
    temperature: 0.15
    top_p: 0.85
    max_tokens: 32768
    timeout: 600
    extra_body: {}
    chat_template_kwargs:
      mode: code
      enable_thinking: true
      clear_thinking: true

2.5. How “modes” map to temperature / length / effort

The different “modes” (Fast, Expert, Heavy, Code) are just concrete per‑model profiles with carefully chosen parameters in litellm_config.yaml:

  • Temperature

    • Fast / classifier modes (Fast) use low temperature (around 0.15–0.2) and tighter top_p for short, stable, deterministic outputs (routing, judging, quick edits).
    • Expert modes (Expert) sit in the 0.4–0.6 range with slightly higher top_p for balanced reasoning and creativity.
    • Heavy modes (Heavy) use higher temperature and relaxed top_p so they can explore more options when you explicitly want deep thinking or long, open‑ended answers.
  • max_tokens / timeouts

    • Fast modes cap max_tokens low and use short timeouts to enforce quick responses and trigger fallbacks when a request clearly needs more room.
    • Expert modes increase max_tokens and timeouts to comfortably handle most daily chat and analysis.
    • Heavy / Max modes increase max_tokens aggressively and extend timeout / stream_timeout to cover long documents, full files, and multi‑step reasoning.
  • effort / thinking flags

    • Nemotron and GPT‑OSS profiles use enable_thinking: true and effort (medium, maximum) where supported to hint how hard to “think” internally before answering.
    • Code profiles set mode: code (and sometimes clear_thinking) to bias toward clean code blocks and minimal chatter.

When you pick Fast vs Expert vs Heavy vs Code, you’re really choosing a bundle of temperature + length + timeout + prompt discipline that you’ve pre-curated for that model’s sweet spot, based on Hugging Face and Unsloth model cards, etc.


3. Routing and fallbacks (concrete names, Grok‑4 fallback)

The final piece is the routing_strategy and router_settings. Using the same mode names ("Fast", "Expert") for GLM, Nemotron Nano, etc., with a simple, explicit fallback chain ending at Grok-4, results in a clean UI and model dropdown menu.

# ── ROUTING & SETTINGS ───────────────────────────────────────────────────────
routing_strategy: usage-based-routing
router_settings:
  max_tokens_threshold: 4096
  fallback_on_max_tokens: true
  retry_on_failure: true
  num_retries: 2
  timeout: 60

# Fallbacks – each entry maps a primary model to the ordered list of models
# to try when it fails or hits its token limit (LiteLLM expects a list of
# {primary: [fallbacks]} mappings; a model should not fall back to itself).
fallbacks:
  - "Fast": ["Expert", "Heavy"]
  - "Expert": ["Heavy"]
  - "Heavy": ["Grok-4"]

# Context‑window fallback mappings – when the primary model’s context window
# is exhausted, fall back to a model with a larger window.
context_window_fallbacks:
  - "Fast": ["Expert", "Heavy"]
  - "Expert": ["Heavy"]
  - "Heavy": ["Grok-4"]

You keep one version of this block active at a time, depending on which Spark Arena recipe is your primary. Either way:

  • Fast classifiers escalate into their corresponding Expert/Heavy modes and then Grok-4.
  • The primary local assistant (Expert) escalates to its Heavy profile and then Grok-4.
  • Code‑oriented models fall back to Qwen‑Coder profiles and, if needed, Grok-4.

No generic labels; everything is grounded in real model_name strings from your config, with Grok-4 as the single hosted backstop.


4. Phi Mini in the stack – the fast tier

Phi Fast runs Microsoft’s Phi‑4‑mini‑instruct as a dedicated small, fast tier, wired as its own vLLM instance on phi-judge:8000. Large quantized models like Nemotron‑3‑Nano NVFP4, GLM‑4.7‑Flash‑AWQ, and Qwen3‑Coder‑Next‑FP8 typically use less than 62 % of DGX Spark’s VRAM in their Spark Arena recipes, which leaves comfortable headroom to keep Phi‑4‑mini loaded at the same time, together with Qdrant, the embedding model, and the reranking model.

This gives you:

  • Sub‑second, ultra‑cheap responses for classification, routing, moderation, and quick Q&A.
  • A local‑only Fast tier: Fast is always tried first for short/simple tasks, and only escalates to Expert/Heavy/Grok-4 when it hits timeouts, token limits, or context constraints via the routing rules above.

You get snappy, resilient behavior without complicating Open WebUI at all; it just exposes Fast as another model, and LiteLLM handles the routing.


5. What this gives you in practice

On a single DGX Spark:

  • Spark Arena recipes give you vLLM deployments for Nemotron, GLM, Qwen‑Coder, Qwen‑Instruct.
  • LiteLLM wraps those (plus small locals like Phi‑4‑mini and hosted models like Grok‑4, Qwen‑Max/Qwen3, Llama‑3‑70B) into a single OpenAI‑compatible endpoint with per‑model routing and fallbacks using real model names.
  • Open WebUI talks only to LiteLLM and simply lists Fast, Expert, Heavy, Code, etc. — nothing abstract, no alias layer.

The result is a hosted‑grade, multi‑model assistant experience with simple, concrete model choices and robust failover, running entirely on your hardware, with Spark Arena doing the heavy lifting underneath.

Auto mode will be added at a future date to route automatically across these basic modes, as well as MoE and compound cloud/local MoE modes.

@raphael.amorim, @eugr, thank you for providing the community with Spark Arena, the recipes, etc.; it is great having a foundation to build on! There is still more work to do here on integrating Modes and an Auto Mode; let’s think about developing a community front end for the models in Spark Arena to provide a complete solution for Spark users. I do like the Open WebUI front end, as it provides a lot of flexibility for customization, and I really like Qdrant as a backend vector database for Open WebUI. The backend with RAG and pipelines is very simple and works well once you work out how to code it, but it is very hard to get going the first time when you are trying to configure everything. I will post more on RAG and building a RAG corpus at a future date.

Many thanks,

Mark

10 Likes

Thank you so much for these detailed and extremely useful guides! You have greatly helped this newbie along the path to Spark’s greatness. Much appreciated and many kudos.

1 Like

Thanks for the kind words and your contributions, Mark.
It’s very rewarding to see the community growing and building together with lots of creative ideas. There’s more to come; other folks in the community are reaching out and joining forces so we can build a strong, collaborative, and coherent ecosystem we all benefit from, without stress and burnout.

1 Like

Nice! Now if you add llama-swap to this setup, then you’ll be able to launch/switch models on demand. I run a similar stack - OpenWebUI → LiteLLM → llama-swap → vLLM (via our community docker/recipes)/llama.cpp, although since I’m constantly testing new models on Spark, llama-swap is temporarily out of service.

I like the idea of “modes” and more sophisticated request routing.

I use fallbacks in my setup, but they are to the same model on a different machine when the primary is not available (and for some models, to OpenRouter/Anthropic/OpenAI).

1 Like

Optimizing Multi-Model Workflows in Open WebUI on DGX Spark for Enhanced Performance

In developing AI pipelines, I have implemented sophisticated routing mechanisms, LiteLLM configurations, and fallback strategies. However, a subtle inefficiency persisted in my personal workflow when using Open WebUI with local models hosted on NVIDIA DGX Spark.

Previously, when evaluating responses from a local model, I would frequently compare them against outputs from external models such as Grok, Perplexity, or Claude. This required manually copying prompts, switching browser tabs, and executing queries separately—a process that disrupted focus and efficiency multiple times daily.

I have since discovered that Open WebUI natively addresses this through its multi-model capabilities, streamlining the comparison process.

Key Feature: Multi-Model Selection

Open WebUI enables the selection of multiple models directly from the dropdown menu, supporting both local and hosted options. Look at the model selector and you will see a “+” sign next to the currently selected model. Just click “+” and add a new model or model mode. After you have done this, you will see multiple icons, one for each model.

Upon submission, the prompt is dispatched in parallel to all selected models, with responses streaming back in real-time and displayed in adjacent panels or tabs for direct comparison.

Integrated Response Merging

For tasks requiring a unified response rather than individual evaluations, Open WebUI now includes native merging functionality for multi-model executions. If you enable the merge option after selecting models, the system automatically synthesizes a cohesive output from the various responses, without the need for custom aggregation scripts.

This feature operates under the hood, delivering a single, integrated reply that combines the strengths of local model depth with cloud model versatility.

How to See the Merged Response

It is not obvious how you merge responses. After all models have finished their answers, a small icon (three dots connected by lines) appears on the right under the response. Just click this icon and you will get the merged response.

Impact on Workflow

Formerly, I utilized Open WebUI primarily as an interface for local models, relying on separate tabs for cloud interactions. It now serves as a centralized platform, where DGX Spark-hosted models and external APIs like Grok, Perplexity, and Claude are accessible via simple checkboxes.

I select the appropriate model combination based on the task, activate merging as needed, and proceed efficiently. This approach minimizes context switching and manual integration efforts.

If you are an Open WebUI user and have not used this function or were unaware of it, give me a like on this post.

Big shout-out to the Open WebUI UI/UX developers for this function.

Happy Model Merging!

Mark


2 Likes

Nice setup. What do you use to measure the accuracy of your system? Any issues with hallucinations?

So far I’ve only had a play with Ragas.


Re: Hallucination testing — phased approach with academic grounding

Thanks again for the Ragas pointer — hadn’t come across it before, so that’s now firmly on the list to dig into properly.

The initial focus has been on academic STEM research. I have self‑hosted the full arXiv dataset — approximately 3 million papers — split across 10 Qdrant collections by subject area, and am getting good retrieval results from a hybrid dense + sparse setup. On top of that I have built 15 Open WebUI tools to query external academic databases in parallel, covering Semantic Scholar, OpenAlex, CORE, Connected Papers, Europe PMC, PubMed, Perplexity Academic, Dimensions/Lens, ResearchGate/Academia, ArXiv Recent Alerts, AI2 Asta, Google Gemini Academic, a research LLM, and a consolidated academic paper and metadata fetcher. The intent is that any claim the model makes can be cross‑referenced against multiple authoritative sources simultaneously — for STEM research this gives you a verifiable ground truth that’s far more reliable than general web search for hallucination testing.

For the broader evaluation layer the approach is explicitly two‑phased. The primary judge layer is built around free‑tier models on OpenRouter: there are some genuinely strong models available at zero cost, and the goal is to extract meaningful consistency and agreement signals at scale without touching paid APIs. Only if that proves insufficient for a given use case do we escalate to paid frontier models — Claude, DeepSeek, Grok, Perplexity — as high‑fidelity judges that can provide more nuanced grading and rationale.

Some of the Spark Arena local models running on the DGX Spark are also emerging as strong judge candidates in their own right. Nemotron Nano supports a native “thinking mode” that surfaces its full reasoning chain rather than just a verdict, which makes disagreements between judge and task model much more actionable. GLM‑4.7 Flash offers similar structured reasoning at a much lower GPU cost, making it a good lightweight local judge for higher‑volume checks. That means the judge layer can blend capable local models (with visible reasoning and zero API cost) and cloud models, giving multiple independent perspectives on the same answer.

LangChain handles the orchestration logic and routing between these judge tiers, and LangFlow provides a visual GUI to iterate on and tune the evaluation pipelines — prompts, routing logic, thresholds, and escalation rules — without constantly rewriting code. Check out NVIDIA’s excellent guide on using LangFlow on RTX.

Beyond arXiv, I have already started extending Qdrant with grounded technical corpora: official documentation across Python, Docker, Debian, Linux, JavaScript, AppleScript, and Swift/Xcode stacks, alongside core textbooks and curated coding datasets from Hugging Face. The goal is that implementation and coding answers are evaluated against authoritative, versioned references rather than just whatever happens to be in the pretraining mix. If a model’s answer contradicts the official docs or a known benchmark dataset, that becomes a clear, automated signal rather than a subjective judgment call.

All of this is still very much under construction — there’s a fair bit of plumbing in progress to connect the retrieval, judging, and orchestration layers cleanly. Once it’s end‑to‑end tested and producing stable metrics, I’ll post a proper update with pipelines, tools, LangFlow, etc., and concrete results.

Mark

2 Likes

Thank you so much for sharing!

It’s going to take me a bit to go over, but I wanted to say thanks.

I just bought a Spark in Feb 2026. This is my first post. This is excellent. Thank you so much for sharing your knowledge here.

1 Like

Thanks for sharing your experience Mark :)

I recently set up my own RAG for OWUI and went a different route (different needs), so I have a couple of questions:

  • What are you using as the pre-processor?
    That would influence the quality of the content you send to the embedder. I went with Docling, but have to avoid OCR as it eats too much VRAM
  • How do you check the output of the document processing?
    OWUI shows you the markdown output right after it’s done if you click on the file, so you can evaluate how it went.
  • What happens with pictures in the documents?
    I’m using a multimodal model (Qwen3.5) as my main LLM so that I can get descriptions.
  • How do you select specific collections when asking questions?
    In OWUI, collections or files can be invoked when needing to have them in the context

Thank you! :)

1 Like

@AoE, thanks again for the thoughtful questions—you’re spot on that preprocessing and collection selection are where most RAG quality lives. My stack is still evolving, but here’s the current state and my next priorities.

1. Arxiv + domain layer structure
My initial focus has been on self-hosting the full librarian_bots/arxiv_metadata_snapshot (~2.96M papers, JSONL, ~4.8 GB) with IDs, titles, authors, abstracts, subjects, DOIs, and journal refs.[1][2]
Instead of a single huge collection, I split it into 14 subject-area collections in Qdrant, roughly following arXiv categories:

  • arxiv-cs-ml-ai — ML/AI/neural nets
  • arxiv-cs-systems-theory — systems, complexity, algorithms
  • arxiv-math-pure, arxiv-math-applied, arxiv-math-phys
  • arxiv-stat-eess — stats, signal/electronic systems
  • arxiv-quantph-grqc — quantum + GR
  • arxiv-hep — high-energy physics
  • arxiv-condmat — condensed matter
  • arxiv-astro — astrophysics
  • arxiv-nucl-nlin-physother — nuclear, nonlinear, misc physics
  • arxiv-qbio-qfin-econ — q-bio, q-fin, econ
  • arxiv-misc — catch-all

Plus curated domain collections (Attention_Mechanisms, Dropout_Techniques, Batch_Normalization, Residual_Networks, Transformer_Architectures, Optimization_Methods, etc.) for targeted retrieval over broad search.[3][1]

Scope note: this layer is metadata + abstracts + citation structure only—no full PDFs yet. Arxiv metadata is surprisingly powerful for idea tracing and citation graphs; full-text PDFs are a parallel (trickier) workstream.[2][1]

2. Preprocessing: Docling vs current stack
I’ve reviewed Docling (including VLM-enriched variants like SmolDocling). On DGX Spark, full OCR + VLM can be heavy for mostly clean academic PDFs.[4][5]
Current pipeline:

  • Primary: PyMuPDF (fitz) — fast, reliable on standard layouts
  • Fallback: unstructured for tricky multi-column/tables

Big limitation: figures/diagrams are dropped or become garbage—a huge loss for STEM (architecture diagrams, plots).

Fix in progress: multimodal ingestion pass

  • Render page → image
  • Quantized VLM (likely Qwen2-VL or Pixtral) → generate descriptions/captions
  • Store descriptions in Qdrant payload alongside text chunks (searchable/embeddable)

All offline at ingestion → zero query-time cost. Not live yet.
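A sketch of the render-and-caption pass, using PyMuPDF for the rendering; caption_image() stands in for the quantized VLM call and is hypothetical:

import fitz  # PyMuPDF

def caption_pages(pdf_path: str) -> list[dict]:
    """Offline pass: render each page and ask a VLM to describe the figures."""
    records = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=150)      # page -> raster image
            png_bytes = pix.tobytes("png")
            caption = caption_image(png_bytes)  # hypothetical Qwen2-VL/Pixtral wrapper
            records.append({"page": page.number, "figure_caption": caption})
    return records  # written into the Qdrant payload next to the text chunks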

3. Validating processing
I ignore Open WebUI file preview. Ingestion writes cleaned text + metadata (pages, future figure captions) to a processing-audit JSONL log. I sample-check for encoding breaks, layout concatenation errors, etc. Once multimodal is on, VLM captions go into Qdrant payloads for direct querying (e.g., “find the Transformer diagram mention”).

4. Pictures: current vs roadmap
Current: ignored. Retrieval falls back to caption text at best; visual content is lost.
Roadmap: persistent VLM captions embedded/stored per figure/page. That enables queries like “Transformer architecture diagram from Vaswani et al.” to retrieve the description and map to the PDF/asset.

Your suggestion of Qwen2.5-VL on-the-fly at inference for ad-hoc uploads is excellent (especially for interactive reading). For my persistent corpus I want descriptions as first-class retrieval primitives, but the approaches complement each other—I’ll likely run both.[1]

5. Why Qdrant + hybrid search wiring
Qdrant chosen for:[6][7][8]

  • Native hybrid (dense + sparse named vectors, built-in RRF via Universal Query API—no custom fusion)
  • HNSW in RAM + on-disk payload (on_disk_payload: true) → millions of abstracts without RAM explosion
  • Collection isolation (independent re-index/WAL per subject area)

HNSW is tuned around m=16, ef_construct=100; larger collections get higher memmap_threshold for disk mapping.[6]
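In qdrant-client terms, each subject collection looks roughly like this (collection name from the list above; the exact sparse configuration in my setup may differ):

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="arxiv-cs-ml-ai",
    vectors_config={  # named dense vector from BGE-M3
        "dense": models.VectorParams(size=1024, distance=models.Distance.COSINE),
    },
    sparse_vectors_config={  # named sparse vector for the SPLADE-style weights
        "sparse": models.SparseVectorParams(),
    },
    hnsw_config=models.HnswConfigDiff(m=16, ef_construct=100),
    on_disk_payload=True,  # abstracts + metadata live on disk, HNSW stays in RAM
)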

6. Embeddings: BGE-M3 across the board
Three containerized BGE-M3 roles:[9][10]

  • dense: 1024-dim (cosine)
  • sparse: SPLADE-style token weights (exact matches: authors, arXiv IDs like 2305.14314, math notation)
  • reranker: cross-encoder on fused candidates

RRF merges: strong in both → boosted; strong in one only → penalized. In practice it’s a big win on technical retrieval.[7][11][6]
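The hybrid query then goes through Qdrant's Universal Query API with built-in RRF; a sketch with placeholder vectors (the real ones come from the BGE-M3 services):

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

dense_vec = [0.0] * 1024  # placeholder: BGE-M3 dense embedding of the query
sparse_vec = models.SparseVector(indices=[42, 1337], values=[0.8, 0.3])  # placeholder

hits = client.query_points(
    collection_name="arxiv-cs-ml-ai",
    prefetch=[
        models.Prefetch(query=dense_vec, using="dense", limit=50),
        models.Prefetch(query=sparse_vec, using="sparse", limit=50),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # server-side fusion, no custom code
    limit=20,
)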

Surprise: all three are lightweight on Spark—tiny GPU footprint vs the main model (Qwen3-80B etc.).[1]

NeMo Retriever comparison
I did review NVIDIA’s NeMo Retriever docs and benchmarks while building this stack.[12][13][9] The Llama-3.2-NeMo-Retriever-1B-VLM-Embed model delivers impressive multimodal accuracy—up to 84.5% Recall@5 on Digital Corpora datasets and strong performance on financial documents like earnings reports.[12] Their full extraction pipeline (nv-ingest + nemoretriever-parse) handles charts, tables, and infographics, with NVIDIA citing up to 15× faster multimodal PDF extraction than open-source alternatives.[12][13]

The trade-off on Spark is resource footprint. Running the full NeMo Retriever stack locally (embedding + rerank + multimodal extraction NIMs) assumes dedicated GPU slices—nemoretriever-parse really wants its own GPU, with documentation noting you need ~24 GB VRAM minimum for optimal throughput.[13][9] On Spark, that footprint directly competes with the main event: the 80B parameter generation model. BGE-M3 containers sip resources in comparison, coexisting comfortably alongside vLLM/Qwen3-80B on one node. For enterprise deployments with dedicated infrastructure, the full NeMo stack is absolutely the right choice.[12] On an edge box where generation is the priority and VRAM is the hard bottleneck, “lighter with good enough performance on academic text” wins for now.[1][13]

7. Query-time collection routing
Users see none of this. The entry point (an OpenAI-compatible /v1/chat/completions endpoint) handles:[1]

  1. Cache check (Valkey/Redis)
  2. Small router (e.g. GLM-4-9B-Chat) classifies query
  3. Routes to 1+ collections (e.g. “dropout vs batch normalization” → arxiv-cs-ml-ai + Dropout_Techniques; “residual connections in transformers” → Transformer_Architectures + Residual_Networks; “Adam optimizer convergence” → Optimization_Methods + arxiv-cs-ml-ai)
  4. Parallel dense + sparse embed
  5. Hybrid RRF search
  6. Optional SearXNG for web/time-sensitive → merge
  7. BGE-M3 rerank
  8. Prompt → vLLM (Qwen3-80B / Phi-4-mini)
  9. Citations, confidence, cache write, Langfuse trace

LangGraph orchestrates this (parallel branches, conditionals, fallbacks) while exposing a clean API to Open WebUI/LiteLLM.[1]
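Steps 2–3 in miniature, with the router called through the same OpenAI-compatible endpoint (the prompt and model alias are illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-local")  # LiteLLM proxy

COLLECTIONS = [
    "arxiv-cs-ml-ai", "Dropout_Techniques", "Transformer_Architectures",
    "Residual_Networks", "Optimization_Methods",
]

def route_query(query: str) -> list[str]:
    resp = client.chat.completions.create(
        model="glm-4-9b-chat",  # the small router model, as configured in LiteLLM
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": (
                f"Choose up to 2 collections from {COLLECTIONS} for this query:\n"
                f"{query}\nReply with a comma-separated list only."
            ),
        }],
    )
    picks = [c.strip() for c in resp.choices[0].message.content.split(",")]
    return [c for c in picks if c in COLLECTIONS] or ["arxiv-cs-ml-ai"]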

8. vs Open WebUI built-ins
Even without figures, subject routing + hybrid RRF + rerank + cache/web fusion beats flat-file blobs (no hybrid, no routing, no rerank). Multimodal captions should widen the gap significantly for visual STEM content once they’re live.[6][1]

Happy to share stripped-down code/config—Qdrant hybrid schema (named vectors + RRF), embedding service setup, or LangGraph wiring. Which would you want first?

Sources
[1] Building Local + Hybrid LLMs on DGX Spark That Outperform Top Cloud Models (this thread)
[2] arXiv Dataset | Kaggle
[3] arxiv-search by WILLOSCAR | Agent Skills
[4] Understanding Docling for Structured Document Processing
[5] Docling: The Document Alchemist | Towards Data Science
[6] Hybrid Search Revamped – Building with Qdrant’s Query API
[7] Demo: Implementing a Hybrid Search System – Qdrant
[8] Hybrid Search and the Universal Query API – Qdrant
[9] BGE-M3 — BGE documentation
[10] BGE-M3 — BGE documentation
[11] BAAI/bge-m3 · Reranker – Hugging Face
[12] NVIDIA NeMo Retriever Delivers Accurate Multimodal PDF Data Extraction 15x Faster – https://developer.nvidia.com/blog/nvidia-nemo-retriever-delivers-accurate-multimodal-pdf-data-extraction-15x-faster/
[13] Run Multimodal Extraction for More Efficient AI Pipelines Using One GPU | NVIDIA Technical Blog

5 Likes

@AoE

Just trying to replicate what you have done with Docling but not able to get it working with GPU.

Question for you: Have you gotten Docling working with GPU acceleration? Or are you running it in CPU mode too?

If you did manage to get Docling working with GPU acceleration can you please share Dockerfile and .yaml and server.py.

Many thanks in advance.

Mark

I’m using the cuda130 image and it should be using the GPU, but I haven’t run any tests to specifically check whether it is.

Here is the compose file. I’m not compiling anything, but based on the recipe for docling-serve, I’m sure you can get it to perform better with a custom image.

  docling:
    image: ghcr.io/docling-project/docling-serve-cu130:main
    container_name: docling
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      DOCLING_SERVE_ENABLE_UI: false
      DOCLING_SERVE_ENABLE_REMOTE_SERVICES: true
      DOCLING_SERVE_ALLOW_CUSTOM_VLM_CONFIG: true
      DOCLING_SERVE_ALLOW_CUSTOM_PICTURE_DESCRIPTION_CONFIG: true
      DOCLING_SERVE_ALLOW_CUSTOM_CODE_FORMULA_CONFIG: true
      DOCLING_SERVE_MAX_SYNC_WAIT: 600
      DOCLING_SERVE_ENG_LOC_NUM_WORKERS: 4
      DOCLING_NUM_THREADS: 4
      UVICORN_WORKERS: 1
      DOCLING_DEVICE: cuda
      DOCLING_PERF_PAGE_BATCH_SIZE: 16
      DOCLING_SERVE_OCR_BATCH_SIZE: 2
      DOCLING_SERVE_LAYOUT_BATCH_SIZE: 16
      DOCLING_SERVE_TABLE_BATCH_SIZE: 4
    volumes:
      - /opt/share/models/docling:/opt/app-root/src/.cache/docling/models
      - /opt/share/models/docling/torch_kernels:/opt/app-root/src/.cache/torch/kernels
      - /opt/share/uploads:/uploads:ro
      - /opt/share/docling/chunks:/data/chunks:rw
    command: >
      bash -c "docling-serve run"

And thanks for the previous reply. I still need to dig into the detailed information that you provided :)