LiteLLM: The Control Plane Your DGX Spark Stack Actually Needs
Hi Martin — orchestrating local LLMs is genuinely complex and I’m still refining my own setup. Before diving into configs, this graphic landed on my Instagram feed today and it’s the best mental model I’ve seen for framing the problem.
https://www.instagram.com/p/DVJAcNDgGG-/?igsh=bHpzY3k0MGtjY3p3
The insight: LLM → RAG → Agent → Agentic AI is a layered stack, not a tok/s decision. Each outer layer requires all the inner ones. You can’t have a good agent without good RAG. You can’t have good RAG without a capable LLM. The pyramid compounds — which is exactly why benchmark fixation misses the point.
Three Stages of Orchestration
Stage 1 — Simple model picker (LLM layer)
Open WebUI fronts LiteLLM as a unified backend. The "+" in the model selector gives you profiles (Fast, Expert, Code, Cloud Models, etc.), all routing through one proxy. Local vLLM or cloud — same API, same UI.
Stage 2 — Ops-driven routing
LiteLLM becomes a traffic cop: config-driven fallbacks, load balancing, zero downtime while a local vLLM model is loading. Don't stress about the model switch — just route to OpenRouter's free tier. Many Spark Arena models are available there, with rate limits that aren't an issue for fallback use.
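A minimal sketch of what Stage 2 could look like in config.yaml. The model names and the OpenRouter slug are my assumptions, not tested — the shape of the fallbacks block follows LiteLLM's documented config, but verify against the current docs:

```yaml
model_list:
  - model_name: local-coder
    litellm_params:
      model: openai/Qwen/Qwen3-Coder-Next-32B-Instruct   # local vLLM (assumed name)
      api_base: http://localhost:8000/v1
  - model_name: free-fallback
    litellm_params:
      model: openrouter/qwen/qwen3-coder:free            # assumed OpenRouter slug

litellm_settings:
  fallbacks:
    - local-coder: ["free-fallback"]   # route here when local vLLM is down or loading
```

Clients keep requesting local-coder; the proxy handles the detour transparently.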
Stage 3 — Intelligent task-aware selection (RAG + Agent layers)
A small classifier (Phi-mini or a LangGraph node) analyzes each query: "deep research → Expert+RAG profile on the local 80B," "quick code → Code profile," "needs live data → web search + tools." LiteLLM executes; your orchestrator decides. Clean separation of concerns.
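To make the decide/execute split concrete, here is a toy stand-in for the Stage 3 classifier — a keyword heuristic picking a profile per query. In practice a small model or LangGraph node makes this call; the profile names are illustrative, not real config entries:

```python
# Toy Stage 3 router: maps a query to a LiteLLM model_name ("profile").
# A real setup would replace the keyword heuristic with a small classifier
# model (e.g. Phi-mini) or a LangGraph node; profile names are assumptions.

def pick_profile(query: str) -> str:
    q = query.lower()
    if any(w in q for w in ("research", "summarize", "compare")):
        return "expert-rag"   # deep research -> Expert + RAG on the local 80B
    if any(w in q for w in ("code", "refactor", "bug", "function")):
        return "code"         # quick code -> Code profile
    if any(w in q for w in ("today", "latest", "news", "price")):
        return "web-tools"    # needs live data -> web search + tools
    return "fast"             # default cheap profile

# The orchestrator decides; LiteLLM executes by model_name, e.g.:
# client.chat.completions.create(model=pick_profile(query), messages=...)
```

The point is the interface: the orchestrator only ever emits a model_name, and everything behind that name (local vs. cloud, RAG on or off) lives in LiteLLM config.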
The Features Nobody Talks About
LiteLLM UI — Hit localhost:4000/ui on your proxy. Live request logs, model availability, usage graphs, key rotation. See exactly which layers of the graphic your queries are hitting, in real time, no cloud console needed. Massively underrated.
LiteLLM DB — Every request gets logged: model, tokens, latency, cost. After a week you know “local Qwen handled 93% of queries, OpenRouter fallback cost $0.” I’m still investigating this — early days.
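The kind of question that log answers is simple aggregation. A sketch with fabricated rows — the real schema lives in LiteLLM's Postgres spend tables and will differ:

```python
# Illustrative only: the request rows are made up; LiteLLM's actual DB
# schema and field names are different and should be checked in its docs.

requests = [
    {"model": "local-qwen", "cost": 0.0},
    {"model": "local-qwen", "cost": 0.0},
    {"model": "openrouter-free", "cost": 0.0},
    {"model": "local-qwen", "cost": 0.0},
]

local = sum(1 for r in requests if r["model"].startswith("local-"))
share = 100 * local / len(requests)          # % of queries served locally
spend = sum(r["cost"] for r in requests)     # total fallback cost
print(f"local handled {share:.0f}% of queries, fallback cost ${spend:.2f}")
```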
Claude Code + Local Models (via LiteLLM Proxy + vLLM)
I have just started looking at using local DGX models with Claude Code — any input is most appreciated. This is what I think should work, but I still need to implement it.
Claude Code expects Anthropic’s Messages API (/v1/messages endpoint), but vLLM serves an OpenAI-compatible API (/v1/chat/completions). LiteLLM bridges this perfectly — it exposes a native Anthropic-compatible endpoint (/v1/messages or unified pass-through) while routing to your vLLM instance.
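To see what that bridge actually does, here is a hand-rolled sketch of the shape difference between the two request formats. Illustrative only — LiteLLM's real translation also handles streaming, tool use, and content blocks:

```python
# Minimal illustration of the Anthropic Messages -> OpenAI chat translation
# LiteLLM performs. Only the fields shown are mapped; real requests carry
# many more (tools, stream, stop_sequences, structured content).

def anthropic_to_openai(req: dict) -> dict:
    messages = []
    if "system" in req:                   # Anthropic: system prompt is a top-level field
        messages.append({"role": "system", "content": req["system"]})
    messages.extend(req["messages"])      # user/assistant turns map one-to-one
    return {
        "model": req["model"],            # LiteLLM would remap the claude-* alias here
        "max_tokens": req["max_tokens"],  # required in the Messages API
        "messages": messages,             # OpenAI: system prompt is just another message
    }

anthropic_req = {
    "model": "claude-3-sonnet-20240229",
    "max_tokens": 256,
    "system": "You are a coding assistant.",
    "messages": [{"role": "user", "content": "Write a haiku about vLLM."}],
}
openai_req = anthropic_to_openai(anthropic_req)
```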
Quick Setup Recap (vLLM-focused)
In LiteLLM config.yaml, map fake Claude model names to your vLLM backend:
```yaml
model_list:
  - model_name: claude-3-sonnet-20240229   # Claude Code often requests this
    litellm_params:
      model: openai/Qwen/Qwen3-Coder-Next-32B-Instruct   # prefix with openai/ for vLLM
      api_base: http://localhost:8000/v1                 # your vLLM endpoint
      # optional: api_key: "token-abc123" if vLLM has auth enabled
  - model_name: claude-3-opus-20240229
    litellm_params:
      model: openai/DeepSeek/DeepSeek-Coder-V3-236B-Instruct
      api_base: http://localhost:8000/v1
  - model_name: claude-*                   # wildcard catch-all (very useful)
    litellm_params:
      model: openai/Qwen/Qwen3-Coder-Next-32B-Instruct
      api_base: http://localhost:8000/v1
```
Point Claude Code to your local proxy:
```shell
export ANTHROPIC_BASE_URL="http://localhost:4000"   # or http://localhost:4000/anthropic for pass-through
export ANTHROPIC_API_KEY="sk-1234abcd"              # dummy, or your LiteLLM master key
# Optional: force the default model
export ANTHROPIC_MODEL="claude-3-sonnet-20240229"
claude --model claude-3-sonnet-20240229             # matches your config mapping
```
Why LiteLLM Should Enable This Seamlessly
- Protocol Translation — Converts Anthropic Messages API calls to OpenAI format for vLLM on-the-fly — no changes needed in Claude Code.
- Model Aliasing + Wildcards — Catch any claude-* request and route it to your best local coder, Qwen3-Coder-Next.
- Observability Bonus — Hit localhost:4000/ui for real-time logs, per-request latency, token counts, and model usage graphs — invaluable when debugging long agent loops or comparing quantized vs. FP16 runs.
- Progressive Enhancement — Easily add fallbacks to hosted Claude models when the local stack can't keep up.
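For example, the sonnet alias could gain a hosted safety net. The Anthropic model slug below is an assumption and the whole fragment is untested; the os.environ/ key syntax follows LiteLLM's config convention:

```yaml
model_list:
  - model_name: claude-real
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620   # hosted Claude (assumed slug)
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  fallbacks:
    - claude-3-sonnet-20240229: ["claude-real"]   # local Qwen first, hosted Claude if it fails
```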
Current Status
| Component | Status |
| --- | --- |
| LiteLLM routing | 🔥 Production-ready |
| OpenRouter free tier fallbacks | ✅ Works great |
| Multi-model selection (Open WebUI) | ✅ Solid |
| LiteLLM UI | ✅ Massively underrated |
| LiteLLM DB / telemetry | 🔍 Investigating |
| LangFlow integration | 🧪 Prototyping only |
| Claude Code via proxy | 🧪 Not ready for prime time |
The beauty of this stack: start at Stage 1 and add layers progressively. LiteLLM grows with you. A mediocre model with excellent RAG, tools, and orchestration will beat GPT-4 with none of the above. Build the whole pyramid.
— Mark