Managing Local LLM Orchestration

Hey guys, I could ask an AI, but honestly I'm more interested in your experience and advice.
I have to say I'm a bit overwhelmed by the choice of tools and the setup procedures.

When you want to use the Spark for different purposes you often need different models, so we all need to manage, load, unload, and orchestrate local LLMs on the DGX Spark.

Inference Engines (Loading, Unloading, and Swapping)
I started with Ollama, because it's the easiest to start with. But it is quite slow.
Now I use vLLM and llama-swap to load and unload the models.
It works, and the models load, but that takes quite some time, especially for big models.

Is llama.cpp faster at that?

I heard there are also SGLang, LlamaEdge, and now Atlas.

Routers & Gateways (Deciding which model to use)
I see LiteLLM mentioned here quite often, so I guess that's a good router to use.
What alternatives are there, and what are their benefits and drawbacks?

What is your go-to workflow?

Yes, I searched the forum.


LiteLLM: The Control Plane Your DGX Spark Stack Actually Needs

Hi Martin — orchestrating local LLMs is genuinely complex and I’m still refining my own setup. Before diving into configs, this graphic landed on my Instagram feed today and it’s the best mental model I’ve seen for framing the problem.

https://www.instagram.com/p/DVJAcNDgGG-/?igsh=bHpzY3k0MGtjY3p3

The insight: LLM → RAG → Agent → Agentic AI is a layered stack, not a tok/s decision. Each outer layer requires all the inner ones. You can’t have a good agent without good RAG. You can’t have good RAG without a capable LLM. The pyramid compounds — which is exactly why benchmark fixation misses the point.


Three Stages of Orchestration

Stage 1 — Simple model picker (LLM layer)
Open WebUI fronts LiteLLM as a unified backend. The “+” in the model selector gives you profiles (Fast, Expert, Code, Cloud Models, etc.), all routing through one proxy. Local vLLM or cloud: same API, same UI.

Stage 2 — Ops-driven routing
LiteLLM becomes a traffic cop: config-driven fallbacks, load balancing, zero downtime while local vLLM is loading. Don’t stress about the model switch; just route to OpenRouter’s free tier. Many Spark Arena models are available there, with limits that aren’t an issue for local-fallback use.
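As a rough sketch, the Stage 2 fallback idea could look like this in LiteLLM's `config.yaml`. The model names and endpoints here are placeholders I made up for illustration, not a tested config — check the LiteLLM docs for the exact `router_settings` syntax before using it:

```yaml
model_list:
  - model_name: local-coder
    litellm_params:
      model: openai/Qwen/Qwen3-Coder-32B        # placeholder model served by vLLM
      api_base: http://localhost:8000/v1
  - model_name: openrouter-coder
    litellm_params:
      model: openrouter/qwen/qwen3-coder        # placeholder OpenRouter model id
      api_key: os.environ/OPENROUTER_API_KEY

router_settings:
  # If local-coder fails (e.g. vLLM is mid-reload), retry on OpenRouter.
  fallbacks:
    - local-coder: ["openrouter-coder"]
```

Clients keep calling `local-coder`; the fallback is invisible to them, which is what makes the "zero downtime while swapping" trick work.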

Stage 3 — Intelligent task-aware selection (RAG + Agent layers)
A small classifier (e.g. Phi-mini, or a LangGraph graph) analyzes each query: “deep research → Expert+RAG profile on the local 80B,” “quick code → Code profile,” “needs live data → web search + tools.” LiteLLM executes; your orchestrator decides. Clean separation of concerns.
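To make the “orchestrator decides, LiteLLM executes” split concrete, here’s a minimal, hypothetical sketch. A trivial keyword table stands in for the small classifier model, and the profile names are made up for illustration:

```python
# Toy task classifier. In practice a small LLM (e.g. a Phi-class model)
# would make this decision; keyword rules keep the sketch self-contained.

RULES = [
    ("code",   ("code", "function", "bug", "refactor")),
    ("search", ("today", "latest", "news", "price")),
    ("expert", ("research", "analyze", "compare", "in depth")),
]

def pick_profile(query: str) -> str:
    """Map a user query to a (made-up) LiteLLM profile name."""
    q = query.lower()
    for profile, keywords in RULES:
        if any(k in q for k in keywords):
            return profile
    return "fast"  # default: cheap local model

# The orchestrator then sends the request to the LiteLLM proxy with
# model=pick_profile(query); LiteLLM handles the actual routing.
```

The point is the separation: this function knows nothing about endpoints or GPUs, and the proxy config knows nothing about task types.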


The Features Nobody Talks About

LiteLLM UI — Hit localhost:4000/ui on your proxy. Live request logs, model availability, usage graphs, key rotation. See exactly which layers of the graphic your queries are hitting, in real time, no cloud console needed. Massively underrated.

LiteLLM DB — Every request gets logged: model, tokens, latency, cost. After a week you know “local Qwen handled 93% of queries, OpenRouter fallback cost $0.” I’m still investigating this — early days.

Claude Code + Local Models (via LiteLLM Proxy + vLLM)

I have just started looking at using local DGX models with Claude Code; any input is most appreciated. This is what I think should work, but I still need to implement it.

Claude Code expects Anthropic’s Messages API (/v1/messages endpoint), but vLLM serves an OpenAI-compatible API (/v1/chat/completions). LiteLLM bridges this perfectly — it exposes a native Anthropic-compatible endpoint (/v1/messages or unified pass-through) while routing to your vLLM instance.
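To illustrate what that translation involves, here is a toy sketch of the request-body mapping. This is my own simplification, not LiteLLM’s actual implementation — real adapters also handle streaming, tool use, content blocks, and the response direction:

```python
def anthropic_to_openai(payload: dict) -> dict:
    """Toy translation of an Anthropic Messages request body
    into OpenAI chat-completions shape."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first message.
    if "system" in payload:
        messages.append({"role": "system", "content": payload["system"]})
    messages.extend(payload["messages"])
    return {
        "model": payload["model"],  # the proxy re-maps this via model_list
        "messages": messages,
        # max_tokens is required by Anthropic, optional for OpenAI
        "max_tokens": payload.get("max_tokens", 1024),
    }
```

Because this shim sits in the proxy, Claude Code never needs to know it is talking to vLLM.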

Quick Setup Recap (vLLM-focused)

In LiteLLM config.yaml, map fake Claude model names to your vLLM backend:

```yaml
model_list:
  - model_name: claude-3-sonnet-20240229          # Claude Code often requests this
    litellm_params:
      model: openai/Qwen/Qwen3-Coder-Next-32B-Instruct   # prefix with openai/ for vLLM
      api_base: http://localhost:8000/v1                 # your vLLM endpoint
      # optional: api_key: "token-abc123" if vLLM has auth enabled

  - model_name: claude-3-opus-20240229
    litellm_params:
      model: openai/DeepSeek/DeepSeek-Coder-V3-236B-Instruct
      api_base: http://localhost:8000/v1

  - model_name: claude-*                          # wildcard catch-all (very useful)
    litellm_params:
      model: openai/Qwen/Qwen3-Coder-Next-32B-Instruct
      api_base: http://localhost:8000/v1
```

Then point Claude Code at your local proxy:

```bash
export ANTHROPIC_BASE_URL="http://localhost:4000"   # or http://localhost:4000/anthropic for pass-through
export ANTHROPIC_API_KEY="sk-1234abcd"              # dummy or your litellm master key
# Optional: force default model
export ANTHROPIC_MODEL="claude-3-sonnet-20240229"

claude --model claude-3-sonnet-20240229             # matches your config mapping
```

Why LiteLLM Should Enable This Seamlessly

  • Protocol Translation — Converts Anthropic Messages API calls to OpenAI format for vLLM on-the-fly — no changes needed in Claude Code.
  • Model Aliasing + Wildcards — Catch any claude-* request and route to your best local coder Qwen3-Coder-Next.
  • Observability Bonus — Hit localhost:4000/ui for real-time logs, per-request latency, token counts, and model usage graphs — invaluable when debugging long agent loops or comparing quantized vs. FP16 runs.
  • Progressive Enhancement — Easily add fallbacks to hosted Claude models later.

Current Status

| Component | Status |
| --- | --- |
| LiteLLM routing | 🔥 Production-ready |
| OpenRouter free tier fallbacks | ✅ Works great |
| Multi-model selection (Open WebUI) | ✅ Solid |
| LiteLLM UI | ✅ Massively underrated |
| LiteLLM DB / telemetry | 🔍 Investigating |
| LangFlow integration | 🧪 Prototyping only |
| Claude Code via proxy | 🧪 Not ready for prime time |

The beauty of this stack: start at Stage 1 and add layers progressively. LiteLLM grows with you. A mediocre model with excellent RAG, tools, and orchestration will beat GPT-4 with none of the above. Build the whole pyramid.

— Mark



I don’t know if it was my original intention, but sparkrun [forum: Sparkrun - central command with tab completion for launching inference on Spark Clusters - #40 by dbsci] is sort of evolving to overlap with llama-swap a bit.

sparkrun is intended for starting/stopping inference models on Spark and unifying how you approach it, whether on a single node, a cluster, or multiple clusters, and whether using vllm, sglang, llama.cpp, trt-llm, etc. The idea is that, with https://spark-arena.com, we’re going to make it easy for people to find recipes and then run them.

And since everyone really relies on litellm, I’ve also added litellm proxy functionality to sparkrun, so that it automatically configures litellm with your running sparkrun models, supports aliases for models, and dynamically updates the config. If people are using it that way, I was also considering future support for automatically starting/stopping models based on demand, but that’s up in the air at this point.

So I think sparkrun is a good way to orchestrate local LLMs… but I’m biased. There is also a claude code plugin so that you can tell claude (or your local LLM ;-) ) to start/stop models.


I plan on using my ASUS GX10 primarily as the planner & orchestrator. I have a dual-GPU workstation (2x Quadro RTX 8000) primarily for coding inference: I’ve got the 122B running at 37 tokens/sec with fast prefill using LM Studio. I need to find the time to try vLLM to see if I can push that a little faster. Ollama was giving me 25 tok/sec; LM Studio with speculative decoding gave me 32, and when I stacked speculative decoding I pushed it to 37 (so far). That leaves my desktop with a 4090 and two 3060s to handle small models for debugging and documentation. The environment is solely for me, not shared, so it should be fast enough.

I would not try to run an entire suite of models for agentic needs on a single GB10 system unless you have the cash to string together two or more. The main model that could likely handle all agentic roles would be Qwen3.5-122B, but I’m not sure yet; if I were limited to only a single robust system, that would be my model (for now).

As far as agentic frameworks go, honestly I think one could manipulate Claude Code to use local LLMs (there’s a guide somewhere), but my preference is VS Code with a seasoned (if you consider barely one year of build-out “seasoned”) plugin which can be configured with multiple agent types (planner, orchestrator, coder, etc.). Hook that into Git… not sure how much better it can get. I’m looking forward to finalizing my setup.

Do you find it possible to load models concurrently with TP>1 using sglang? When I looked into it with vllm, Ray was the bottleneck, and KubeRay wasn’t a path around the GPU consumption. I had to go hacky and load an embedding model using llama.cpp in CPU mode. I’ve since stopped because of the firmware struggles and just put that on my dev machine. But loading a 120-250B, a 20-35B, and an embedding model would be pretty great.

Yes. For example, I have run simultaneous small models on a 4x node cluster (using --tp 4 for all models) by allocating less RAM per model. You do need to be careful about overallocating, though; it’s on you to manage that (use the --gpu-mem option alongside careful consideration of RAM usage). Once you find what you like, you can keep and manage custom recipes for yourself (to avoid needing more CLI options). And sparkrun lets you add registries, so you can keep your registries in a git repo and share them if you wish.

sparkrun automatically increments ports for ray/torch to avoid conflicts, etc. when coordinating multiple models. (It does not auto-increment your desired port for the model itself though – because that might cause problems… however, the sparkrun proxy functionality was designed to also handle that part)

You can also do it with vllm and ray if you run multiple ray instances, but that’s just extra overhead. That’s actually why the default mode for vllm with sparkrun is torch distributed rather than ray.

For sglang, it’s always with torch distributed. For vllm, it can either be with torch distributed or in ray mode.

Note that vllm recipes from eugr’s repo default to ray instead of torch distributed, to maintain compatibility. But if executed via sparkrun, the logistics of running multiple ray instances to allow multiple models are handled for you.

And for sglang recipes with sparkrun, it’s always using torch distributed at this point.


Hey Mark, thanks for the detailed reply.

That helps a lot. I passed Stage 1 (letting Open WebUI pick the model manually) and am currently on Stage 2, using llama-swap to load and unload the vLLM models. But now I will set up LiteLLM too, as an addition to or replacement for llama-swap.

I just asked Gemini for a comparison, and it says LiteLLM cannot load and unload local LLMs, so llama-swap would still be needed:

| Feature | LiteLLM | llama-swap |
| --- | --- | --- |
| Primary goal | Unifying APIs, load balancing, cost tracking | Hot-swapping local models to save VRAM |
| Target environment | Cloud and production | Local development and home servers |
| Model lifecycle | Routes requests; does not start/stop local models | Actually starts and stops the local server processes |
| API support | 100+ providers (cloud & local) | Any local server with an OpenAI-compatible endpoint |
| Resource footprint | Moderate (Python, optional database) | Very low (compiled Go binary) |

Another good thing you mentioned is the OpenRouter fallback. That is a good idea. And also the trick of making Claude Code use local LLMs instead of the Anthropic models.
I use OpenCode instead of Claude Code because I did not know about this before. But I have to say I don’t miss anything; I can even add the Anthropic models like Opus 4.6 for complex tasks too.

Thanks for your reply dbsci, sparkrun sounds interesting and very useful.

Especially with the spark-arena.com recipes that work on the Spark. I wasted countless hours with Gemini and Claude trying to get the right settings for different models to load and run stably on the Spark. It feels a bit like 1992, when you had to write your own Autoexec.bat and Config.sys to make all the sound card and video drivers load for your favourite game and still be able to boot without crashing the system because of the 640 KB memory limit.

An easy and reliable way of downloading and running models is highly appreciated.
We all know this is a device for developers, but I thought I’d develop WITH AI, not FOR AI.
Meaning I want to do stuff with AI, not spend too much time making the AI work in the first place.

@dbsci

Drew,

Very interested in your earlier comment in this thread on LiteLLM integration with SparkRun


> And due to everyone really relying on litellm, I’ve also added litellm proxy functionality to sparkrun (so that it automatically configures litellm with your running sparkrun models and supports aliases for models as well as dynamically updating config).

This would be extremely useful… Do you have any documentation on this? I just checked https://sparkrun.dev/ and could not find anything.

Many thanks

Mark

Haha good times – DOS and sound card compatibility and managing extended mode memory via TSRs – I remember those days!

Yeah, that’s the idea… sparkrun is meant to handle the logistics so that you can focus on actually doing stuff with AI. One of the many reasons I made sparkrun was to automate the process of finding the best base model for a task before engaging in fine-tuning. So being able to run sparkrun via the claude code plugin was a super nerdy fun moment: having claude code start a model (with tp4), run a short eval set, save the result, and iterate, relying on my hardware to do it.

I’ve basically done it as a “soft launch” because I honestly haven’t put it through its paces. There are actually a bunch of undocumented sparkrun features (eek) because they are things that I am still testing.

Here is the “announcement” forum link (that has some limited info): Sparkrun - central command with tab completion for launching inference on Spark Clusters - #35 by dbsci

I’ll post something more detailed shortly.

@griffith.mark

Short docs here:

Longer docs on website: proxy | sparkrun