Running a Full LLM Stack on DGX Spark GB10 (Your Application -> LiteLLM -> llama-swap -> vLLM / llama.cpp / Ollama)

Running a Full Multi-Model LLM Stack on DGX Spark (GB10) — With VRAM Orchestration

Hey Guys, I (and the AI) have been working on this for quite a while and this is what I have so far.
Once it is set up it shall make using the LLMs as easy as possible — the loading and unloading is done by llama-swap. No matter if you use vLLM, llama.cpp or Ollama.
LiteLLM is used to route the LLMs. You can also use it for fallbacks or adding remote models to the stack.

The biggest issue so far is speed. It works reliably but it is not fast.
If someone has a better, easier, faster, more reliable solution please let me know.

After working on this for weeks and having a working solution I asked the AI to help generate a tutorial for it and this is what came out.


Hardware: NVIDIA DGX Spark (Grace Blackwell GB10) — 128 GB unified CPU/GPU memory, SM12.1 GPU architecture, ARM64 (SBSA) CPU

What this stack gives you: A single OpenAI-compatible API endpoint that dynamically swaps 10+ models in and out of VRAM on demand — 4B GGUF models, 30B FP8, 120B+ MoE, VLMs — with no manual docker run required. Requests come in through LiteLLM, which routes to llama-swap, which spins up the right vLLM/llama.cpp container for that model and kills it again after idle timeout.

This guide builds directly on the outstanding work of:

  • @eugrspark-vllm-docker provides the pre-built vLLM + FlashInfer wheels compiled specifically for the GB10. Without this, building a working vLLM image from source takes 2–4 hours and frequently breaks on nightly. He also authored llama-benchy, the standardized benchmarking tool used throughout this guide. Massive thanks — this stack would not be usable in practice without both of those projects.
  • @christopherowenspark-vllm-mxfp4-docker and the associated forks of vLLM, FlashInfer, and CUTLASS that enable native MXFP4 quantization on GB10. This is what makes OpenAI GPT-OSS-120B actually run at ~57 tok/s on a single Spark.

GitHub repo (all Dockerfiles + configs): GitHub - mARTin-B78/dgx-spark_lite-llm_llama-swap_vllm_llama-cpp_ollama: LLM Stack for nVidia DGX Spark containing LiteLLM, LamaSwap, vLLM, Llama.cpp and ollama · GitHub


Architecture Overview

Your Client like Open-WebUI / VSCode / OpenClaw / Homeassistant / etc...
        ↓
   LiteLLM :14000          ← unified API key, routing, model aliases
        ↓
  llama-swap :28080         ← VRAM orchestrator, loads/evicts on demand
     /    |    \
vLLM   vLLM   llama.cpp    ← one ephemeral container per model
(30B)  (120B)  (GGUF)

All containers share the dgx_net Docker bridge network. Model containers attach to llama-swap’s network namespace (--network container:llama-swap), so they reach localhost:PORT inside llama-swap’s own network — this is how llama-swap knows when a model is ready.

Key insight for the GB10’s unified memory: CUDA sees ~121.7 GiB of the 128 GB physical RAM. Every model you load eats into that shared pool. llama-swap’s gpu_memory_utilization setting is your lever — it tells vLLM “claim at most X% of 121.7 GiB”.


Benchmark Results (2026-04-22, single DGX Spark GB10)

Measured with llama-benchy by @eugr.
pp2048 = prompt processing 2048 tokens (tok/s, higher is better) · tg128 = token generation 128 tokens (tok/s, higher is better)

S tier — Small / fast

Model Engine pp2048 (tok/s) tg128 (tok/s) Notes
Nemotron-3-Nano-4B-FP8 vLLM 8179 39.8 Instant responder, great for orchestration tasks
Nemotron-3-Nano-30B-A3B-NVFP4 vLLM 7417 55.9 Fastest 30B generation on the Spark
Qwen3.5-35B-Uncensored-Q4_K_M llama.cpp 1798 57.1 GGUF via llama.cpp, competitive with FP8 vLLM

M tier — Medium

Model Engine pp2048 (tok/s) tg128 (tok/s) Notes
Qwen3.5-35B-A3B-FP8 vLLM 4439 49.1 Solid all-rounder with native reasoning
Qwen3.6-35B-A3B-FP8 vLLM 4969 49.5 Slightly faster prefill than 3.5
Qwen3-VL-30B-A3B-Instruct-FP8 vLLM 9217 51.9 Vision model, exceptional prefill speed
Qwen3-Omni-30B-A3B-Instruct vLLM 5227 30.1 Audio + image + text multimodal
Qwen3-Coder-Next-FP8-Dynamic vLLM 3946 32.9 Full-precision coder model
Qwen3-Coder-Next-int4-AutoRound vLLM 4425 66.7 INT4 quant — fastest generation in M tier
Mistral-Small-24B-Instruct-2501 vLLM 2064 4.5 Dense model, limited by memory bandwidth

L tier — Large (solo only, evicts all others)

Model Engine pp2048 (tok/s) tg128 (tok/s) Notes
Qwen3.5-122B-A10B-int4-AutoRound vLLM (tf5) 2048 23.8 Best reasoning, fits in ~90 GB
GPT-OSS-120B (MXFP4) vLLM (mxfp4) 4703 56.4 Exceptional speed for a 120B model
Nemotron-3-Super-120B-A12B-NVFP4 vLLM 1823 14.5 NVIDIA’s flagship reasoning model

GPT-OSS-120B at 56 tok/s is the standout — a 120B model generating tokens faster than most 35B models thanks to CUTLASS MXFP4 kernels on the Blackwell architecture.


Prerequisites

  • DGX Spark with Ubuntu 22.04/24.04
  • Docker + NVIDIA Container Runtime (nvidia-container-runtime as default runtime in /etc/docker/daemon.json)
  • Portainer (optional but recommended for managing the stack)
  • GitHub account with a Personal Access Token (PAT) for GHCR image publishing
  • huggingface-cli for model downloads
// /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "args": []
        }
    }
}

Create the shared Docker network (one-time):

docker network create dgx_net

Step 1 — Clone the Repo

git clone https://github.com/mARTin-B78/dgx-spark_lite-llm_llama-swap_vllm_llama-cpp_ollama.git
cd dgx-spark_lite-llm_llama-swap_vllm_llama-cpp_ollama

Create your .env file:

cp .env.sample .env
# Edit .env and fill in:
#   GH_USER=your-github-username
#   GH_PAT=ghp_your_personal_access_token
#   LLM_ROOT_PATH=/home/YOUR_USER/LLMs
#   REPO_CONFIG_PATH=/home/YOUR_USER/Docker/dgx-spark_lite-llm_llama-swap_vllm_llama-cpp_ollama
#   LITELLM_MASTER_KEY=sk-choose-a-secure-key
#   POSTGRES_PASSWORD=choose-a-db-password

Step 2 — Build and Push the Base Images

The build_and_push.sh script builds and pushes five images to your GitHub Container Registry:

Image Purpose Dockerfile
llama-cpp-spark llama.cpp compiled for SM12.1 (arch 121) llama-cpp/llama-cpp.Dockerfile
llama-swap-spark llama-swap proxy (ARM64 binary) llama-swap/llama-swap.Dockerfile
ollama-spark Ollama (pinned version mirror) ollama/ollama.Dockerfile
litellm-spark LiteLLM gateway (stable release mirror) LiteLLM/litellm.Dockerfile
vllm-spark vLLM from nightly wheels (lightweight) vllm/vllm.Dockerfile
bash build_and_push.sh

What the llama.cpp build does differently for the Spark: It uses nvidia/cuda:13.1.0-devel-ubuntu24.04 (CUDA 13.1 is required to compile for architecture 121), creates the missing libcuda.so.1 ARM64 stub, and compiles with -DCMAKE_CUDA_ARCHITECTURES="121". Pre-built ARM64 llama.cpp binaries from most distros will silently fall back to CPU-only — this build ensures you actually use the GPU.


Step 3 — Build the vLLM Model-Serving Images

Credit: @eugrspark-vllm-docker

The build system in vllm/build/spark-vllm-docker/ downloads pre-built vLLM + FlashInfer wheels from eugr’s GitHub releases. These wheels are compiled specifically for the GB10 (CUDA 13.1, SM12.1a, ARM64 SBSA) and updated regularly. Without them, every build requires compiling FlashInfer and vLLM from source — a 2–4 hour process. The build script automatically falls back to source compilation if the pre-built wheels aren’t available or if you pass --rebuild-vllm/--rebuild-flashinfer. Thank you @eugr for maintaining this — it makes iterating on the stack practical.

The build_and_push.sh’s vllm-spark image is a lightweight wrapper. For actually serving models you need the purpose-built images from vllm/build/spark-vllm-docker/:

cd vllm/build/spark-vllm-docker

Build the standard image (used for Nemotron-30B, Qwen3-VL, Qwen3-Omni, Mistral, etc.):

bash build-and-copy.sh
# produces: vllm-node

Build with Transformers 5.x (required for Qwen3.5-122B-MoE, Qwen3.6-35B, Qwen3-Coder-Next — models using the newer Mamba/hybrid architecture):

bash build-and-copy.sh --tf5
# produces: vllm-node-tf5

Build with experimental MXFP4 support (for OpenAI GPT-OSS-120B):

Credit: @christopherowenspark-vllm-mxfp4-docker

This build uses christopherowen’s forks of vLLM, FlashInfer, and CUTLASS that add native MXFP4 quantization support for the GB10’s CUTLASS kernels. The --mxfp4-backend CUTLASS + --mxfp4-layers moe,qkv,o,lm_head flags this enables are what push GPT-OSS-120B from ~35 tok/s to ~57 tok/s on a single Spark. This is not in upstream vLLM yet. Huge thanks to @christopherowen for this work.

bash build-and-copy.sh --exp-mxfp4
# produces: vllm-node-mxfp4
# takes ~1 hour (compiles FlashInfer + CUTLASS fork from source)

Build times on the GB10 using @eugr’s pre-built wheels: ~15 minutes. From source: 2–4 hours.


Step 4 — Download Models

Install the HuggingFace CLI:

pip install huggingface-hub

Download each model into the directory structure expected by the config. Replace $LLM_ROOT_PATH with your actual path (e.g. /home/YOUR_USER/LLMs):

BASE=$LLM_ROOT_PATH/vllm

# --- S tier ---
huggingface-cli download nvidia/Nemotron-3-Nano-4B-FP8 \
  --local-dir $BASE/Nvidia/Nemotron-3-Nano-4B-FP8
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --local-dir $BASE/Nvidia/Nemotron-3-Nano-30B-A3B-NVFP4
huggingface-cli download Intel/Qwen3-Coder-Next-int4-AutoRound \
  --local-dir $BASE/Alibaba/Qwen3-Coder-Next-int4-AutoRound

# --- M tier ---
huggingface-cli download Qwen/Qwen3.5-35B-A3B-FP8 \
  --local-dir $BASE/Alibaba/Qwen3.5-35B-A3B-FP8
huggingface-cli download Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
  --local-dir $BASE/Alibaba/Qwen3-VL-30B-A3B-Instruct-FP8
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --local-dir $BASE/Alibaba/Qwen3-Omni-30B-A3B-Instruct
huggingface-cli download Qwen/Qwen3-Coder-Next-FP8-Dynamic \
  --local-dir $BASE/Alibaba/Qwen3-Coder-Next-FP8-Dynamic
huggingface-cli download mistralai/Mistral-Small-24B-Instruct-2501 \
  --local-dir $BASE/Mistral/Mistral-Small-24B-Instruct-2501

# --- L tier ---
huggingface-cli download Intel/Qwen3.5-122B-A10B-int4-AutoRound \
  --local-dir $BASE/Alibaba/Qwen3.5-122B-A10B-int4-AutoRound
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --local-dir $BASE/Nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
huggingface-cli download openai/gpt-oss-120b \
  --local-dir $BASE/OpenAI/GPT-OSS-120B

# --- GGUF (llama.cpp) ---
huggingface-cli download HauhauCS/Qwen3.5-35B-A3B-Uncensored-Aggressive \
  --include "*.gguf" --include "*.jinja" \
  --local-dir $LLM_ROOT_PATH/ollama/Alibaba/Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive

Step 5 — Prepare Config Files

REPO=/home/YOUR_USER/Docker/dgx-spark_lite-llm_llama-swap_vllm_llama-cpp_ollama
mkdir -p $REPO/llama-swap/scripts

llama-swap/config.yaml — the heart of the stack. Copy and patch the sample:

cp llama-swap/config.yaml.sample llama-swap/config.yaml
sed -i "s|/path/to/LLMs|$LLM_ROOT_PATH|g" llama-swap/config.yaml
sed -i "s|/path/to/Docker|$HOME/Docker|g" llama-swap/config.yaml

Key concepts in the config:

host: "0.0.0.0"
port: 8080
readyTimeout: 3600

macros:
  host: "0.0.0.0"
  tensor_parallel: "1"

groups:
  # S group: small/fast models — swap:true means evict previous before loading next
  small-models:
    swap: true
    exclusive: true   # evict M and L groups when any S model loads

  # M group: 30B FP8 / 24B BF16
  medium-models:
    swap: true
    exclusive: true

  # L group: 120B+ MoE, always solo
  large-models:
    swap: true
    exclusive: true

models:
  MyModel-30B:
    ttl: 600          # evict after 600s idle
    readyTimeout: 600
    cmd: >
      docker run --rm --name vllm-mymodel-${PORT}
      --runtime nvidia --gpus all --ipc=host
      --network container:llama-swap
      -e NVIDIA_DISABLE_FORWARD_COMPATIBILITY=1
      -v /home/YOUR_USER/LLMs/vllm:/models/vllm
      vllm-node
      vllm serve /models/vllm/MyOrg/MyModel-30B
      --served-model-name MyModel-30B
      --host ${host} --port ${PORT}
      --gpu-memory-utilization 0.70
      --max-model-len 131072
      --kv-cache-dtype fp8
      --load-format fastsafetensors
      --enable-prefix-caching
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --reasoning-parser qwen3
    cmdStop: "docker stop vllm-mymodel-${PORT}"

The --network container:llama-swap flag is the linchpin — it puts the model container inside llama-swap’s network namespace so it binds to localhost:${PORT}, which llama-swap proxies. Without it, llama-swap can’t reach the model.

See llama-swap/config.yaml.sample for the full annotated config with all models.

Critical memory math for 128 GB GB10:

CUDA-visible total: ~121.7 GiB

S tier (4B–30B quant):  0.50–0.65 × 121.7 = 61–79 GiB  → swap:true
M tier (30B–35B FP8):   0.60–0.75 × 121.7 = 73–91 GiB  → swap:true
L tier (120B+ MoE):     0.70–0.85 × 121.7 = 85–103 GiB → swap:true, solo

LiteLLM/config.yaml — copy the sample and update your master key:

cp LiteLLM/config.yaml.sample LiteLLM/config.yaml
sed -i "s|sk-your-litellm-master-key|YOUR_MASTER_KEY|g" LiteLLM/config.yaml

Each llama-swap model needs an entry pointing to http://llama-swap:8080/v1:

model_list:
  - model_name: MyModel-30B
    litellm_params:
      model: openai/MyModel-30B
      api_base: "http://llama-swap:8080/v1"
      api_key: "sk-your-master-key"
      supports_reasoning: true
      include_reasoning: true
      merge_reasoning_content_in_choices: true

See LiteLLM/config.yaml.sample for all models with their reasoning flags.


Step 6 — Dynamic VRAM Launcher for Large Models

Large MoE models (120B+ INT4/FP4) hit a recurring failure: after the previous model container exits, the CUDA memory allocator on the unified-memory GB10 doesn’t immediately return all memory to the free pool. vLLM’s startup check free_memory >= gpu_memory_utilization × total fails with a hardcoded high utilization value.

The fix is a small wrapper script that queries actual free VRAM at launch time and computes the safe utilization dynamically. Add it to llama-swap/scripts/ (this directory is mounted into the llama-swap container at /app/scripts/):

llama-swap/scripts/launch-large-model.sh — adapt MODEL_PATH, container name, and vllm flags for your model:

#!/bin/bash
# Dynamically sets --gpu-memory-utilization based on actually-free VRAM at launch time.
# Usage (from llama-swap cmd): /app/scripts/launch-large-model.sh PORT HOST
set -euo pipefail

PORT="${1}"
HOST="${2}"

# Query free/total VRAM via nvidia-smi inside the (already-cached) vllm image.
# Adds ~3s overhead — negligible vs. the several minutes this model takes to load.
MEM_LINE=$(docker run --rm --runtime nvidia --gpus all \
    vllm-node-tf5:latest \
    sh -c "nvidia-smi --query-gpu=memory.free,memory.total --format=csv,noheaders,nounits | head -1" \
    2>/dev/null || true)

FREE_MIB=$(echo "$MEM_LINE" | awk -F',' '{gsub(/ /,"",$1); print $1+0}')
TOTAL_MIB=$(echo "$MEM_LINE" | awk -F',' '{gsub(/ /,"",$2); print $2+0}')

# GB10: nvidia-smi reports 128 GiB (131072 MiB) unified memory;
# CUDA sees only 121.69 GiB (124610 MiB). Subtract that delta + 3 GiB safety margin
# so the computed value matches vLLM's view of available memory.
GMEM=$(awk -v f="$FREE_MIB" -v t_nv="$TOTAL_MIB" 'BEGIN {
    cuda_t   = 124610;
    safety   = 3072;
    overhead = (t_nv > cuda_t) ? t_nv - cuda_t : 0;
    cuda_free = f - overhead - safety;
    if (cuda_free < 0) cuda_free = 0;
    u = cuda_free / cuda_t;
    if (u > 0.85) u = 0.85;
    if (u < 0.60) u = 0.60;
    printf "%.2f", u;
}')

if [ -z "$GMEM" ] || [ "$FREE_MIB" = "0" ]; then
    echo "[auto-gmem] WARNING: VRAM query failed, using fallback 0.75"
    GMEM="0.75"
fi

echo "[auto-gmem] nvidia-smi free=${FREE_MIB}MiB / total=${TOTAL_MIB}MiB → gpu_memory_utilization=${GMEM}"

exec docker run --rm --name "vllm-mymodel-122b-${PORT}" \
    --runtime nvidia --gpus all --ipc=host --network container:llama-swap \
    -e NVIDIA_DISABLE_FORWARD_COMPATIBILITY=1 \
    -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
    -v /home/YOUR_USER/LLMs/vllm:/models/vllm \
    vllm-node-tf5:latest \
    vllm serve /models/vllm/MyOrg/MyModel-122B \
    --served-model-name MyModel-122B \
    --host "${HOST}" --port "${PORT}" \
    --gpu-memory-utilization "${GMEM}" \
    --max-model-len 131072 \
    --kv-cache-dtype fp8 \
    --load-format fastsafetensors \
    --attention-backend FLASHINFER \
    --mamba-ssm-cache-dtype float16 \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3
chmod +x llama-swap/scripts/launch-large-model.sh

Reference it from llama-swap/config.yaml:

  MyModel-122B:
    ttl: 3600
    readyTimeout: 1800
    cmd: /app/scripts/launch-large-model.sh ${PORT} ${host}
    cmdStop: "docker stop vllm-mymodel-122b-${PORT}"

Step 7 — Deploy the Stack via Portainer

In Portainer → Stacks → Add Stack → paste the following. Replace all YOUR_* placeholders before deploying.

version: '3.8'
services:

  # ==========================================
  # 1. LLAMA.CPP (GGUF Engine)
  # ==========================================
  llama-server:
    image: ghcr.io/martin-b78/llama-cpp-spark:latest
    container_name: llama.cpp
    restart: unless-stopped
    ulimits:
      memlock: -1
      stack: 67108864
    ipc: host
    security_opt:
      - seccomp:unconfined
    ports:
      - "18080:18080"
    volumes:
      - /home/YOUR_USER/LLMs/ollama:/models/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
    command:
      - --host
      - "0.0.0.0"
      - --port
      - "18080"
      - --parallel
      - "4"
      - --no-mmap
      - --context-shift
      - --models-dir
      - /models
      - --n-gpu-layers
      - "99"
      - --ctx-size
      - "16384"
    stdin_open: true
    tty: true
    networks:
      - dgx_net

  # ==========================================
  # 2. OLLAMA
  # ==========================================
  ollama:
    image: ghcr.io/martin-b78/ollama-spark:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - /home/YOUR_USER/LLMs/ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_NUM_PARALLEL=1
      - OLLAMA_LLM_LIBRARY=cuda_v13
    ipc: host
    ulimits:
      memlock: -1
      stack: 67108864
    networks:
      - dgx_net

  # ==========================================
  # 3. LLAMA-SWAP (VRAM Orchestrator)
  # Port 28080 (host) → 8080 (container)
  # LiteLLM reaches it via Docker DNS: http://llama-swap:8080
  # Model containers attach via: --network container:llama-swap
  # ==========================================
  llama-swap:
    image: ghcr.io/martin-b78/llama-swap-spark:latest
    container_name: llama-swap
    restart: unless-stopped
    ports:
      - "28080:8080"
    entrypoint: ["/usr/bin/llama-swap", "-config", "/app/config.yaml", "-listen", "0.0.0.0:8080"]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    volumes:
      - /home/YOUR_USER/Docker/REPO_DIR/llama-swap:/app
      - /var/run/docker.sock:/var/run/docker.sock
      - /home/YOUR_USER/LLMs:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - GITHUB_TOKEN=YOUR_GITHUB_PAT
    networks:
      - dgx_net

  # ==========================================
  # 4. LITELLM DATABASE
  # ==========================================
  litellm-db:
    image: postgres:15-alpine
    container_name: litellm-postgres
    restart: unless-stopped
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm_admin
      POSTGRES_PASSWORD: YOUR_DB_PASSWORD
    ports:
      - "15432:5432"
    volumes:
      - litellm_db_data:/var/lib/postgresql/data
    networks:
      - dgx_net

  # ==========================================
  # 5. LITELLM GATEWAY
  # ==========================================
  litellm:
    image: ghcr.io/martin-b78/litellm-spark:latest
    container_name: litellm
    restart: unless-stopped
    depends_on:
      - litellm-db
    ports:
      - "14000:4000"
    environment:
      - DATABASE_URL=postgresql://litellm_admin:YOUR_DB_PASSWORD@litellm-db:5432/litellm
      - LITELLM_MASTER_KEY=YOUR_MASTER_KEY
    volumes:
      - /home/YOUR_USER/Docker/REPO_DIR/LiteLLM/config.yaml:/app/config.yaml
    command:
      - "--config"
      - "/app/config.yaml"
      - "--port"
      - "4000"
    networks:
      - dgx_net

networks:
  dgx_net:
    external: true

volumes:
  litellm_db_data:

Step 8 — Verify the Stack

# llama-swap up and listing all configured models
curl http://localhost:28080/v1/models | python3 -m json.tool

# LiteLLM health
curl http://localhost:14000/health

# Trigger a model load (llama-swap starts the container on first request)
curl http://localhost:28080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"MyModel-30B","messages":[{"role":"user","content":"Hello"}]}'

# Watch llama-swap load the container in real time
docker logs -f llama-swap

Step 9 — Benchmarking

Credit: @eugrllama-benchy

llama-benchy is a standardized LLM benchmark tool that measures prompt-processing (pp) and token-generation (tg) throughput in a reproducible way specifically designed for comparing results across the DGX Spark community. The benchmark script in this repo is a wrapper around llama-benchy. Thank you @eugr for creating and maintaining it — the consistent output format makes it possible to compare results across different model configs and post meaningful numbers to the forums.

Install llama-benchy first:

pip install llama-benchy
# or: uvx llama-benchy  (no install needed with uv)

Then run the full benchmark across all configured models:

bash benchmark-models.sh --endpoint http://localhost:28080

The script tests each model sequentially, runs a coherence check to detect repetition loops, and writes a results summary to test-results/. The llama-benchy output table is formatted for direct copy-paste into forum posts.


Tips & Common Issues

Model containers fail with “port already in use”
llama-swap assigns ports dynamically from its pool. Make sure the port range in config.yaml doesn’t overlap with other services on the host.

vLLM startup check: free_memory < gpu_memory_utilization × total
After stopping one model container, the CUDA allocator on unified-memory systems can hold freed memory for several seconds. Use the dynamic launcher script from Step 6 instead of a hardcoded --gpu-memory-utilization value for any model over 100B parameters.

Mamba/hybrid models need the tf5 image and an extra flag
Models using the Mamba SSM layers (Qwen3.5-122B-A10B, Qwen3.6-35B, Qwen3-Coder-Next) require vllm-node-tf5:latest (built with --tf5) and --mamba-ssm-cache-dtype float16 in the vllm serve command.

--load-format fastsafetensors is strongly recommended
It loads weight shards in parallel and cuts startup time by ~40% for multi-shard models. Requires model.safetensors.index.json to be present alongside the weight files (all HuggingFace multi-shard models include it).

GPT-OSS-120B MXFP4: skip Ray
Do not use --distributed-executor-backend ray for single-GPU MoE models. Ray’s GCS server + dashboard add ~500 MB overhead, which pushes total allocation past Ray’s 95% OOM threshold. vLLM’s default mp (multiprocessing) executor is leaner and also re-enables async scheduling.

S/M/L group sizing on 128 GB

CUDA-visible: ~121.7 GiB

S (4B–30B quantized):  0.50–0.65 × 121.7 GiB = 61–79 GiB  → swap:true
M (30B–35B FP8):       0.60–0.75 × 121.7 GiB = 73–91 GiB  → swap:true
L (120B+ MoE):         0.70–0.85 × 121.7 GiB = 85–103 GiB → swap:true, solo

With swap: true on all groups, the active model is always evicted before the next one loads. exclusive: true evicts all other groups when a new group activates — essential for preventing OOM when transitioning between tiers.


Acknowledgements

This stack stands on the shoulders of several people’s work:


Repo: GitHub - mARTin-B78/dgx-spark_lite-llm_llama-swap_vllm_llama-cpp_ollama: LLM Stack for nVidia DGX Spark containing LiteLLM, LamaSwap, vLLM, Llama.cpp and ollama · GitHub

6 Likes

This is exactly what I’ve been dreaming of!
Thank you for all your effort and for sharing!

I thought about letting the AI write a Setup / Installer Script..
where you have to put in some of the Variables like your Huggingface Key, some Ports you want to use or avoide, the location of your LLMs on the SSD…

Im trying this, but at step three you say cd vllm/build/spark-vllm-docker

But the folder doesnt exist.

Ive also raised some github issues about the tutorial

Unfortunately, this is pretty much a lot of new tools for me, I’m not experienced enough with docker, completely new to llama-swap, LiteLLM etc… I probably would need a lot more guidance from a tutorial.

It looks like I need to do a lot of edits to the compose.yml or compose.yml.sample file, and then run “docker compose up” at some point, but both files seem to have inconsistent paths in them.

Feeling a bit overwhelmed and out of my depth :| :D

Hey giles8,
thanks that you go through the tutorial.
Sorry that it has issues. - Told the AI to solve them and I think it did.

also went through the issues you reported on github and updated the tutorial

4. ✅ Fixed vllm/build/spark-vllm-docker/ Missing Directory Issue

Issue: Forum post mentioned the folder doesn’t exist after cloning. Users were confused about Step 3.

Fix: Added a clear note explaining this is a Git submodule:


# From the repo root, initialize submodules

git submodule update --init --recursive

# Then navigate to the build folder

cd vllm/build/spark-vllm-docker

This is now prominently displayed in Step 3 so users know what to do if the folder is missing.

Updated Tutorial

Thanks for the updates. I will give it another run in the next few days

Automated Setup Script (Now Available!)

I HAVE NOT TESTED THIS - CAUSE EVERYTHING IS SETUP AT MY SPARK ALREADY - USE AT YOUR OWN RISK

For users who find the manual setup overwhelming, the AI (Claude Haiku 4.5) has created an interactive setup wizard that automates the entire configuration process.

What It Does

The setup.sh script guides you through:

✅ Docker & NVIDIA runtime verification
✅ Service detection (Portainer, LiteLLM, llama.cpp, Ollama, llama-swap)
✅ Port conflict resolution (auto-detects and suggests alternatives)
✅ Credential collection (HuggingFace token, GitHub PAT)
✅ Model tier selection (S/M/L/GGUF)
✅ Automatic .env and docker-compose.yml generation
Time: ~5 minutes of guided prompts vs 45 minutes of manual configuration.

How to Use

#Get the latest code
git clone https://github.com/mARTin-B78/dgx-spark_lite-llm_llama-swap_vllm_llama-cpp_ollama.git
cd dgx-spark_lite-llm_llama-swap_vllm_llama-cpp_ollama

# Run the interactive setup wizard
./setup/setup.sh

# After setup completes, download models based on your selections
./setup/download-models.sh

# Start the stack
docker compose up -d

Two Setup Paths

Automated (Recommended for beginners):

# 5 min guided wizard
./setup/setup.sh  
      
# Auto-download based on selections
./setup/download-models.sh

Manual (For full control):

  1. Follow TUTORIAL.md step-by-step
  2. Manually edit all configuration files
  3. Ideal if you need custom setups

Both approaches end with the same result — a fully configured, running stack!

The Companion download-models.sh

This script reads your tier selections from .env and automatically downloads models:

  • Uses hf download for efficient, resumable transfers
  • Filters GGUF variants to Q4_K_M only (saves 70+ GB!)
  • Shows progress and final disk usage

What’s in the setup folder

setup/
├── setup.sh              # Interactive installer
├── download-models.sh    # Automated model downloader
└── README.md             # Full documentation & troubleshooting

🚀 DGX Spark LLM Stack Update: Configuration Templates & Secure Setup

What’s New:

We’ve completed a comprehensive update to the configuration templates used by the DGX Spark LLM stack, bringing everything in sync with the latest optimized production configurations.

✅ What’s Been Updated

LiteLLM Gateway (config.yaml.sample)

  • 11 optimized model definitions including Qwen 3.5/3.6 series and Nemotron models
  • Reasoning model support with proper tool-calling configurations
  • Ready-to-use template with secure placeholders (no actual API keys exposed)

VRAM Orchestrator (config.yaml.sample)

  • Updated 11 vLLM model launchers with latest optimization flags (MTP-2, FLASHINFER, chunked prefill)
  • Model tier grouping for automatic memory management
  • All paths use generic placeholders (<LLM_ROOT_PATH>, <REPO_CONFIG_PATH>) for instant cross-environment deployment

🎯 Key Advantages

  1. Production-Ready - Models pre-configured with proven optimization flags from live DGX Spark benchmarks

  2. Easy Deployment - Setup scripts automatically substitute placeholders with your paths; templates work across different environments

  3. Latest Models - Includes Qwen 3.6 uncensored, Nemotron-3 Super 120B, and reasoning-enabled models

  4. Clear Documentation - Both automated (5-min setup) and manual (45-min detailed) configuration paths available in README.md

📝 Next Steps

  1. Users can now clone the repo and run bash setup/setup.sh for automated configuration
  2. Manual setup users have detailed, updated templates in each service folder
  3. All defaults reflect real-world optimizations tested on GB10 hardware
2 Likes

If you want to try it, I’m quite keen to build out key-issuance on llm-proxy - GitHub - wentbackward/llm-proxy: Transparent reverse proxy for LLM APIs — virtual models, parameter profiles, and OTel metrics for any OpenAI-compatible backend · GitHub - it’s super-fast, written in Go. The proxy itself is not contributing much to the overall speed, but there’s a nice feature which will help you.

llm-proxy (I probably need a different name) works different to LiteLLM - I’ve created this concept of virtual models where you can specify defaults or clamp the sampling parameters. Take Qwen3.6 for example. You can’t take a coding/thinking model and just turn thinking off, it’s highly sub-optimal. So you can create several virtual models from the same underlying model.

Concrete example from my 2-node spark setup

coder - typical 20t/s - cyankiwi/Qwen3.6-27B-AWQ:
temperature: 0.6, top_p: 0.95, top_k: 20, presence_penalty: 0.1, repetition_penalty: 1.1, min_p: 0.08, enable_thinking: true

coder-moe - typical 40t/s - Qwen/Qwen3.6-35B-A3B-FP8:
temperature: 0.6, top_p: 0.95, top_k: 20, presence_penalty: 0.0, repetition_penalty: 1.0, enable_thinking: true

general - typical 40t/s - Qwen/Qwen3.6-35B-A3B-FP8:
temperature: 1.0, top_p: 0.95, top_k: 20, min_p: 0.00, presence_penalty: 0.0, repetition_penalty: 1.0, enable_thinking: true

instruct - typical 40t/s - cyankiwi/Qwen3.6-27B-AWQ:
temperature: 0.7, top_p: 0.80, top_k: 20, presence_penalty: 1.5, repetition_penalty: 1.05, min_p: 0.00, enable_thinking: false

When I’m performing maintenance, I switch my users over to hugging face, by creating the virtual route in the config file and SIGHUP, they have no idea when I take my sparks down. I also have a bunch of other machines such as Jetson Nano’s with 8GB unified RAM where I run embed’s, classifiers and TTS services. This cheaply keeps the Sparks dedicated. I route the moe to primary on one spark, the coder model head is on the 2nd spark and distrubted. This means openclaw etc runs in it’s bursty nature, coding can get long runs. I have some overhead on 1 spark to play.

In real usage:
openclaw - general
OpenWebUI - general but you also get all the available models from llm-proxy
pi-coder etc - coder-moe for architecture, design, ui, planning, brainstorming
pi-coder etc - coder-moe one-shot coding tasks
pi-coder etc - coder precise coding
system admin - instruct running shell commands etc.
workflow agents - coder-moe or instruct depending on type of task. coder-moe is better for figuring stuff out from SOP’s, instruct is better for constructing tool calls or CLI usage.

So that’s the messy reality (at least it’s my messy reality) of running my own AI. The users can just choose the model they need and the sampling parameters are tuned for the role.

I hope that helps optimizing performance.

For llm-proxy I’m planning to build a bearer token plug-in. Issue a token, permitted to specific models (or unrestricted), rate limits (or unlimited). Let me know if you have any requirements for key management, I’d like to feed that into the design.
Paul

I tried the setup script, have raiised issue in github with the results. I did proceed with hardcoding the script responses, but still found issues.

I think the model downloads are fixed, still dont see anything other than dockerfile in the vllm folder.

I like your concept, and may use it for building my own, but I think the setup needs a few iterations before it can be used.

thanks for giving it another go. Your input is very valuable for me. Cause its really hard to bugfix something that you do not try yourself.
Took me so long to get my system running so I am worried messing it up by trying the installer.

Told the AI to fix it and it did and pushed the changes to github.

Script parses cleanly.

Summary of fixes in setup/setup.sh:

  1. NVIDIA runtime check (setup.sh:115-127) — replaced docker run --rm --runtime=nvidia nvidia/cuda:11.0-runtime nvidia-smi (no arm64 variant, far too old for Blackwell — false negative on the GB10) with a docker info parse. Detects three states: nvidia is the default runtime ✅, nvidia is registered but not default ⚠️, or not present.
  2. Syntax error at line 320 (setup.sh:320-334) — the unquoted (4B-30B...) and (y/n) inside $(echo -e ...) were being parsed by bash as subshells. Quoted the echo -e argument on all four read -p lines.

The reporter on issue #5 should now see “NVIDIA Container Runtime is configured (default runtime: nvidia)” given their daemon.json, and the script will run through Step 10 instead of dying at line 320.