DGX Spark: The Sovereign AI Stack — Dual-Model Architecture for Local Inference

Greetings to the community,

I am pleased to share my comprehensive documentation regarding the construction of a generative AI stack that leverages the FP4 revolution with the openai/gpt-oss-20b and 120b models.


Introduction

This documentation describes the architecture and deployment of a local generative AI stack on the NVIDIA DGX Spark (GB10). This infrastructure transforms the DGX into an autonomous AI workstation, capable of rivalling cloud-based services (GPT-4/Copilot) whilst guaranteeing complete data sovereignty and negligible latency.


1. Architectural Philosophy: The “Bicephalous System”

Rather than pursuing a single “average-sized model” (such as a 70B), we have opted for an asymmetrical architecture comprising two specialised cognitive units operating in parallel. This approach elegantly resolves the classical Speed versus Intelligence dilemma:

| Role | Model | Function | Key Characteristic |
| --- | --- | --- | --- |
| The Brain | GPT-OSS 120B | Reasoning, Architecture, Complex Refactoring | Maximum intelligence, acceptable latency |
| The Sprinter | GPT-OSS 20B | Auto-completion, Rapid Chat, Simple Functions | Minimal latency (<20 ms), high throughput |

Both models are served simultaneously via TensorRT-LLM, exposing OpenAI-compatible APIs that any client (VS Code, Open WebUI) may consume.
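Because both endpoints speak the OpenAI chat-completions protocol, a client can route between them with nothing but the standard library. A minimal sketch (the routing heuristic and function names are illustrative, not part of the stack itself):

```python
# Client-side router (sketch): send quick completions to the 20B "Sprinter"
# and heavy reasoning to the 120B "Brain". Ports match the stack's config.
import json
from urllib import request

ENDPOINTS = {
    "brain": "http://localhost:8356/v1/chat/completions",     # GPT-OSS 120B
    "sprinter": "http://localhost:8355/v1/chat/completions",  # GPT-OSS 20B
}

def pick_endpoint(task: str) -> str:
    """Route heavy work to the 120B, everything else to the 20B."""
    heavy = {"architecture", "refactor", "reasoning"}
    return ENDPOINTS["brain" if task in heavy else "sprinter"]

def ask(task: str, model: str, prompt: str) -> dict:
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()
    req = request.Request(pick_endpoint(task), data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:  # requires the servers to be running
        return json.load(resp)
```

The same pattern works from VS Code extensions such as Continue, which simply point their "chat" and "autocomplete" roles at the two different base URLs.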


2. Technical Optimisations (The “Secret Sauce”)

Accommodating 140 billion parameters on a single machine with 128 GB of unified memory constitutes a feat of precision engineering. Herewith are the technical choices that render this possible:

A. MXFP4 Compression (Micro-scaling)

We exploit the Blackwell (GB10) architecture, which natively supports 4-bit format.

  • Impact: The 120B model is reduced from approximately 240 GB (FP16) to approximately 70 GB in RAM.
  • Performance: Utilisation of specialised Tensor Cores for inference without significant precision degradation.
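The arithmetic behind these figures can be sketched quickly, assuming roughly 4.25 effective bits per weight for MXFP4 once per-block scale factors are included (the exact overhead depends on the micro-scaling block size):

```python
# Back-of-envelope weight-memory arithmetic for the figures above.
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (decimal) for a given precision."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

fp16_gb = weight_memory_gb(120, 16)      # ~240 GB: does not fit in 128 GB
mxfp4_gb = weight_memory_gb(120, 4.25)   # ~64 GB, before KV cache/activations
print(f"FP16: {fp16_gb:.0f} GB, MXFP4: {mxfp4_gb:.0f} GB")
```

The gap between ~64 GB of weights and the ~70 GB observed in RAM is plausibly KV cache and runtime overhead, which is why the memory fractions below matter so much.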

B. Memory Management (The “VRAM Tetris”)

Memory allocation is calculated to the gigabyte to prevent OOM (Out Of Memory) conditions:

  1. Slot 1 (20B): Launched first with free_gpu_memory_fraction: 0.5, so its KV cache claims at most half of the memory that is free at load time.
  2. Slot 2 (120B): Launched ten seconds later. Its weights alone occupy roughly 60% of total memory, so its KV cache must be capped more tightly.
    • Safety measure: Configuration of free_gpu_memory_fraction: 0.4 (it consumes only 40% of what remains, leaving headroom for the OS).

C. “Eager” Execution Mode

We utilise TensorRT-LLM’s PyTorch backend, passing --trust_remote_code so that the model’s own Python code can be loaded directly.

  • Advantage: No lengthy and rigid static compilation (.engine). The engine employs the model’s Python code to construct the execution graph dynamically.
  • Flexibility: Enables handling of exotic architectures (MoE — Mixture of Experts) without manually patching JSON configuration files.

3. Services & Ports

The infrastructure is exposed on the DGX’s local network:

| Service | Port | API Endpoint | Target Usage |
| --- | --- | --- | --- |
| GPT-OSS 120B | 8356 | http://localhost:8356/v1 | “Senior Architect” in Continue/Cline |
| GPT-OSS 20B | 8355 | http://localhost:8355/v1 | “Tab Autocomplete” & “Rapid Chat” |

4. Docker Compose Stack

The ensemble is orchestrated by a single container, spark_ai_production, based on the nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev image.

Volume Structure

  • Model Persistence: ~/.cache/huggingface is mounted to prevent re-downloads (70GB+).
  • Launch Scripts: ~/triton_benchmarks/model_engines contains the Bash scripts (launch_120b.sh, launch_20b.sh) that control the engines.

Lifecycle Management

  • Startup: Automatic (restart: unless-stopped).
  • Sequence: The entry script launches the 20B first, waits 10 seconds, then launches the 120B to ensure orderly memory allocation.
  • Logs: Centralised via docker compose logs -f.
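The fixed 10-second sleep is a pragmatic choice, but it is fragile if the 20B takes longer to come up. As an alternative sketch (a hypothetical helper, not part of the published compose file), one could poll the 20B's port until it actually accepts connections before launching the 120B:

```python
# Wait until a TCP port accepts connections, instead of sleeping blindly.
import socket
import time

def wait_for_port(host: str, port: int, timeout_s: float = 300.0) -> bool:
    """Poll until a TCP connect succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(1)  # server not up yet; retry shortly
    return False

# e.g. wait_for_port("localhost", 8355) before starting the 120B launcher
```

Note that "port open" is weaker than "model loaded"; a stricter gate would poll the /v1/models endpoint instead.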

Docker Compose Configuration (docker-compose.trtllm.yml)

services:
  spark-ai-core:
    image: nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
    container_name: spark_ai_production
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    
    network_mode: host
    restart: unless-stopped
    ulimits:
      memlock: -1
      stack: 67108864
    ipc: host
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ~/triton_benchmarks/model_engines:/model_engines
    environment:
      - HF_TOKEN=${HF_TOKEN}
    entrypoint: ["/bin/bash", "-c"]
    command:
      - |
        echo "🚀 Starting DGX Spark Infrastructure..."
        
        pip install "huggingface_hub<1.0" > /dev/null 2>&1
        
        echo "🔵 Launching 20B..."
        /model_engines/launch_20b.sh &
        
        sleep 10
        
        echo "🟣 Launching 120B..."
        /model_engines/launch_120b.sh &
        
        echo "✅ System online. Streaming logs..."
        touch /var/log/gpt_oss_20b_server.log /var/log/gpt_120b_server.log
        tail -f /var/log/gpt_oss_20b_server.log /var/log/gpt_120b_server.log

5. TensorRT-LLM Serve Launch Scripts

GPT-OSS-20B Launch Script

#!/bin/bash
set -e
MODEL_HANDLE="openai/gpt-oss-20b"
TIKTOKEN_DIR="/tmp/harmony-reqs"
CONFIG_FILE="/tmp/extra-llm-api-config-20b.yml"
LOG_FILE="/var/log/gpt_oss_20b_server.log"
PORT=8355

echo "🌟 STRICT STARTUP (NVIDIA DOC) - PORT $PORT"

export TIKTOKEN_ENCODINGS_BASE="$TIKTOKEN_DIR"
mkdir -p $TIKTOKEN_DIR
if [ ! -f "$TIKTOKEN_DIR/o200k_base.tiktoken" ] || [ ! -f "$TIKTOKEN_DIR/cl100k_base.tiktoken" ]; then
    wget -q -P $TIKTOKEN_DIR https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
    wget -q -P $TIKTOKEN_DIR https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
fi

echo "📥 Verifying files..."
python3 -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='$MODEL_HANDLE')"

cat > $CONFIG_FILE <<YAML
print_iter_log: false
kv_cache_config:
  dtype: "auto"
  free_gpu_memory_fraction: 0.5
cuda_graph_config:
  enable_padding: true
disable_overlap_scheduler: true
YAML

echo "🔥 Launching trtllm-serve..."
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

nohup trtllm-serve "$MODEL_HANDLE" \
  --host 0.0.0.0 \
  --port $PORT \
  --max_batch_size 64 \
  --trust_remote_code \
  --extra_llm_api_options $CONFIG_FILE \
  > $LOG_FILE 2>&1 &

echo "🎉 Launched. Logs: tail -f $LOG_FILE"

GPT-OSS-120B Launch Script

#!/bin/bash
set -e
MODEL_HANDLE="openai/gpt-oss-120b"
TIKTOKEN_DIR="/tmp/harmony-reqs"
CONFIG_FILE="/tmp/extra-llm-api-config-120b.yml"
LOG_FILE="/var/log/gpt_120b_server.log"
PORT=8356

echo "🌟 STRICT STARTUP (NVIDIA DOC) - PORT $PORT"

export TIKTOKEN_ENCODINGS_BASE="$TIKTOKEN_DIR"
mkdir -p $TIKTOKEN_DIR
if [ ! -f "$TIKTOKEN_DIR/o200k_base.tiktoken" ] || [ ! -f "$TIKTOKEN_DIR/cl100k_base.tiktoken" ]; then
    wget -q -P $TIKTOKEN_DIR https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
    wget -q -P $TIKTOKEN_DIR https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
fi

echo "📥 Verifying files..."
python3 -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='$MODEL_HANDLE')"

cat > $CONFIG_FILE <<YAML
print_iter_log: false
kv_cache_config:
  dtype: "auto"
  free_gpu_memory_fraction: 0.4
cuda_graph_config:
  enable_padding: true
disable_overlap_scheduler: true
YAML

echo "🔥 Launching trtllm-serve..."
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

nohup trtllm-serve "$MODEL_HANDLE" \
  --host 0.0.0.0 \
  --port $PORT \
  --max_batch_size 32 \
  --trust_remote_code \
  --extra_llm_api_options $CONFIG_FILE \
  > $LOG_FILE 2>&1 &

echo "🎉 Launched. Logs: tail -f $LOG_FILE"

6. Benchmarking Analysis & Performance

Benchmarking Procedure (genai-perf)

The reference tool is GenAI-Perf, executed from the SDK client container.

Sweep Benchmark Script

#!/bin/bash
set -e
MODEL="gpt-oss-20b"
PORT=8355
TOKENIZER="openai/gpt-oss-120b"
OUTPUT_DIR="artifacts/sweep_data"
mkdir -p $OUTPUT_DIR

echo "🚀 Starting Sweep on $MODEL (Port $PORT)..."

for CONCURRENCY in 1 2 4 8 16 32 64 128; do
    echo "------------------------------------------------"
    echo "🧪 Testing with Concurrency: $CONCURRENCY"
    echo "------------------------------------------------"
    
    genai-perf profile \
      -m "$MODEL" \
      --endpoint-type chat \
      --url localhost:$PORT \
      --concurrency $CONCURRENCY \
      --streaming \
      --synthetic-input-tokens-mean 128 \
      --synthetic-input-tokens-stddev 20 \
      --output-tokens-mean 128 \
      --output-tokens-stddev 20 \
      --tokenizer "$TOKENIZER" \
      --num-prompts 50 \
      --artifact-dir "$OUTPUT_DIR/c${CONCURRENCY}" > /dev/null 2>&1
      
    echo "✅ Complete."
done

echo "🎉 Sweep complete! Data ready for graphical analysis."

Limitations of the “Dev Preview” Version (25.10)

⚠️ Caveat: The Docker image employed (nvcr.io/nvidia/tritonserver:25.10-py3-igpu-sdk) is a “bleeding edge” version that contains incomplete or buggy functionalities.

| Functionality | Status | Error Encountered |
| --- | --- | --- |
| genai-perf analyze | ❌ Non-functional | Python crash: AttributeError: 'Namespace' object has no attribute 'sweep_min' |
| genai-perf compare | ❌ Absent | Command not recognised by the binary |
| --generate-plots | ❌ Non-functional | IO crash: FileNotFoundError: .../plots/config.yaml (directory not created) |

Consequence: The tool generates raw data (CSV) perfectly, but is incapable of producing graphs or visual comparisons.

Solution: Custom Visualisation Script

To address these shortcomings, we employ the Python script plot_genai.py (or plot_advanced), which parses the CSVs and generates the graphs.

Usage:

  • Analyse a single benchmark:

    python3 plot_genai.py profile artifacts/my_result/profile_export.csv
    
  • Compare two models (e.g., 20B vs 120B):

    python3 plot_genai.py compare \
      --files artifacts/result_20b.csv artifacts/result_120b.csv \
      --labels "Fast 20B" "Brain 120B"
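
The trickiest part of parsing these files (mentioned in the results section) is that genai-perf writes numbers with thousands separators, such as "1,157.61", inside quoted CSV fields. A minimal sketch of that parsing step (a hypothetical reimplementation of what plot_genai.py does internally, not the script itself):

```python
# Parse a genai-perf style CSV where numeric values contain thousands
# separators inside quoted fields ("1,157.61" must become 1157.61).
import csv
import io

def parse_metrics(csv_text: str) -> dict:
    """Return {metric_name: {column: float}} from a genai-perf style CSV."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0][1:]
    out = {}
    for row in rows[1:]:
        if not row or not row[0]:
            continue  # skip blank separator rows
        out[row[0]] = {
            col: float(val.replace(",", ""))  # strip thousands separators
            for col, val in zip(header, row[1:]) if val
        }
    return out

sample = 'Metric,avg,p95\n"Time To First Token (ms)","1,157.61","1,464.95"\n'
print(parse_metrics(sample))
```

Using the csv module (rather than splitting on commas by hand) is what keeps the quoted "1,157.61" intact as a single field before the separator is stripped.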
    

7. Results

GPT-OSS-120B at Concurrency 1

| Metric | Average | P95 |
| --- | --- | --- |
| Time To First Token (ms) | 1,157.61 | 1,464.95 |
| Request Latency (ms) | 4,030.15 | 4,897.14 |
| Output Token Throughput (tokens/sec) | 23.66 | n/a |

GPT-OSS-120B at Concurrency 32

| Metric | Average | P95 |
| --- | --- | --- |
| Time To First Token (ms) | 23,862.06 | 33,187.09 |
| Request Latency (ms) | 30,565.06 | 40,744.92 |
| Output Token Throughput (tokens/sec) | 75.35 | n/a |

GPT-OSS-120B at Concurrency 128

| Metric | Average | P95 |
| --- | --- | --- |
| Time To First Token (ms) | 105,049.49 | 142,049.41 |
| Request Latency (ms) | 111,618.57 | 149,083.30 |
| Output Token Throughput (tokens/sec) | 76.81 | n/a |
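
Reading the three concurrency points together: aggregate throughput saturates around 76 tokens/sec, while each individual request slows down roughly linearly with concurrency. A quick derivation from the numbers above:

```python
# Per-request token rate implied by the aggregate throughput figures.
def per_request_tps(aggregate_tps: float, concurrency: int) -> float:
    """Average tokens/sec seen by a single client at a given concurrency."""
    return aggregate_tps / concurrency

print(per_request_tps(23.66, 1))    # ~23.7 t/s per request
print(per_request_tps(75.35, 32))   # ~2.4 t/s per request
print(per_request_tps(76.81, 128))  # ~0.6 t/s per request
```

In other words, pushing concurrency from 32 to 128 bought almost no aggregate throughput (75.35 to 76.81 t/s) while quadrupling per-request latency, which suggests the sweet spot for this setup sits well below 128 concurrent requests.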

I must confess that I encountered considerable difficulty with my plotting scripts, particularly with the comma character serving both as the CSV field separator and as a thousands separator inside quoted numbers. Consequently, I would advise against treating these figures as definitive, as I have yet to conduct thorough benchmarks and cross-validate my results against other established tools.


8. Interpretation of Results: The “Lorry” versus the “Formula 1”

The benchmarks conducted on the GPT-OSS-20B and the 120B reveal the true nature of the NVIDIA GB10 (Blackwell) processor.

The Analogy

  • A Formula 1 (Gaming/Consumer GPU): Designed to travel extremely fast with a single passenger. It exhibits very low latency but saturates rapidly when weight is added.

  • A Heavy Goods Vehicle (DGX Spark / GB10): Designed to transport 50 tonnes. It starts more slowly (higher latency), but whether carrying 1 or 50 passengers, it maintains the same speed without deceleration.

This architectural characteristic makes the GB10 exceptionally well-suited for production workloads where consistent throughput under varying concurrent loads is paramount, rather than minimising single-request latency.


Wish you all the best in your AI adventure,

William


You might want to look into SGLang or at least vLLM. Or llama.cpp if you don’t care about concurrency too much (although it can handle it). All of these can serve gpt-oss-120b with much better performance. 24 t/s for gpt-oss-120b on Spark is way too slow.

You can expect:

  • 35 t/s from vLLM
  • 52 t/s from SGLang (using their lmsys/sglang:spark image)
  • 58 t/s from llama.cpp

(These are single-request numbers; concurrent throughput would be higher: for instance, I get a peak of 125 t/s on 10 concurrent requests with SGLang.)

gpt-oss-20b will be significantly faster, however with gpt-oss-120b that fast I wouldn’t even bother with it. If you need a second model, you can run one of qwen3-vl ones for vision input, for instance.


this is the way

Hello,

You are right. Since then, I’ve done more bench tests with vLLM, and it was very good. However, I can’t say for sure that using vLLM instead of TRT-LLM gives me better stability for the tokens/second (t/s) under concurrent load. I have yet to implement SGLang, but I’ve read good things about it online (cf. article: NVIDIA DGX Spark Review: The AI Appliance Bringing Datacenter Capabilities to Desktops - StorageReview.com).

Thank you for your suggestion. I will try SGLang and run more bench tests when I have time during the holidays.

Wish you all happy festivities.

@WilliamD , I believe you might be interested in the stack I just published GitHub - jdaln/dgx-spark-inference-stack: Serve the home! Inference stack for you Nvidia DGX Spark aka the Grace Blackwell AI supercomputer on your desk. Mostly vLLM based for now

@Julien D— Drop the MIC. This is one of the most comprehensive and practical posts on the DGX Spark forums. Huge thanks for sharing the repo — the on‑demand loading + idle shutdown via gateway/waker is elegant, memory‑smart, and feels production‑ready


Quick frontend question

Are you using Open WebUI and letting users pick models via the model field in the payload, or do you have a custom client / proxy in front of the gateway? I’d love to understand how model selection flows through to the end‑user experience — especially how you expose multiple backends behind a single OpenAI‑compatible endpoint.

I’m very interested in putting together stacks on DGX Spark for STEM research and coding, and I’d like to use my Spark to beat cloud LLMs on both latency and cost for my workloads.

https://forums.developer.nvidia.com/t/building-local-hybrid-llms-on-dgx-spark-that-outperform-top-cloud-models/359569/5


vLLM sleep / wake on Spark

I’ve started assembling stacks and digging into vLLM sleep/wake (L1/L2) on the Spark’s 128 GB unified memory pool. The docs say L1 frees more than 90% of GPU memory with wake times 18–200× faster than cold starts, but that’s mostly discrete‑GPU data. I’m planning to test on Spark soon with pre/post nvidia-smi and torch.cuda.mem_get_info() snapshots to see how it behaves under UMA.
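
The snapshot side of that test plan can be sketched in a few lines (the cuda_snapshot helper assumes a CUDA-enabled PyTorch build, and on Spark's unified memory the "free" figure reflects the shared pool, so it should be read with care):

```python
# Pre/post memory snapshots around a vLLM sleep/wake cycle (sketch).
def snapshot_line(label: str, free_bytes: int, total_bytes: int) -> str:
    """Format a (free, total) pair the way torch.cuda.mem_get_info reports it."""
    used_gb = (total_bytes - free_bytes) / 2**30
    total_gb = total_bytes / 2**30
    return f"{label}: {used_gb:.1f} / {total_gb:.1f} GiB used"

def cuda_snapshot(label: str) -> str:
    import torch  # assumption: CUDA-enabled PyTorch is installed
    free, total = torch.cuda.mem_get_info()
    return snapshot_line(label, free, total)

# Usage: print(cuda_snapshot("before sleep")); ...; print(cuda_snapshot("after wake"))
```

Pairing this with nvidia-smi readings taken outside the process should make it clear whether the ">90% freed" claim for L1 sleep holds under UMA.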

The memory‑usage utility you shared looks perfect — especially if I can get a CSV with actual footprints. That would be a game‑changer for planning realistic multi‑model combos and understanding how much we can truly fit into a single Spark under different KV cache and sleep policies.

Here’s an updated (early) table of combos I’m exploring — not the full set yet:

| Combo ID | Models (core stack) | Active headroom | Best reranker | Added mem (active) | Sleep impact (L1/L2 est.) | Why it fits this stack |
| --- | --- | --- | --- | --- | --- | --- |
| A | gpt‑oss‑120B + Qwen3‑Coder‑30B + GLM‑4.7‑Flash | 8–13 GB | Qwen3‑Reranker‑8B FP8 | ~8 GB | ~30–35 GB → <5 GB | Qwen synergy + strong code/math reranking before the heavy model. |
| B | Nemotron‑3‑Nano‑30B‑A3B + Qwen3‑Coder + GLM‑4.7‑Flash | 13–18 GB | Qwen3‑Reranker‑8B FP8 | ~8 GB | ~7–10 GB → <2 GB | 1M context + Qwen reranker = long‑doc RAG; still leaves room for a VL model. |
| C | Pure Qwen3 family | 18–23 GB | Qwen3‑Reranker‑8B FP8 | ~8 GB | ~8–12 GB → <2 GB | Seamless family instructions + top‑tier code/math; can add a 0.6B embed pair. |
| D | gpt‑oss‑120B + GLM‑4.7‑Flash + Llama‑3.1‑8B | 10–18 GB | Qwen3‑Reranker‑8B FP8 | ~8 GB | ~30–35 GB → <5 GB | gpt‑oss for finance/HFT; reranker cleans chunks before the expensive model. |

Your waker already nails fast switching. Sleep mode might let me keep 2–3 extra models warm, but your container‑per‑model approach may still be more robust for production isolation and failure modes.


Why I’m excited about Nemotron‑3‑Nano and MoE

Newer MoE models like Nemotron‑3‑Nano‑30B‑A3B are changing the game: with only a few billion active parameters, they deliver reasoning, coding, agentic behavior, and long‑context performance that rivals or beats much larger dense predecessors. That’s what makes me optimistic about using vLLM sleep/wake plus MoE to realistically deploy three to five models on a single Spark without brutal tradeoffs.

Model size is just one factor, though. I’m really focused on:

  • Prompt engineering tuned for STEM plus web/RAG.
  • Local RAG with Qdrant.
  • Careful temperature and decoding tuning.
  • Citation control and grounded answers.
  • High‑quality context embeddings and strong reranking.

It’s not the “sexy” part of the stack, but doing this well gets much better results out of whatever base model you’re running.


Request: stats / memory footprints

I saw in your repo TODO that you’re planning to explore more dynamic model combos and are already collecting GPU usage data in stats/. Any chance you could share the actual measured memory footprints (or CSV/output) for the models you’ve run on DGX Spark — including active usage and KV cache if you have it? Even approximate numbers per model plus typical batch size and max tokens would be incredibly useful for designing realistic multi‑model stacks on Spark’s unified memory.
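
A hypothetical collector for that kind of per-model footprint CSV (not part of the published repo) could be as simple as sampling nvidia-smi before and after each model load:

```python
# Sample GPU memory around a model load and emit one CSV row per model.
import subprocess

def parse_used_mib(smi_output: str) -> int:
    """Parse: nvidia-smi --query-gpu=memory.used --format=csv,noheader"""
    # Output looks like "54321 MiB"; keep the numeric part only.
    return int(smi_output.strip().split()[0])

def sample_used_mib() -> int:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        text=True,
    )
    return parse_used_mib(out)

def footprint_row(model: str, before_mib: int, after_mib: int) -> str:
    """One CSV row: model name, memory delta attributable to that load."""
    return f"{model},{after_mib - before_mib}"
```

On a UMA system like Spark, the delta also captures any host-side allocations the load triggered, so pairing it with per-process figures would sharpen the numbers.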

That kind of data would make a huge difference for people trying to do what you’re doing in production, and I’d love to contribute back with my own Spark sleep/wake measurements once I’ve run them.

Massive thanks again — your work is moving the ecosystem forward. I’m excited to follow your TODOs and hopefully contribute (especially around sleep‑mode strategies and multi‑model orchestration).

Legendary. 🚀


Hello @jd36 , this is great! You did a very good job, and thanks to you, it has broadened my horizons. Could you elaborate on what motivated you to build such a stack?

Cheers,
William

Simply getting this to work properly for my local setup :)

Hi @griffith.mark !
Thank you for your feedback. I’ll answer here but let’s continue the conversation on HOW-TO: setup-dgx-spark docker inference - A "Sane" Inference Stack for GB10 (Need Contributors!) please.
This way we don’t hijack the topic here.

Quick frontend question:
There is no frontend, and I am not willing to run more than the inference side on the DGX Spark. The reasoning is that I don’t want to burden the DGX with anything beyond the inference stack, to keep a maximum amount of GPU/CPU for inference. While beating cloud LLMs will not happen, you might be able to approach some of their results with models like Kimi K2, Minimax, etc., but these will require 2-3 DGXs.

What I am open to is to include alternative inference servers that could run instead of vLLM and would also be managed by the waker. This makes sense for this stack so that they don’t conflict. This extends to, for instance, the inference part of the stack for GitHub - dr-vij/ComfyUI-DGX-Spark-Docker-opinionated: This is very very opinionated docker config for Comfy UI I built for my DGX Spark so that we can also serve multimedia inference.

vLLM sleep / wake on Spark
Please let me know how you do once it is working! :) I have a hard time picturing this, but you might know something that I do not.

Prompt engineering tuned for STEM plus web/RAG.
It sounds great that you are getting good results.

Request: stats / memory footprints
I saw in your repo TODO that you’re planning to explore more dynamic model combos and are already collecting GPU usage data in stats/.
→ This really comes after weeks of testing and stats collection for memory footprints, so it’s a long-term goal. Before that, I’d like to see a few more models there. Since I don’t have a bunch of users at home, it will have to wait for heavy agentic use, unless other people take it up and can demonstrate actual numbers.

That kind of data would make a huge difference for people trying to do what you’re doing in production, and I’d love to contribute back with my own Spark sleep/wake measurements once I’ve run them.
Massive thanks again — your work is moving the ecosystem forward. I’m excited to follow your TODOs and hopefully contribute (especially around sleep‑mode strategies and multi‑model orchestration).
→ That will be appreciated, thanks! Looking forward!

jd36

Last time I tried, sleep/wake didn’t work in the cluster.