DGX Spark: The Sovereign AI Stack — Dual-Model Architecture for Local Inference

Greetings to the community,

I am pleased to share my comprehensive documentation regarding the construction of a generative AI stack that leverages the FP4 revolution with the openai/gpt-oss-20b and 120b models.


Introduction

This documentation describes the architecture and deployment of a local generative AI stack on the NVIDIA DGX Spark (GB10). This infrastructure transforms the DGX into an autonomous AI workstation, capable of rivalling cloud-based services (GPT-4/Copilot) whilst guaranteeing complete data sovereignty and negligible latency.


1. Architectural Philosophy: The “Bicephalous System”

Rather than pursuing a single “average-sized model” (such as a 70B), we have opted for an asymmetrical architecture comprising two specialised cognitive units operating in parallel. This approach elegantly resolves the classical Speed versus Intelligence dilemma:

| Role | Model | Function | Key Characteristic |
| --- | --- | --- | --- |
| The Brain | GPT-OSS 120B | Reasoning, Architecture, Complex Refactoring | Maximum intelligence, acceptable latency |
| The Sprinter | GPT-OSS 20B | Auto-completion, Rapid Chat, Simple Functions | Minimal latency (<20 ms), high throughput |

Both models are served simultaneously via TensorRT-LLM, exposing OpenAI-compatible APIs that any client (VS Code, Open WebUI) may consume.
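Because both endpoints speak the OpenAI chat-completions protocol, a client can route between them with nothing but the standard library. A minimal sketch (the routing heuristic and function names are illustrative, not part of the stack itself):

```python
# Client-side router (sketch): send quick completions to the 20B "Sprinter"
# and heavy reasoning to the 120B "Brain". Ports match the stack's config.
import json
from urllib import request

ENDPOINTS = {
    "brain": "http://localhost:8356/v1/chat/completions",     # GPT-OSS 120B
    "sprinter": "http://localhost:8355/v1/chat/completions",  # GPT-OSS 20B
}

def pick_endpoint(task: str) -> str:
    """Route heavy work to the 120B, everything else to the 20B."""
    heavy = {"architecture", "refactor", "reasoning"}
    return ENDPOINTS["brain" if task in heavy else "sprinter"]

def ask(task: str, model: str, prompt: str) -> dict:
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()
    req = request.Request(pick_endpoint(task), data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:  # requires the servers to be running
        return json.load(resp)
```

The same pattern works from VS Code extensions such as Continue, which simply point their "chat" and "autocomplete" roles at the two different base URLs.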


2. Technical Optimisations (The “Secret Sauce”)

Accommodating 140 billion parameters on a single machine with 128 GB of unified memory constitutes a feat of precision engineering. Herewith are the technical choices that render this possible:

A. MXFP4 Compression (Micro-scaling)

We exploit the Blackwell (GB10) architecture, which natively supports 4-bit format.

  • Impact: The 120B model is reduced from approximately 240 GB (FP16) to approximately 70 GB in RAM.
  • Performance: Utilisation of specialised Tensor Cores for inference without significant precision degradation.
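The arithmetic behind these figures can be sketched quickly, assuming roughly 4.25 effective bits per weight for MXFP4 once per-block scale factors are included (the exact overhead depends on the micro-scaling block size):

```python
# Back-of-envelope weight-memory arithmetic for the figures above.
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (decimal) for a given precision."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

fp16_gb = weight_memory_gb(120, 16)      # ~240 GB: does not fit in 128 GB
mxfp4_gb = weight_memory_gb(120, 4.25)   # ~64 GB, before KV cache/activations
print(f"FP16: {fp16_gb:.0f} GB, MXFP4: {mxfp4_gb:.0f} GB")
```

The gap between ~64 GB of weights and the ~70 GB observed in RAM is plausibly KV cache and runtime overhead, which is why the memory fractions below matter so much.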

B. Memory Management (The “VRAM Tetris”)

Memory allocation is calculated to the gigabyte to prevent OOM (Out Of Memory) conditions:

  1. Slot 1 (20B): Launched first with free_gpu_memory_fraction: 0.5, so its KV cache claims at most half of the memory that is free at load time.
  2. Slot 2 (120B): Launched ten seconds later. Its weights alone occupy roughly 60% of total memory, so its KV cache must be capped more tightly.
    • Safety measure: Configuration of free_gpu_memory_fraction: 0.4 (it consumes only 40% of what remains, leaving headroom for the OS).

C. “Eager” Execution Mode

We utilise TensorRT-LLM’s PyTorch backend, passing --trust_remote_code so that the model’s own Python code can be loaded directly.

  • Advantage: No lengthy and rigid static compilation (.engine). The engine employs the model’s Python code to construct the execution graph dynamically.
  • Flexibility: Enables handling of exotic architectures (MoE — Mixture of Experts) without manually patching JSON configuration files.

3. Services & Ports

The infrastructure is exposed on the DGX’s local network:

| Service | Port | API Endpoint | Target Usage |
| --- | --- | --- | --- |
| GPT-OSS 120B | 8356 | http://localhost:8356/v1 | “Senior Architect” in Continue/Cline |
| GPT-OSS 20B | 8355 | http://localhost:8355/v1 | “Tab Autocomplete” & “Rapid Chat” |

4. Docker Compose Stack

The ensemble is orchestrated by a single container, spark_ai_production, based on the nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev image.

Volume Structure

  • Model Persistence: ~/.cache/huggingface is mounted to prevent re-downloads (70GB+).
  • Launch Scripts: ~/triton_benchmarks/model_engines contains the Bash scripts (launch_120b.sh, launch_20b.sh) that control the engines.

Lifecycle Management

  • Startup: Automatic (restart: unless-stopped).
  • Sequence: The entry script launches the 20B first, waits 10 seconds, then launches the 120B to ensure orderly memory allocation.
  • Logs: Centralised via docker compose logs -f.
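The fixed 10-second sleep is a pragmatic choice, but it is fragile if the 20B takes longer to come up. As an alternative sketch (a hypothetical helper, not part of the published compose file), one could poll the 20B's port until it actually accepts connections before launching the 120B:

```python
# Wait until a TCP port accepts connections, instead of sleeping blindly.
import socket
import time

def wait_for_port(host: str, port: int, timeout_s: float = 300.0) -> bool:
    """Poll until a TCP connect succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(1)  # server not up yet; retry shortly
    return False

# e.g. wait_for_port("localhost", 8355) before starting the 120B launcher
```

Note that "port open" is weaker than "model loaded"; a stricter gate would poll the /v1/models endpoint instead.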

Docker Compose Configuration (docker-compose.trtllm.yml)

services:
  spark-ai-core:
    image: nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
    container_name: spark_ai_production
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    
    network_mode: host
    restart: unless-stopped
    ulimits:
      memlock: -1
      stack: 67108864
    ipc: host
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ~/triton_benchmarks/model_engines:/model_engines
    environment:
      - HF_TOKEN=${HF_TOKEN}
    entrypoint: ["/bin/bash", "-c"]
    command:
      - |
        echo "🚀 Starting DGX Spark Infrastructure..."
        
        pip install "huggingface_hub<1.0" > /dev/null 2>&1
        
        echo "🔵 Launching 20B..."
        /model_engines/launch_20b.sh &
        
        sleep 10
        
        echo "🟣 Launching 120B..."
        /model_engines/launch_120b.sh &
        
        echo "✅ System online. Streaming logs..."
        touch /var/log/gpt_oss_20b_server.log /var/log/gpt_120b_server.log
        tail -f /var/log/gpt_oss_20b_server.log /var/log/gpt_120b_server.log

5. TensorRT-LLM Serve Launch Scripts

GPT-OSS-20B Launch Script

#!/bin/bash
set -e
MODEL_HANDLE="openai/gpt-oss-20b"
TIKTOKEN_DIR="/tmp/harmony-reqs"
CONFIG_FILE="/tmp/extra-llm-api-config-20b.yml"
LOG_FILE="/var/log/gpt_oss_20b_server.log"
PORT=8355

echo "🌟 STRICT STARTUP (NVIDIA DOC) - PORT $PORT"

export TIKTOKEN_ENCODINGS_BASE="$TIKTOKEN_DIR"
mkdir -p $TIKTOKEN_DIR
if [ ! -f "$TIKTOKEN_DIR/o200k_base.tiktoken" ] || [ ! -f "$TIKTOKEN_DIR/cl100k_base.tiktoken" ]; then
    wget -q -P $TIKTOKEN_DIR https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
    wget -q -P $TIKTOKEN_DIR https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
fi

echo "📥 Verifying files..."
python3 -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='$MODEL_HANDLE')"

cat > $CONFIG_FILE <<YAML
print_iter_log: false
kv_cache_config:
  dtype: "auto"
  free_gpu_memory_fraction: 0.5
cuda_graph_config:
  enable_padding: true
disable_overlap_scheduler: true
YAML

echo "🔥 Launching trtllm-serve..."
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

nohup trtllm-serve "$MODEL_HANDLE" \
  --host 0.0.0.0 \
  --port $PORT \
  --max_batch_size 64 \
  --trust_remote_code \
  --extra_llm_api_options $CONFIG_FILE \
  > $LOG_FILE 2>&1 &

echo "🎉 Launched. Logs: tail -f $LOG_FILE"

GPT-OSS-120B Launch Script

#!/bin/bash
set -e
MODEL_HANDLE="openai/gpt-oss-120b"
TIKTOKEN_DIR="/tmp/harmony-reqs"
CONFIG_FILE="/tmp/extra-llm-api-config-120b.yml"
LOG_FILE="/var/log/gpt_120b_server.log"
PORT=8356

echo "🌟 STRICT STARTUP (NVIDIA DOC) - PORT $PORT"

export TIKTOKEN_ENCODINGS_BASE="$TIKTOKEN_DIR"
mkdir -p $TIKTOKEN_DIR
if [ ! -f "$TIKTOKEN_DIR/o200k_base.tiktoken" ] || [ ! -f "$TIKTOKEN_DIR/cl100k_base.tiktoken" ]; then
    wget -q -P $TIKTOKEN_DIR https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
    wget -q -P $TIKTOKEN_DIR https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
fi

echo "📥 Verifying files..."
python3 -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='$MODEL_HANDLE')"

cat > $CONFIG_FILE <<YAML
print_iter_log: false
kv_cache_config:
  dtype: "auto"
  free_gpu_memory_fraction: 0.4
cuda_graph_config:
  enable_padding: true
disable_overlap_scheduler: true
YAML

echo "🔥 Launching trtllm-serve..."
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

nohup trtllm-serve "$MODEL_HANDLE" \
  --host 0.0.0.0 \
  --port $PORT \
  --max_batch_size 32 \
  --trust_remote_code \
  --extra_llm_api_options $CONFIG_FILE \
  > $LOG_FILE 2>&1 &

echo "🎉 Launched. Logs: tail -f $LOG_FILE"

6. Benchmarking Analysis & Performance

Benchmarking Procedure (genai-perf)

The reference tool is GenAI-Perf, executed from the SDK client container.

Sweep Benchmark Script

#!/bin/bash
set -e
MODEL="gpt-oss-20b"
PORT=8355
TOKENIZER="openai/gpt-oss-120b"
OUTPUT_DIR="artifacts/sweep_data"
mkdir -p $OUTPUT_DIR

echo "🚀 Starting Sweep on $MODEL (Port $PORT)..."

for CONCURRENCY in 1 2 4 8 16 32 64 128; do
    echo "------------------------------------------------"
    echo "🧪 Testing with Concurrency: $CONCURRENCY"
    echo "------------------------------------------------"
    
    genai-perf profile \
      -m "$MODEL" \
      --endpoint-type chat \
      --url localhost:$PORT \
      --concurrency $CONCURRENCY \
      --streaming \
      --synthetic-input-tokens-mean 128 \
      --synthetic-input-tokens-stddev 20 \
      --output-tokens-mean 128 \
      --output-tokens-stddev 20 \
      --tokenizer "$TOKENIZER" \
      --num-prompts 50 \
      --artifact-dir "$OUTPUT_DIR/c${CONCURRENCY}" > /dev/null 2>&1
      
    echo "✅ Complete."
done

echo "🎉 Sweep complete! Data ready for graphical analysis."

Limitations of the “Dev Preview” Version (25.10)

⚠️ Caveat: The Docker image employed (nvcr.io/nvidia/tritonserver:25.10-py3-igpu-sdk) is a “bleeding edge” version that contains incomplete or buggy functionalities.

| Functionality | Status | Error Encountered |
| --- | --- | --- |
| genai-perf analyze | ❌ Non-functional | Python crash: AttributeError: 'Namespace' object has no attribute 'sweep_min' |
| genai-perf compare | ❌ Absent | Command not recognised by the binary |
| --generate-plots | ❌ Non-functional | IO crash: FileNotFoundError: .../plots/config.yaml (directory not created) |

Consequence: The tool generates raw data (CSV) perfectly, but is incapable of producing graphs or visual comparisons.

Solution: Custom Visualisation Script

To address these shortcomings, we employ the Python script plot_genai.py (or plot_advanced), which parses the CSVs and generates the graphs.

Usage:

  • Analyse a single benchmark:

    python3 plot_genai.py profile artifacts/my_result/profile_export.csv
    
  • Compare two models (e.g., 20B vs 120B):

    python3 plot_genai.py compare \
      --files artifacts/result_20b.csv artifacts/result_120b.csv \
      --labels "Fast 20B" "Brain 120B"
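
The trickiest part of parsing these files (mentioned in the results section) is that genai-perf writes numbers with thousands separators, such as "1,157.61", inside quoted CSV fields. A minimal sketch of that parsing step (a hypothetical reimplementation of what plot_genai.py does internally, not the script itself):

```python
# Parse a genai-perf style CSV where numeric values contain thousands
# separators inside quoted fields ("1,157.61" must become 1157.61).
import csv
import io

def parse_metrics(csv_text: str) -> dict:
    """Return {metric_name: {column: float}} from a genai-perf style CSV."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0][1:]
    out = {}
    for row in rows[1:]:
        if not row or not row[0]:
            continue  # skip blank separator rows
        out[row[0]] = {
            col: float(val.replace(",", ""))  # strip thousands separators
            for col, val in zip(header, row[1:]) if val
        }
    return out

sample = 'Metric,avg,p95\n"Time To First Token (ms)","1,157.61","1,464.95"\n'
print(parse_metrics(sample))
```

Using the csv module (rather than splitting on commas by hand) is what keeps the quoted "1,157.61" intact as a single field before the separator is stripped.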
    

7. Results

GPT-OSS-120B at Concurrency 1

| Metric | Average | P95 |
| --- | --- | --- |
| Time To First Token (ms) | 1,157.61 | 1,464.95 |
| Request Latency (ms) | 4,030.15 | 4,897.14 |
| Output Token Throughput (tokens/sec) | 23.66 | n/a |

GPT-OSS-120B at Concurrency 32

| Metric | Average | P95 |
| --- | --- | --- |
| Time To First Token (ms) | 23,862.06 | 33,187.09 |
| Request Latency (ms) | 30,565.06 | 40,744.92 |
| Output Token Throughput (tokens/sec) | 75.35 | n/a |

GPT-OSS-120B at Concurrency 128

| Metric | Average | P95 |
| --- | --- | --- |
| Time To First Token (ms) | 105,049.49 | 142,049.41 |
| Request Latency (ms) | 111,618.57 | 149,083.30 |
| Output Token Throughput (tokens/sec) | 76.81 | n/a |
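
Reading the three concurrency points together: aggregate throughput saturates around 76 tokens/sec, while each individual request slows down roughly linearly with concurrency. A quick derivation from the numbers above:

```python
# Per-request token rate implied by the aggregate throughput figures.
def per_request_tps(aggregate_tps: float, concurrency: int) -> float:
    """Average tokens/sec seen by a single client at a given concurrency."""
    return aggregate_tps / concurrency

print(per_request_tps(23.66, 1))    # ~23.7 t/s per request
print(per_request_tps(75.35, 32))   # ~2.4 t/s per request
print(per_request_tps(76.81, 128))  # ~0.6 t/s per request
```

In other words, pushing concurrency from 32 to 128 bought almost no aggregate throughput (75.35 to 76.81 t/s) while quadrupling per-request latency, which suggests the sweet spot for this setup sits well below 128 concurrent requests.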

I must confess that I encountered considerable difficulty with my plotting scripts, particularly with the comma character serving both as the CSV field separator and as a thousands separator inside quoted numbers. Consequently, I would advise against treating these figures as definitive, as I have yet to conduct thorough benchmarks and cross-validate my results against other established tools.


8. Interpretation of Results: The “Lorry” versus the “Formula 1”

The benchmarks conducted on the GPT-OSS-20B and the 120B reveal the true nature of the NVIDIA GB10 (Blackwell) processor.

The Analogy

  • A Formula 1 (Gaming/Consumer GPU): Designed to travel extremely fast with a single passenger. It exhibits very low latency but saturates rapidly when weight is added.

  • A Heavy Goods Vehicle (DGX Spark / GB10): Designed to transport 50 tonnes. It starts more slowly (higher latency), but whether carrying 1 or 50 passengers, it maintains the same speed without deceleration.

This architectural characteristic makes the GB10 exceptionally well-suited for production workloads where consistent throughput under varying concurrent loads is paramount, rather than minimising single-request latency.


Wish you all the best in your AI adventure,

William


You might want to look into SGLang or at least vLLM. Or llama.cpp if you don’t care about concurrency too much (although it can handle it). All of these can serve gpt-oss-120b with much better performance. 24 t/s for gpt-oss-120b on Spark is way too slow.

You can expect:

  • 35 t/s from vLLM
  • 52 t/s from SGLang (using their lmsys/sglang:spark image)
  • 58 t/s from llama.cpp

(These are single-request numbers; concurrent throughput would be higher: for instance, I get a peak of 125 t/s on 10 concurrent requests with SGLang.)

gpt-oss-20b will be significantly faster, however with gpt-oss-120b that fast I wouldn’t even bother with it. If you need a second model, you can run one of qwen3-vl ones for vision input, for instance.


this is the way

Hello,

You are right. Since then, I’ve done more bench tests with vLLM, and it was very good. However, I can’t say for sure that using vLLM instead of TRT-LLM gives me better stability for the tokens/second (t/s) under concurrent load. I have yet to implement SGLang, but I’ve read good things about it online (cf. article: NVIDIA DGX Spark Review: The AI Appliance Bringing Datacenter Capabilities to Desktops - StorageReview.com).

Thank you for your suggestion. I will try SGLang and run more bench tests when I have time during the holidays.

Wish you all happy festivities.

@WilliamD , I believe you might be interested in the stack I just published GitHub - jdaln/dgx-spark-inference-stack: Serve the home! Inference stack for you Nvidia DGX Spark aka the Grace Blackwell AI supercomputer on your desk. Mostly vLLM based for now

@Julien D— Drop the MIC. This is one of the most comprehensive and practical posts on the DGX Spark forums. Huge thanks for sharing the repo — the on‑demand loading + idle shutdown via gateway/waker is elegant, memory‑smart, and feels production‑ready


Quick frontend question

Are you using Open WebUI and letting users pick models via the model field in the payload, or do you have a custom client / proxy in front of the gateway? I’d love to understand how model selection flows through to the end‑user experience — especially how you expose multiple backends behind a single OpenAI‑compatible endpoint.

I’m very interested in putting together stacks on DGX Spark for STEM research and coding, and I’d like to use my Spark to beat cloud LLMs on both latency and cost for my workloads.

https://forums.developer.nvidia.com/t/building-local-hybrid-llms-on-dgx-spark-that-outperform-top-cloud-models/359569/5


vLLM sleep / wake on Spark

I’ve started assembling stacks and digging into vLLM sleep/wake (L1/L2) on the Spark’s 128 GB unified memory pool. The docs say L1 frees more than 90% of GPU memory with wake times 18–200× faster than cold starts, but that’s mostly discrete‑GPU data. I’m planning to test on Spark soon with pre/post nvidia-smi and torch.cuda.mem_get_info() snapshots to see how it behaves under UMA.
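
The snapshot side of that test plan can be sketched in a few lines (the cuda_snapshot helper assumes a CUDA-enabled PyTorch build, and on Spark's unified memory the "free" figure reflects the shared pool, so it should be read with care):

```python
# Pre/post memory snapshots around a vLLM sleep/wake cycle (sketch).
def snapshot_line(label: str, free_bytes: int, total_bytes: int) -> str:
    """Format a (free, total) pair the way torch.cuda.mem_get_info reports it."""
    used_gb = (total_bytes - free_bytes) / 2**30
    total_gb = total_bytes / 2**30
    return f"{label}: {used_gb:.1f} / {total_gb:.1f} GiB used"

def cuda_snapshot(label: str) -> str:
    import torch  # assumption: CUDA-enabled PyTorch is installed
    free, total = torch.cuda.mem_get_info()
    return snapshot_line(label, free, total)

# Usage: print(cuda_snapshot("before sleep")); ...; print(cuda_snapshot("after wake"))
```

Pairing this with nvidia-smi readings taken outside the process should make it clear whether the ">90% freed" claim for L1 sleep holds under UMA.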

The memory‑usage utility you shared looks perfect — especially if I can get a CSV with actual footprints. That would be a game‑changer for planning realistic multi‑model combos and understanding how much we can truly fit into a single Spark under different KV cache and sleep policies.

Here’s an updated (early) table of combos I’m exploring — not the full set yet:

| Combo ID | Models (core stack) | Active headroom | Best reranker | Added mem (active) | Sleep impact (L1/L2 est.) | Why it fits this stack |
| --- | --- | --- | --- | --- | --- | --- |
| A | gpt‑oss‑120B + Qwen3‑Coder‑30B + GLM‑4.7‑Flash | 8–13 GB | Qwen3‑Reranker‑8B FP8 | ~8 GB | ~30–35 GB → <5 GB | Qwen synergy + strong code/math reranking before the heavy model. |
| B | Nemotron‑3‑Nano‑30B‑A3B + Qwen3‑Coder + GLM‑4.7‑Flash | 13–18 GB | Qwen3‑Reranker‑8B FP8 | ~8 GB | ~7–10 GB → <2 GB | 1M context + Qwen reranker = long‑doc RAG; still leaves room for a VL model. |
| C | Pure Qwen3 family | 18–23 GB | Qwen3‑Reranker‑8B FP8 | ~8 GB | ~8–12 GB → <2 GB | Seamless family instructions + top‑tier code/math; can add a 0.6B embed pair. |
| D | gpt‑oss‑120B + GLM‑4.7‑Flash + Llama‑3.1‑8B | 10–18 GB | Qwen3‑Reranker‑8B FP8 | ~8 GB | ~30–35 GB → <5 GB | gpt‑oss for finance/HFT; reranker cleans chunks before the expensive model. |

Your waker already nails fast switching. Sleep mode might let me keep 2–3 extra models warm, but your container‑per‑model approach may still be more robust for production isolation and failure modes.


Why I’m excited about Nemotron‑3‑Nano and MoE

Newer MoE models like Nemotron‑3‑Nano‑30B‑A3B are changing the game: with only a few billion active parameters, they deliver reasoning, coding, agentic behavior, and long‑context performance that rivals or beats much larger dense predecessors. That’s what makes me optimistic about using vLLM sleep/wake plus MoE to realistically deploy three to five models on a single Spark without brutal tradeoffs.

Model size is just one factor, though. I’m really focused on:

  • Prompt engineering tuned for STEM plus web/RAG.
  • Local RAG with Qdrant.
  • Careful temperature and decoding tuning.
  • Citation control and grounded answers.
  • High‑quality context embeddings and strong reranking.

It’s not the “sexy” part of the stack, but doing this well gets much better results out of whatever base model you’re running.


Request: stats / memory footprints

I saw in your repo TODO that you’re planning to explore more dynamic model combos and are already collecting GPU usage data in stats/. Any chance you could share the actual measured memory footprints (or CSV/output) for the models you’ve run on DGX Spark — including active usage and KV cache if you have it? Even approximate numbers per model plus typical batch size and max tokens would be incredibly useful for designing realistic multi‑model stacks on Spark’s unified memory.
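
A hypothetical collector for that kind of per-model footprint CSV (not part of the published repo) could be as simple as sampling nvidia-smi before and after each model load:

```python
# Sample GPU memory around a model load and emit one CSV row per model.
import subprocess

def parse_used_mib(smi_output: str) -> int:
    """Parse: nvidia-smi --query-gpu=memory.used --format=csv,noheader"""
    # Output looks like "54321 MiB"; keep the numeric part only.
    return int(smi_output.strip().split()[0])

def sample_used_mib() -> int:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        text=True,
    )
    return parse_used_mib(out)

def footprint_row(model: str, before_mib: int, after_mib: int) -> str:
    """One CSV row: model name, memory delta attributable to that load."""
    return f"{model},{after_mib - before_mib}"
```

On a UMA system like Spark, the delta also captures any host-side allocations the load triggered, so pairing it with per-process figures would sharpen the numbers.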

That kind of data would make a huge difference for people trying to do what you’re doing in production, and I’d love to contribute back with my own Spark sleep/wake measurements once I’ve run them.

Massive thanks again — your work is moving the ecosystem forward. I’m excited to follow your TODOs and hopefully contribute (especially around sleep‑mode strategies and multi‑model orchestration).

Legendary. 🚀


Hello @jd36 , this is great! You did a very good job, and thanks to you, it has broadened my horizons. Could you elaborate on what motivated you to build such a stack?

Cheers,
William

Simply getting this to work properly for my local setup :)

Hi @griffith.mark !
Thank you for your feedback. I’ll answer here but let’s continue the conversation on HOW-TO: setup-dgx-spark docker inference - A "Sane" Inference Stack for GB10 (Need Contributors!) please.
This way we don’t hijack the topic here.

Quick frontend question:
There is no frontend, and I am not willing to run more than the inference side on the DGX Spark. The reasoning is that I don’t want to burden the DGX with anything beyond the inference stack, to keep a maximum amount of GPU/CPU for inference. While beating cloud LLMs will not happen, you might be able to approach some of their results with models like Kimi K2, Minimax, etc., but these will require 2-3 DGXs.

What I am open to is to include alternative inference servers that could run instead of vLLM and would also be managed by the waker. This makes sense for this stack so that they don’t conflict. This extends to, for instance, the inference part of the stack for GitHub - dr-vij/ComfyUI-DGX-Spark-Docker-opinionated: This is very very opinionated docker config for Comfy UI I built for my DGX Spark so that we can also serve multimedia inference.

vLLM sleep / wake on Spark
Please let me know how you do once it is working! :) I have a hard time picturing this, but you might know something that I do not.

Prompt engineering tuned for STEM plus web/RAG.
It sounds great that you are getting good results.

Request: stats / memory footprints
I saw in your repo TODO that you’re planning to explore more dynamic model combos and are already collecting GPU usage data in stats/.
→ This really comes after weeks of testing and stats collection for memory footprints, so it’s a long-term goal. Before that, I’d like to see a few more models there. Since I don’t have a bunch of users at home, it will have to wait for heavy agentic use, unless other people take it up and can demonstrate actual numbers.

That kind of data would make a huge difference for people trying to do what you’re doing in production, and I’d love to contribute back with my own Spark sleep/wake measurements once I’ve run them.
Massive thanks again — your work is moving the ecosystem forward. I’m excited to follow your TODOs and hopefully contribute (especially around sleep‑mode strategies and multi‑model orchestration).
→ That will be appreciated, thanks! Looking forward!

jd36

Last time I tried, sleep/wake didn’t work in the cluster.