DGX Spark: The Sovereign AI Stack — Dual-Model Architecture for Local Inference

Greetings to the community,

I am pleased to share my documentation on building a generative AI stack that leverages the FP4 (MXFP4) revolution with the openai/gpt-oss-20b and openai/gpt-oss-120b models.


Introduction

This documentation describes the architecture and deployment of a local generative AI stack on the NVIDIA DGX Spark (GB10). This infrastructure transforms the DGX into an autonomous AI workstation, capable of rivalling cloud-based services (GPT-4/Copilot) whilst guaranteeing complete data sovereignty and removing network latency from the equation.


1. Architectural Philosophy: The “Bicephalous System”

Rather than pursuing a single “average-sized model” (such as a 70B), we have opted for an asymmetrical architecture comprising two specialised cognitive units operating in parallel. This approach elegantly resolves the classical Speed versus Intelligence dilemma:

| Role | Model | Function | Key Characteristic |
|---|---|---|---|
| The Brain | GPT-OSS 120B | Reasoning, Architecture, Complex Refactoring | Maximum intelligence, acceptable latency |
| The Sprinter | GPT-OSS 20B | Auto-completion, Rapid Chat, Simple Functions | Minimal latency (<20 ms), high throughput |

Both models are served simultaneously via TensorRT-LLM, exposing OpenAI-compatible APIs that any client (VS Code, Open WebUI) may consume.
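
As a quick sanity check that both engines are reachable, the OpenAI-compatible model listing can be queried on each port. This is a minimal sketch; the ports are those documented in section 3 below.

# Minimal sketch: verify that both trtllm-serve instances answer on their OpenAI-compatible endpoints.
for PORT in 8355 8356; do
  echo "--- Models served on port $PORT ---"
  curl -s "http://localhost:$PORT/v1/models" | python3 -m json.tool
done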


2. Technical Optimisations (The “Secret Sauce”)

Accommodating 140 billion parameters (20B + 120B) on a single machine with 128 GB of unified memory constitutes a feat of precision engineering. Here are the technical choices that make this possible:

A. MXFP4 Compression (Micro-scaling)

We exploit the Blackwell (GB10) architecture, which natively supports the MXFP4 4-bit format.

  • Impact: The 120B model is reduced from approximately 240 GB (FP16) to approximately 70 GB in RAM (see the back-of-envelope check after this list).
  • Performance: Utilisation of specialised Tensor Cores for inference without significant precision degradation.
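
As a rough sanity check of those figures, here is a back-of-envelope sketch; it covers weights only and ignores the KV cache, activations, and the exact micro-scaling overhead:

# Back-of-envelope weight footprint for ~120B parameters (weights only; MXFP4 block
# scales and runtime buffers add the remaining gigabytes reported in practice).
awk 'BEGIN {
  params = 120e9
  printf "FP16  : ~%.0f GB (2 bytes per parameter)\n", params * 2   / 1e9
  printf "MXFP4 : ~%.0f GB (4 bits per parameter)\n",  params * 0.5 / 1e9
}'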

B. Memory Management (The “VRAM Tetris”)

Memory allocation is calculated to the gigabyte to prevent OOM (Out Of Memory) conditions:

  1. Slot 1 (120B): Its weights alone (~70 GB) occupy roughly 60% of total memory.
  2. Slot 2 (20B): Fits into the remaining space.
    • Safety measure: each engine's KV cache is capped via free_gpu_memory_fraction (0.5 for the 20B, 0.4 for the 120B, as set in the launch scripts below), so the caches consume only part of what remains and leave headroom for the OS (see the quick check after this list).
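
A quick way to watch the remaining headroom while the engines load, assuming that on the GB10's unified memory the pool reported by free is the same one the GPU allocates from:

# Refresh the unified-memory usage every 5 seconds during start-up.
watch -n 5 free -g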

C. “Eager” Execution Mode

We serve the Hugging Face checkpoints directly via TensorRT-LLM's trtllm-serve with --trust_remote_code, without compiling static engines.

  • Advantage: No lengthy and rigid static compilation (.engine). The engine employs the model’s Python code to construct the execution graph dynamically.
  • Flexibility: Enables handling of exotic architectures (MoE — Mixture of Experts) without manually patching JSON configuration files.

3. Services & Ports

The infrastructure is exposed on the DGX’s local network:

| Service | Port | API Endpoint | Target Usage |
|---|---|---|---|
| GPT-OSS 120B | 8356 | http://localhost:8356/v1 | “Senior Architect” in Continue/Cline |
| GPT-OSS 20B | 8355 | http://localhost:8355/v1 | “Tab Autocomplete” & “Rapid Chat” |
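
To confirm end-to-end generation, a minimal chat-completion request can be sent to either endpoint. This is a sketch: the "model" value is an assumption and should match whatever /v1/models reports for your deployment.

# Minimal sketch of an OpenAI-compatible chat request against the 20B "Sprinter" (port 8355).
# The "model" value below is an assumption; use the name reported by /v1/models.
curl -s http://localhost:8355/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-oss-20b",
        "messages": [{"role": "user", "content": "Write a one-line Python hello world."}],
        "max_tokens": 64
      }' | python3 -m json.tool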

4. Docker Compose Stack

The ensemble is orchestrated by a single container, spark_ai_production, based on the nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev image.

Volume Structure

  • Model Persistence: ~/.cache/huggingface is mounted to prevent re-downloads (70GB+).
  • Launch Scripts: ~/triton_benchmarks/model_engines contains the Bash scripts (launch_120b.sh, launch_20b.sh) that control the engines.
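
A quick check that the host-side layout matches what the compose file expects (paths and script names as used above):

# The launch scripts must exist and be executable before the container starts.
ls -l ~/triton_benchmarks/model_engines/launch_20b.sh ~/triton_benchmarks/model_engines/launch_120b.sh
chmod +x ~/triton_benchmarks/model_engines/*.sh

# The Hugging Face cache grows to 70 GB+ once both checkpoints are downloaded.
du -sh ~/.cache/huggingface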

Lifecycle Management

  • Startup: Automatic (restart: unless-stopped).
  • Sequence: The entry script launches the 20B first, waits 10 seconds, then launches the 120B to ensure orderly memory allocation.
  • Logs: Centralised via docker compose logs -f.

Docker Compose Configuration (docker-compose.trtllm.yml)

services:
  spark-ai-core:
    image: nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
    container_name: spark_ai_production
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    
    network_mode: host
    restart: unless-stopped
    ulimits:
      memlock: -1
      stack: 67108864
    ipc: host
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ~/triton_benchmarks/model_engines:/model_engines
    environment:
      - HF_TOKEN=${HF_TOKEN}
    entrypoint: ["/bin/bash", "-c"]
    command:
      - |
        echo "🚀 Starting DGX Spark Infrastructure..."
        
        pip install "huggingface_hub<1.0" > /dev/null 2>&1
        
        echo "🔵 Launching 20B..."
        /model_engines/launch_20b.sh &
        
        sleep 10
        
        echo "🟣 Launching 120B..."
        /model_engines/launch_120b.sh &
        
        echo "✅ System online. Streaming logs..."
        tail -f /var/log/gpt_oss_20b_server.log /var/log/gpt_120b_server.log
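
With the file above in place, the stack can be brought up and observed as follows (a sketch; the file name is the one given above):

# Start (or restart) the full stack in the background.
docker compose -f docker-compose.trtllm.yml up -d

# Follow both engines' logs, which the entrypoint streams via tail -f.
docker compose -f docker-compose.trtllm.yml logs -f

# Tear down when finished (the model cache persists on the host volume).
docker compose -f docker-compose.trtllm.yml down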

5. TensorRT-LLM Serve Launch Scripts

GPT-OSS-20B Launch Script

#!/bin/bash
set -e
MODEL_HANDLE="openai/gpt-oss-20b"
TIKTOKEN_DIR="/tmp/harmony-reqs"
CONFIG_FILE="/tmp/extra-llm-api-config-20b.yml"
LOG_FILE="/var/log/gpt_oss_20b_server.log"
PORT=8355

echo "🌟 STRICT STARTUP (NVIDIA DOC) - PORT $PORT"

export TIKTOKEN_ENCODINGS_BASE="$TIKTOKEN_DIR"
mkdir -p $TIKTOKEN_DIR
if [ ! -f "$TIKTOKEN_DIR/cl100k_base.tiktoken" ]; then
    wget -q -P $TIKTOKEN_DIR https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
    wget -q -P $TIKTOKEN_DIR https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
fi

echo "📥 Verifying files..."
python3 -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='$MODEL_HANDLE')"

cat > $CONFIG_FILE <<YAML
print_iter_log: false
kv_cache_config:
  dtype: "auto"
  free_gpu_memory_fraction: 0.5
cuda_graph_config:
  enable_padding: true
disable_overlap_scheduler: true
YAML

echo "🔥 Launching trtllm-serve..."
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

nohup trtllm-serve "$MODEL_HANDLE" \
  --host 0.0.0.0 \
  --port $PORT \
  --max_batch_size 64 \
  --trust_remote_code \
  --extra_llm_api_options $CONFIG_FILE \
  > $LOG_FILE 2>&1 &

echo "🎉 Launched. Logs: tail -f $LOG_FILE"

GPT-OSS-120B Launch Script

#!/bin/bash
set -e
MODEL_HANDLE="openai/gpt-oss-120b"
TIKTOKEN_DIR="/tmp/harmony-reqs"
CONFIG_FILE="/tmp/extra-llm-api-config-120b.yml"
LOG_FILE="/var/log/gpt_120b_server.log"
PORT=8356

echo "🌟 STRICT STARTUP (NVIDIA DOC) - PORT $PORT"

export TIKTOKEN_ENCODINGS_BASE="$TIKTOKEN_DIR"
mkdir -p $TIKTOKEN_DIR
if [ ! -f "$TIKTOKEN_DIR/cl100k_base.tiktoken" ]; then
    wget -q -P $TIKTOKEN_DIR https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
    wget -q -P $TIKTOKEN_DIR https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
fi

echo "📥 Verifying files..."
python3 -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='$MODEL_HANDLE')"

cat > $CONFIG_FILE <<YAML
print_iter_log: false
kv_cache_config:
  dtype: "auto"
  free_gpu_memory_fraction: 0.4
cuda_graph_config:
  enable_padding: true
disable_overlap_scheduler: true
YAML

echo "🔥 Launching trtllm-serve..."
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

nohup trtllm-serve "$MODEL_HANDLE" \
  --host 0.0.0.0 \
  --port $PORT \
  --max_batch_size 32 \
  --trust_remote_code \
  --extra_llm_api_options $CONFIG_FILE \
  > $LOG_FILE 2>&1 &

echo "🎉 Launched. Logs: tail -f $LOG_FILE"

6. Benchmarking Analysis & Performance

Benchmarking Procedure (genai-perf)

The reference tool is GenAI-Perf, executed from the SDK client container.

Sweep Benchmark Script

#!/bin/bash
set -e
MODEL="gpt-oss-20b"
PORT=8355
TOKENIZER="openai/gpt-oss-120b"
OUTPUT_DIR="artifacts/sweep_data"
mkdir -p $OUTPUT_DIR

echo "🚀 Starting Sweep on $MODEL (Port $PORT)..."

for CONCURRENCY in 1 2 4 8 16 32 64 128; do
    echo "------------------------------------------------"
    echo "🧪 Testing with Concurrency: $CONCURRENCY"
    echo "------------------------------------------------"
    
    genai-perf profile \
      -m "$MODEL" \
      --endpoint-type chat \
      --url localhost:$PORT \
      --concurrency $CONCURRENCY \
      --streaming \
      --synthetic-input-tokens-mean 128 \
      --synthetic-input-tokens-stddev 20 \
      --output-tokens-mean 128 \
      --output-tokens-stddev 20 \
      --tokenizer "$TOKENIZER" \
      --num-prompts 50 \
      --artifact-dir "$OUTPUT_DIR/c${CONCURRENCY}" > /dev/null 2>&1
      
    echo "✅ Complete."
done

echo "🎉 Sweep complete! Data ready for graphical analysis."

Limitations of the “Dev Preview” Version (25.10)

⚠️ Caveat: The Docker image employed (nvcr.io/nvidia/tritonserver:25.10-py3-igpu-sdk) is a “bleeding edge” version that contains incomplete or buggy functionalities.

| Functionality | Status | Error Encountered |
|---|---|---|
| genai-perf analyze | ❌ Non-functional | Python crash: AttributeError: 'Namespace' object has no attribute 'sweep_min' |
| genai-perf compare | ❌ Absent | Command not recognised by the binary |
| --generate-plots | ❌ Non-functional | IO crash: FileNotFoundError: .../plots/config.yaml (directory not created) |

Consequence: The tool generates raw data (CSV) perfectly, but is incapable of producing graphs or visual comparisons.

Solution: Custom Visualisation Script

To address these shortcomings, we employ the custom Python scripts plot_genai.py or plot_advanced, which parse the CSVs and generate the graphs.

Usage:

  • Analyse a single benchmark:

    python3 plot_genai.py profile artifacts/my_result/profile_export.csv
    
  • Compare two models (e.g., 20B vs 120B):

    python3 plot_genai.py compare \
      --files artifacts/result_20b.csv artifacts/result_120b.csv \
      --labels "Fast 20B" "Brain 120B"
    

7. Results

GPT-OSS-120B at Concurrency 1

| Metric | Average | P95 |
|---|---|---|
| Time To First Token (ms) | 1,157.61 | 1,464.95 |
| Request Latency (ms) | 4,030.15 | 4,897.14 |
| Output Token Throughput (tokens/sec) | 23.66 | n/a |

GPT-OSS-120B at Concurrency 32

| Metric | Average | P95 |
|---|---|---|
| Time To First Token (ms) | 23,862.06 | 33,187.09 |
| Request Latency (ms) | 30,565.06 | 40,744.92 |
| Output Token Throughput (tokens/sec) | 75.35 | n/a |

GPT-OSS-120B at Concurrency 128

| Metric | Average | P95 |
|---|---|---|
| Time To First Token (ms) | 105,049.49 | 142,049.41 |
| Request Latency (ms) | 111,618.57 | 149,083.30 |
| Output Token Throughput (tokens/sec) | 76.81 | n/a |

I must confess that I encountered considerable difficulty with my plotting scripts, particularly because commas appear both inside quoted strings and as thousands separators in the numbers. Consequently, I would advise against treating these figures as definitive, as I have yet to conduct thorough benchmarks and cross-validate my results against other established tools.


8. Interpretation of Results: The “Lorry” versus the “Formula 1”

The benchmarks conducted on the GPT-OSS-20B and the 120B reveal the true nature of the NVIDIA GB10 (Blackwell) processor.

The Analogy

  • A Formula 1 (Gaming/Consumer GPU): Designed to travel extremely fast with a single passenger. It exhibits very low latency but saturates rapidly when weight is added.

  • A Heavy Goods Vehicle (DGX Spark / GB10): Designed to transport 50 tonnes. It starts more slowly (higher latency), but whether carrying 1 or 50 passengers, it maintains the same speed without deceleration.

This architectural characteristic makes the GB10 exceptionally well-suited for production workloads where consistent throughput under varying concurrent loads is paramount, rather than minimising single-request latency.


Wish you all the best in your AI adventure,

William


You might want to look into SGLang or at least vLLM. Or llama.cpp if you don’t care about concurrency too much (although it can handle it). All of these can serve gpt-oss-120b with much better performance. 24 t/s for gpt-oss-120b on Spark is way too slow.

You can expect:

  • 35 t/s from vLLM
  • 52 t/s from SGLang (using their lmsys/sglang:spark image)
  • 58 t/s from llama.cpp

(These are single-request numbers; concurrent throughput would be higher. For instance, I get a peak of 125 t/s throughput on 10 concurrent requests with SGLang.)

gpt-oss-20b will be significantly faster; however, with gpt-oss-120b running that fast I wouldn't even bother with it. If you need a second model, you can run one of the qwen3-vl ones for vision input, for instance.


this is the way

Hello,

You are right. Since then, I've run more benchmark tests with vLLM, and it was very good. However, I can't say for sure whether using vLLM instead of TRT-LLM gives me better tokens/second (t/s) stability under concurrent load. I have yet to try SGLang, but I've read good things about it online (cf. article: NVIDIA DGX Spark Review: The AI Appliance Bringing Datacenter Capabilities to Desktops - StorageReview.com).

Thank you for your suggestion. I will try SGLang and run more benchmark tests when I have time during the holidays.

Wish you all happy festivities.