DGX Spark: The Sovereign AI Stack — Dual-Model Architecture for Local Inference
Greetings to the community,
I am pleased to share my comprehensive documentation regarding the construction of a generative AI stack that leverages the FP4 revolution with the openai/gpt-oss-20b and 120b models.
Introduction
This documentation describes the architecture and deployment of a local generative AI stack on the NVIDIA DGX Spark (GB10). This infrastructure transforms the DGX into an autonomous AI workstation, capable of rivalling cloud-based services (GPT-4/Copilot) whilst guaranteeing complete data sovereignty and negligible latency.
1. Architectural Philosophy: The “Bicephalous System”
Rather than pursuing a single “average-sized model” (such as a 70B), we have opted for an asymmetrical architecture comprising two specialised cognitive units operating in parallel. This approach elegantly resolves the classical Speed versus Intelligence dilemma:
| Role | Model | Function | Key Characteristic |
|---|---|---|---|
| The Brain | GPT-OSS 120B | Reasoning, Architecture, Complex Refactoring | Maximum intelligence, acceptable latency |
| The Sprinter | GPT-OSS 20B | Auto-completion, Rapid Chat, Simple Functions | Minimal latency (<20ms), high throughput |
Both models are served simultaneously via TensorRT-LLM, exposing OpenAI-compatible APIs that any client (VS Code, Open WebUI) may consume.
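As a quick sanity check, any OpenAI-style client or plain curl can talk to both servers. A minimal sketch, assuming the standard /v1/models listing route is exposed (the ports are detailed in section 3):

# List the models each local server exposes; no real API key is required.
curl -s http://localhost:8355/v1/models   # the 20B "Sprinter"
curl -s http://localhost:8356/v1/models   # the 120B "Brain"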
2. Technical Optimisations (The “Secret Sauce”)
Accommodating 140 billion parameters on a single machine with 128 GB of unified memory constitutes a feat of precision engineering. The technical choices that render this possible are as follows:
A. MXFP4 Compression (Micro-scaling)
We exploit the Blackwell (GB10) architecture, which natively supports the 4-bit MXFP4 format.
- Impact: The 120B model is reduced from approximately 240 GB (FP16) to approximately 70 GB in RAM.
- Performance: Utilisation of specialised Tensor Cores for inference without significant precision degradation.
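The quoted reduction follows from simple arithmetic; a rough sketch (real checkpoints keep some tensors at higher precision and add scaling factors, hence roughly 70 GB rather than 60 GB):

# Back-of-the-envelope weight sizes for ~120 billion parameters.
PARAMS_B=120
echo "FP16 : $(( PARAMS_B * 2 )) GB"   # 2 bytes per parameter   -> ~240 GB
echo "MXFP4: $(( PARAMS_B / 2 )) GB"   # ~0.5 byte per parameter -> ~60 GB, ~70 GB in practice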
B. Memory Management (The “VRAM Tetris”)
Memory allocation is calculated to the gigabyte to prevent OOM (Out Of Memory) conditions:
- Slot 1 (120B): Loaded first. Occupies approximately 60% of total memory.
- Slot 2 (20B): Loaded in the remaining space.
- Safety measure: Configuration of free_gpu_memory_fraction: 0.4, so the KV cache consumes only 40% of the memory that remains free, leaving headroom for the OS.
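A rough budget sketch of how the pool is carved up (the 120B weight figure is the approximation quoted above; the 20B weight size is my assumption for illustration, and the KV cache fractions are the values set in the launch scripts of section 5):

# Unified-memory budget sketch for the 128 GB pool (all figures approximate).
TOTAL=128
W_120B=70    # ~70 GB of MXFP4 weights quoted above
W_20B=13     # assumed ~13 GB of MXFP4 weights for the 20B
LEFT=$(( TOTAL - W_120B - W_20B ))
echo "~${LEFT} GB remain for the two KV caches, CUDA buffers and the OS"
# Each server then claims only a fraction of the memory still free when it
# allocates its KV cache (0.5 for the 20B, 0.4 for the 120B), which is what
# keeps the two engines and the OS from colliding.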
C. “Eager” Execution Mode
We serve the Hugging Face checkpoints directly through trtllm-serve with --trust_remote_code, rather than pre-building static engines.
- Advantage: No lengthy, rigid static compilation of .engine files; the engine uses the model's Python code to construct the execution graph dynamically.
- Flexibility: Handles exotic architectures (MoE, Mixture of Experts) without manually patching JSON configuration files.
3. Services & Ports
The infrastructure is exposed on the DGX’s local network:
| Service | Port | API Endpoint | Target Usage |
|---|---|---|---|
| GPT-OSS 120B | 8356 | http://localhost:8356/v1 | “Senior Architect” in Continue/Cline |
| GPT-OSS 20B | 8355 | http://localhost:8355/v1 | “Tab Autocomplete” & “Rapid Chat” |
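A quick way to exercise an endpoint from the command line; a sketch assuming the served model name is the Hugging Face handle passed to trtllm-serve (adjust it to whatever the server reports):

# Ask the 20B "Sprinter" for a completion via the OpenAI-compatible chat route.
curl -s http://localhost:8355/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "Write a haiku about unified memory."}],
        "max_tokens": 64
      }'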
4. Docker Compose Stack
The whole stack is orchestrated by a single container, spark_ai_production, based on the nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev image.
Volume Structure
- Model Persistence: ~/.cache/huggingface is mounted to prevent re-downloads (70 GB+).
- Launch Scripts: ~/triton_benchmarks/model_engines contains the Bash scripts (launch_120b.sh, launch_20b.sh) that control the engines.
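Before the first start-up, the two host paths simply need to exist and the scripts must be executable. A one-off preparation sketch, assuming the launch scripts from section 5 sit in the current directory:

# Prepare the host directories referenced by the volume mounts.
mkdir -p ~/.cache/huggingface ~/triton_benchmarks/model_engines
cp launch_20b.sh launch_120b.sh ~/triton_benchmarks/model_engines/
chmod +x ~/triton_benchmarks/model_engines/launch_*.sh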
Lifecycle Management
- Startup: Automatic (restart: unless-stopped).
- Sequence: The entry script launches the 20B first, waits 10 seconds, then launches the 120B to ensure orderly memory allocation.
- Logs: Centralised via docker compose logs -f.
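Day-to-day operation boils down to a few Compose commands (a sketch, using the file name from the configuration below):

# Bring the stack up, follow the logs, and tear it down when required.
docker compose -f docker-compose.trtllm.yml up -d
docker compose -f docker-compose.trtllm.yml logs -f
docker compose -f docker-compose.trtllm.yml down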
Docker Compose Configuration (docker-compose.trtllm.yml)
services:
  spark-ai-core:
    image: nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
    container_name: spark_ai_production
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    network_mode: host
    restart: unless-stopped
    ulimits:
      memlock: -1
      stack: 67108864
    ipc: host
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ~/triton_benchmarks/model_engines:/model_engines
    environment:
      - HF_TOKEN=${HF_TOKEN}
    entrypoint: ["/bin/bash", "-c"]
    command:
      - |
        echo "🚀 Starting DGX Spark Infrastructure..."
        pip install "huggingface_hub<1.0" > /dev/null 2>&1
        echo "🔵 Launching 20B..."
        /model_engines/launch_20b.sh &
        sleep 10
        echo "🟣 Launching 120B..."
        /model_engines/launch_120b.sh &
        echo "✅ System online. Streaming logs..."
        # Create the log files up front so tail does not exit before the servers write to them
        touch /var/log/gpt_oss_20b_server.log /var/log/gpt_120b_server.log
        tail -f /var/log/gpt_oss_20b_server.log /var/log/gpt_120b_server.log
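Once the container prints “System online”, the first cold start can still take a while (model download plus initialisation). A small readiness sketch, assuming trtllm-serve exposes its usual /health route:

# Poll both servers until their health endpoint answers (ports from section 3).
for PORT in 8355 8356; do
  until curl -sf "http://localhost:${PORT}/health" > /dev/null; do
    echo "Waiting for port ${PORT}..."
    sleep 10
  done
  echo "Port ${PORT} is ready."
done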
5. TensorRT-LLM Serve Launch Scripts
GPT-OSS-20B Launch Script
#!/bin/bash
set -e
MODEL_HANDLE="openai/gpt-oss-20b"
TIKTOKEN_DIR="/tmp/harmony-reqs"
CONFIG_FILE="/tmp/extra-llm-api-config-20b.yml"
LOG_FILE="/var/log/gpt_oss_20b_server.log"
PORT=8355
echo "🌟 STRICT STARTUP (NVIDIA DOC) - PORT $PORT"
export TIKTOKEN_ENCODINGS_BASE="$TIKTOKEN_DIR"
mkdir -p $TIKTOKEN_DIR
if [ ! -f "$TIKTOKEN_DIR/cl100k_base.tiktoken" ]; then
wget -q -P $TIKTOKEN_DIR https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
wget -q -P $TIKTOKEN_DIR https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
fi
echo "📥 Verifying files..."
python3 -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='$MODEL_HANDLE')"
cat > $CONFIG_FILE <<YAML
print_iter_log: false
kv_cache_config:
dtype: "auto"
free_gpu_memory_fraction: 0.5
cuda_graph_config:
enable_padding: true
disable_overlap_scheduler: true
YAML
echo "🔥 Launching trtllm-serve..."
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
nohup trtllm-serve "$MODEL_HANDLE" \
--host 0.0.0.0 \
--port $PORT \
--max_batch_size 64 \
--trust_remote_code \
--extra_llm_api_options $CONFIG_FILE \
> $LOG_FILE 2>&1 &
echo "🎉 Launched. Logs: tail -f $LOG_FILE"
GPT-OSS-120B Launch Script
#!/bin/bash
set -e
MODEL_HANDLE="openai/gpt-oss-120b"
TIKTOKEN_DIR="/tmp/harmony-reqs"
CONFIG_FILE="/tmp/extra-llm-api-config-120b.yml"
LOG_FILE="/var/log/gpt_120b_server.log"
PORT=8356
echo "🌟 STRICT STARTUP (NVIDIA DOC) - PORT $PORT"
export TIKTOKEN_ENCODINGS_BASE="$TIKTOKEN_DIR"
mkdir -p $TIKTOKEN_DIR
if [ ! -f "$TIKTOKEN_DIR/cl100k_base.tiktoken" ]; then
wget -q -P $TIKTOKEN_DIR https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
wget -q -P $TIKTOKEN_DIR https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
fi
echo "📥 Verifying files..."
python3 -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='$MODEL_HANDLE')"
cat > $CONFIG_FILE <<YAML
print_iter_log: false
kv_cache_config:
dtype: "auto"
free_gpu_memory_fraction: 0.4
cuda_graph_config:
enable_padding: true
disable_overlap_scheduler: true
YAML
echo "🔥 Launching trtllm-serve..."
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
nohup trtllm-serve "$MODEL_HANDLE" \
--host 0.0.0.0 \
--port $PORT \
--max_batch_size 32 \
--trust_remote_code \
--extra_llm_api_options $CONFIG_FILE \
> $LOG_FILE 2>&1 &
echo "🎉 Launched. Logs: tail -f $LOG_FILE"
6. Benchmarking Analysis & Performance
Benchmarking Procedure (genai-perf)
The reference tool is GenAI-Perf, executed from the SDK client container.
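A hypothetical way to start that client container; the image tag is the one discussed in the limitations below, while the mount and network settings are my assumptions so that genai-perf can reach the local ports and persist its artifacts:

# Launch the SDK client container on the host network and run the sweep from inside it.
docker run --rm -it --network host \
  -v "$PWD/artifacts:/workspace/artifacts" \
  -w /workspace \
  nvcr.io/nvidia/tritonserver:25.10-py3-igpu-sdk \
  bash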
Sweep Benchmark Script
#!/bin/bash
set -e
MODEL="gpt-oss-20b"
PORT=8355
TOKENIZER="openai/gpt-oss-120b"
OUTPUT_DIR="artifacts/sweep_data"
mkdir -p $OUTPUT_DIR
echo "🚀 Starting Sweep on $MODEL (Port $PORT)..."
for CONCURRENCY in 1 2 4 8 16 32 64 128; do
echo "------------------------------------------------"
echo "🧪 Testing with Concurrency: $CONCURRENCY"
echo "------------------------------------------------"
genai-perf profile \
-m "$MODEL" \
--endpoint-type chat \
--url localhost:$PORT \
--concurrency $CONCURRENCY \
--streaming \
--synthetic-input-tokens-mean 128 \
--synthetic-input-tokens-stddev 20 \
--output-tokens-mean 128 \
--output-tokens-stddev 20 \
--tokenizer "$TOKENIZER" \
--num-prompts 50 \
--artifact-dir "$OUTPUT_DIR/c${CONCURRENCY}" > /dev/null 2>&1
echo "✅ Complete."
done
echo "🎉 Sweep complete! Data ready for graphical analysis."
Limitations of the “Dev Preview” Version (25.10)
⚠️ Caveat: The Docker image employed (nvcr.io/nvidia/tritonserver:25.10-py3-igpu-sdk) is a “bleeding edge” version that contains incomplete or buggy functionalities.
| Functionality | Status | Error Encountered |
|---|---|---|
| genai-perf analyze | ❌ Non-functional | Python crash: AttributeError: 'Namespace' object has no attribute 'sweep_min' |
| genai-perf compare | ❌ Absent | Command not recognised by the binary |
| --generate-plots | ❌ Non-functional | IO crash: FileNotFoundError: .../plots/config.yaml (directory not created) |
Consequence: The tool generates raw data (CSV) perfectly, but is incapable of producing graphs or visual comparisons.
Solution: Custom Visualisation Script
To address these shortcomings, we employ a custom Python script, plot_genai.py (or plot_advanced), which parses the CSVs and generates the graphs.
Usage:
- Analyse a single benchmark:
  python3 plot_genai.py profile artifacts/my_result/profile_export.csv
- Compare two models (e.g., 20B vs 120B):
  python3 plot_genai.py compare \
    --files artifacts/result_20b.csv artifacts/result_120b.csv \
    --labels "Fast 20B" "Brain 120B"
7. Results
GPT-OSS-120B at Concurrency 1
| Metric | Average | P95 |
|---|---|---|
| Time To First Token (ms) | 1,157.61 | 1,464.95 |
| Request Latency (ms) | 4,030.15 | 4,897.14 |
| Output Token Throughput (tokens/sec) | 23.66 | — |
GPT-OSS-120B at Concurrency 32
| Metric | Average | P95 |
|---|---|---|
| Time To First Token (ms) | 23,862.06 | 33,187.09 |
| Request Latency (ms) | 30,565.06 | 40,744.92 |
| Output Token Throughput (tokens/sec) | 75.35 | — |
GPT-OSS-120B at Concurrency 128
| Metric | Average | P95 |
|---|---|---|
| Time To First Token (ms) | 105,049.49 | 142,049.41 |
| Request Latency (ms) | 111,618.57 | 149,083.30 |
| Output Token Throughput (tokens/sec) | 76.81 | — |
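Reading the three tables together gives a feel for the scaling; a quick sketch computed only from the average throughputs reported above:

# Aggregate throughput scaling from the averages above.
awk 'BEGIN {
  c1 = 23.66; c32 = 75.35; c128 = 76.81;      # output tokens/sec at concurrency 1, 32, 128
  printf "1  -> 32 : x%.1f\n", c32 / c1;      # roughly 3.2x more aggregate throughput
  printf "32 -> 128: x%.2f\n", c128 / c32;    # roughly 1.02x, i.e. the GPU is saturated
}'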
I must confess that I encountered considerable difficulty with my plotting scripts, particularly because commas appear both inside quoted values (as thousands separators in the numbers) and as the CSV field delimiter. Consequently, I would advise against treating these figures as definitive, as I have yet to conduct thorough benchmarks and cross-validate my results against other established tools.
8. Interpretation of Results: The “Lorry” versus the “Formula 1”
The benchmarks conducted on the GPT-OSS-20B and the 120B reveal the true nature of the NVIDIA GB10 (Blackwell) processor.
The Analogy
- A Formula 1 (Gaming/Consumer GPU): Designed to travel extremely fast with a single passenger. It exhibits very low latency but saturates rapidly when weight is added.
- A Heavy Goods Vehicle (DGX Spark / GB10): Designed to transport 50 tonnes. It starts more slowly (higher latency), but whether carrying 1 tonne or 50, it maintains the same speed without deceleration.
This architectural characteristic makes the GB10 exceptionally well-suited for production workloads where consistent throughput under varying concurrent loads is paramount, rather than minimising single-request latency.
Wishing you all the best in your AI adventure,
William
