Running DGX Spark as a Unified‑Memory Inference Fabric
I’ve been experimenting with DGX Spark as a unified‑memory inference fabric rather than a single‑request GPU box, and I wanted to share some architecture notes for others who are pushing Spark beyond the usual vLLM defaults.
Setup
- DGX Spark (system updates as of 5/5‑2026)
- vLLM 0.19.0+6bc3197f.nv26.04.48680843
- Qwen 3.6 35B A3B FP8
- Docker 29.2.1
- Custom Kestrel‑based .NET 10 Web API as the orchestration layer
The API surface is intentionally tiny. The client only sees:
- AllocateP0PriorityAsync(seconds): the interactive switch
- RunP0Async(...)
- RunP1Async(...)
- EnqueueP2Async(...)
- AwaitAsync(jobId)
Everything else — scheduling, gating, token budgets, concurrency, background control — is internal.
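To make the shape of that surface concrete, here is a minimal in-memory sketch in Python (the real layer is a .NET Web API, so this is only an illustration). The class name `SparkClient`, the snake_case method names, the `interactive()` helper, and the pairing of `AwaitAsync(jobId)` with jobs created by `EnqueueP2Async(...)` are assumptions made for the example, and `_infer` is a placeholder rather than a real backend call.

```python
import asyncio
import itertools
import time


class SparkClient:
    """Hypothetical stand-in for the five client-facing calls."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._jobs = {}        # job id -> asyncio.Task for background (P2) work
        self._p0_until = 0.0   # monotonic deadline of the interactive window

    async def allocate_p0_priority(self, seconds: float) -> None:
        # The "interactive switch": open or extend the P0 priority window.
        self._p0_until = max(self._p0_until, time.monotonic() + seconds)

    def interactive(self) -> bool:
        return time.monotonic() < self._p0_until

    async def _infer(self, priority: str, prompt: str) -> str:
        # Placeholder for the real call into the inference backend.
        await asyncio.sleep(0)
        return f"[{priority}] {prompt}"

    async def run_p0(self, prompt: str) -> str:
        return await self._infer("P0", prompt)

    async def run_p1(self, prompt: str) -> str:
        return await self._infer("P1", prompt)

    async def enqueue_p2(self, prompt: str) -> int:
        # Background work returns a job id immediately instead of a result.
        job_id = next(self._ids)
        self._jobs[job_id] = asyncio.create_task(self._infer("P2", prompt))
        return job_id

    async def await_job(self, job_id: int) -> str:
        return await self._jobs.pop(job_id)


async def demo():
    client = SparkClient()
    await client.allocate_p0_priority(30)
    job_id = await client.enqueue_p2("rewrite tile 17")
    reply = await client.run_p0("hello")
    return client.interactive(), reply, await client.await_job(job_id)
```

Running `asyncio.run(demo())` exercises the whole surface: open the interactive window, queue background work, answer an interactive request, then collect the background result.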
Scheduler
The scheduler maintains learned ceilings for P0/P1/P2, predicted in‑flight token load, admission reasons, and execution readiness.
Its only job is to keep Spark continuously fed without overrunning unified memory or starving the GPU.
Grace‑Blackwell's coherent CPU/GPU memory makes this easier than on discrete GPUs, since there are no separate host‑to‑device copies to budget for.
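A rough sketch of admission gating on predicted in-flight token load is below. It assumes each priority class has a ceiling on total predicted in-flight tokens and that lower classes get lower ceilings, so background work is refused first. The names `TokenGate` and `Admission` and the ceiling values are illustrative, not the learned values or the actual scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class Admission:
    admitted: bool
    reason: str


@dataclass
class TokenGate:
    # Ceilings on total predicted in-flight tokens per class (made-up values).
    ceilings: dict = field(
        default_factory=lambda: {"P0": 32_000, "P1": 24_000, "P2": 16_000}
    )
    in_flight: int = 0

    def try_admit(self, priority: str, predicted_tokens: int) -> Admission:
        ceiling = self.ceilings[priority]
        if self.in_flight + predicted_tokens > ceiling:
            return Admission(False, f"{priority} ceiling {ceiling} would be exceeded")
        self.in_flight += predicted_tokens
        return Admission(True, "within ceiling")

    def release(self, predicted_tokens: int) -> None:
        # Called when a job finishes; never let the counter go negative.
        self.in_flight = max(0, self.in_flight - predicted_tokens)


gate = TokenGate()
first = gate.try_admit("P2", 12_000)   # fits under the P2 ceiling
second = gate.try_admit("P2", 8_000)   # 20,000 would exceed the P2 ceiling
third = gate.try_admit("P0", 8_000)    # still fits under the P0 ceiling
```

Returning a reason string with every decision is one simple way to keep the "admission reasons" visible for later inspection.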
Optimizer
The optimizer runs only in background mode and uses:
- a filtered finished‑job throughput signal (45s window)
- a 3s cadence
- a quasi‑Newton update step
- a token‑budget model
- active P2 gating
The goal is stable sustained throughput under real workloads, not chasing spikes.
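As an illustration of two of those pieces, the sketch below pairs a sliding-window throughput estimate with a one-dimensional secant step, which is the simplest quasi-Newton update. Which variable the real optimizer tunes, how it estimates gradients, and which quasi-Newton variant it uses are not stated here, so `WindowedThroughput`, `secant_step`, and their parameters are assumptions. In a loop like the one described, something like this would run once per 3 s tick.

```python
from collections import deque


class WindowedThroughput:
    """Tokens/sec from finished jobs inside a sliding time window."""

    def __init__(self, window_s: float = 45.0):
        self.window_s = window_s
        self._events = deque()  # (finish_time_s, tokens)

    def record(self, finish_time_s: float, tokens: int) -> None:
        self._events.append((finish_time_s, tokens))

    def rate(self, now_s: float) -> float:
        # Drop jobs that finished before the window started.
        while self._events and self._events[0][0] <= now_s - self.window_s:
            self._events.popleft()
        return sum(tokens for _, tokens in self._events) / self.window_s


def secant_step(x_prev, g_prev, x_cur, g_cur, max_step):
    """One clamped 1-D quasi-Newton (secant) step toward a maximum.

    x is the tuned budget; g is the caller's estimate of
    d(throughput)/d(budget) at that x.
    """
    dx = x_cur - x_prev
    curvature = (g_cur - g_prev) / dx if dx != 0 else 0.0
    if curvature < 0:
        step = -g_cur / curvature  # Newton step using the secant curvature
    else:
        # Not locally concave: fall back to a fixed-size ascent step.
        step = max_step * ((g_cur > 0) - (g_cur < 0))
    return x_cur + max(-max_step, min(max_step, step))
```

Clamping the step is what keeps a noisy gradient estimate from swinging the budget wildly, which fits the stated goal of stable throughput over chasing spikes.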
Unified Memory Observations
For anyone running Spark seriously, a few things stood out:
- Zero‑copy unified memory eliminates the starvation patterns you see on PCIe GPUs.
- Concurrency is stable when predicted token load is respected.
- Memory imports (for my memory server) don’t collapse throughput when scheduled correctly.
- Spark performs best when continuously fed; idle gaps hurt more than concurrency.
This matches NVIDIA’s documentation, but it’s interesting to see it in practice.
Memory Server Ingestion
A small fraction of Spark’s capacity is used to ingest and rewrite memory tiles for a new memory server (import → extract → rewrite → merge → store).
This runs concurrently with inference without noticeable interference when scheduled properly.
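The five stages can be pictured as a simple chain. The sketch below shows only that structure: the tile layout (a plain dict) and the body of every stage are placeholders invented for the example, not the memory server's actual logic.

```python
from typing import Dict

Tile = Dict[str, str]


def import_tile(raw: str) -> Tile:
    return {"raw": raw}


def extract(tile: Tile) -> Tile:
    # Placeholder: treat each non-empty line as one extracted item.
    items = [line.strip() for line in tile["raw"].splitlines() if line.strip()]
    return {**tile, "items": "\n".join(items)}


def rewrite(tile: Tile) -> Tile:
    # Placeholder for the real rewrite step; here it only lowercases.
    return {**tile, "text": tile["items"].lower()}


def merge(store: Dict[str, Tile], key: str, tile: Tile) -> Tile:
    # Placeholder merge: append to any text already stored under the key.
    old = store.get(key)
    if old is None:
        return tile
    return {**tile, "text": old["text"] + "\n" + tile["text"]}


def ingest(store: Dict[str, Tile], key: str, raw: str) -> Tile:
    tile = rewrite(extract(import_tile(raw)))
    store[key] = merge(store, key, tile)  # final "store" stage
    return store[key]


store: Dict[str, Tile] = {}
ingest(store, "tile-1", "Alpha\n\n  Beta ")
ingest(store, "tile-1", "Gamma")
```

Keeping each stage a small function makes it straightforward to submit the expensive ones as background jobs so they yield to interactive traffic.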
Throughput
With this orchestration layer, Spark sustains roughly 3000 tokens/sec under continuous background workloads with memory ingestion active.
This isn’t a benchmark claim — just an observation of how Spark behaves when treated as a unified‑memory inference appliance with proper scheduling.
Closing
If anyone else is building custom schedulers, unified‑memory‑aware pipelines, or front ends on top of Spark, I’d be interested in comparing notes. Spark’s architecture rewards continuous, well‑scheduled workloads more than I expected.
