NVML Support for DGX Spark Grace Blackwell Unified Memory - Community Solution

I’ve been working with the DGX Spark Grace Blackwell GB10 and ran into a significant issue: standard NVML queries fail because the GB10 uses a unified memory architecture (128GB shared between CPU and GPU) rather than a discrete GPU with a dedicated framebuffer.

Impact:

  • MAX Engine can’t detect GPU: No supported "gpu" device available
  • PyTorch/TensorFlow GPU monitoring fails
  • pynvml library returns NVML_ERROR_NOT_SUPPORTED
  • nvidia-smi shows: Driver/library version mismatch
  • DGX Dashboard telemetry broken

This affects any tool expecting standard NVML on unified memory systems.


Community Solution

I’ve developed an open-source NVML library replacement that solves this:

GitHub Repository: https://github.com/CINOAdam/nvml-unified-shim (NVML unified memory shim for NVIDIA DGX Spark Grace Blackwell GB10 - enables MAX Engine, PyTorch, and GPU monitoring)

Implementation:

  • Drop-in replacement for libnvidia-ml.so.1
  • Uses CUDA Runtime API + /proc/meminfo for unified memory queries
  • 16 core NVML functions implemented
  • Works with Python ctypes, C/C++ applications
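
For illustration, here is the kind of minimal ctypes smoke test the shim has to satisfy (my own sketch, not from the repo; the symbols are standard NVML):

import ctypes

# Load the shim exactly the way pynvml or a C client would.
nvml = ctypes.CDLL("libnvidia-ml.so.1")

class NvmlMemory(ctypes.Structure):
    # Layout of nvmlMemory_t: total/free/used, all in bytes.
    _fields_ = [("total", ctypes.c_ulonglong),
                ("free", ctypes.c_ulonglong),
                ("used", ctypes.c_ulonglong)]

assert nvml.nvmlInit_v2() == 0  # 0 == NVML_SUCCESS
handle = ctypes.c_void_p()
assert nvml.nvmlDeviceGetHandleByIndex_v2(0, ctypes.byref(handle)) == 0
mem = NvmlMemory()
assert nvml.nvmlDeviceGetMemoryInfo(handle, ctypes.byref(mem)) == 0
print(f"total={mem.total >> 20} MiB, used={mem.used >> 20} MiB")
nvml.nvmlShutdown()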

What’s Working:
✅ MAX Engine GPU detection and inference
✅ PyTorch/TensorFlow GPU monitoring
✅ pynvml library
✅ nvidia-smi wrapper
✅ DGX Dashboard telemetry

Installation: CAUTION, please only use this if you know what you are doing :-)

git clone https://github.com/CINOAdam/nvml-unified-shim.git
cd nvml-unified-shim
make -f Makefile.python
sudo make -f Makefile.python install

Verification:

python3 -c "from max.driver import Accelerator; print(Accelerator())"
# Output: Device(type=gpu,id=0) ✅
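
A pynvml smoke test along the same lines (my addition, using standard pynvml calls):

python3 -c "import pynvml; pynvml.nvmlInit(); h = pynvml.nvmlDeviceGetHandleByIndex(0); print(pynvml.nvmlDeviceGetName(h)); pynvml.nvmlShutdown()"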

Questions for NVIDIA

This is a working solution for the community, but I’d love guidance from the NVIDIA team:

  1. Official Support: Is NVIDIA planning native NVML support for unified memory architectures (GB10, GH200, GB200)?

  2. Recommended Approach: Is using CUDA Runtime + /proc/meminfo the right long-term approach, or is there a better API?

  3. Semantics: How should GPU utilization be reported on unified memory? (Currently returning 0% since traditional metrics don’t apply)

  4. Collaboration: Would NVIDIA be interested in collaborating on official support or reviewing this implementation?

Technical Details: https://github.com/CINOAdam/nvml-unified-shim/blob/main/NVIDIA_COLLABORATION.md


Hardware Tested

  • System: NVIDIA DGX Spark (Grace Blackwell GB10)
  • Memory: 128GB LPDDR5x unified
  • CUDA: 12.8 / 13.0
  • OS: Ubuntu 24.04 LTS
  • Software: MAX Engine 26.2.0, PyTorch 2.x, TensorFlow 2.x

Should work on other Grace Blackwell systems (GH200, GB200).

Looks cool! I will move this over to GB10 projects

Interesting project — the shim approach is a clever way to keep existing tooling operational.

Platforms like DGX Spark make memory behavior interesting because the CPU and GPU operate on a shared coherent pool rather than separate memory spaces.

Beyond monitoring, one area that may become useful for developers working on these systems is behavioral diagnostics — understanding how memory behaves under real workload pressure.

For example:

• where a working set stops behaving as resident
• when migration begins under heavier load
• whether memory ownership settles or oscillates during pressure

Those signals can help explain performance changes even when overall memory usage appears normal.

It would be interesting to hear whether anyone working directly with DGX Spark has explored diagnostics around residency boundaries or migration stability on the platform yet.

Great work on the NVML shim — this fills a real gap in the GB10 tooling.

I’ve been running a 4-node DGX Spark cluster (4× GB10 over 200GbE RoCE) for the past week, deploying Qwen3.5-397B-A17B at TP=4 via vLLM, and I’ve hit a related set of unified memory issues on the inference/allocation side that complement your monitoring findings.

vLLM Memory Allocation Is Broken on Unified Memory

vLLM uses torch.cuda.mem_get_info() (CUDA runtime, not NVML) to determine available GPU memory for KV cache allocation. On GB10, this reports the entire shared CPU/GPU pool as available — including evictable page cache that Linux will happily reclaim. The result:

  • --gpu-memory-utilization becomes a gate, not a cap. At 0.85 it crashes (the request exceeds profiled free memory), but values below that (0.73, 0.78) all produce identical KV cache allocations: the profiled free memory, not the configured fraction, ends up determining the allocation.
  • Docker --memory cgroup limits have no effect — CUDA unified memory allocations bypass container cgroups entirely on Grace Blackwell.
  • The over-allocation pushes system components (Ray, API server, OS) into swap, causing severe throughput degradation (37 tok/s → 1.8 tok/s in worst case).
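
The conflation is easy to demonstrate directly (a sketch, assuming a CUDA-enabled PyTorch build; torch.cuda.mem_get_info wraps cudaMemGetInfo):

import torch

# What vLLM's profiler sees: on GB10 this "free" includes evictable page cache.
free, total = torch.cuda.mem_get_info()
print(f"cudaMemGetInfo: free={free >> 30} GiB, total={total >> 30} GiB")

# What the kernel will actually hand out without reclaiming or swapping.
with open("/proc/meminfo") as f:
    meminfo = {k: int(v.split()[0]) for k, v in
               (line.split(":", 1) for line in f)}  # values are in kB
print(f"MemAvailable: {meminfo['MemAvailable'] >> 20} GiB")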

Workaround: --num-gpu-blocks-override <N> to directly control KV cache block count, bypassing the broken profiler entirely.
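
For anyone driving vLLM from Python instead of the CLI, the same override exists in the engine arguments (a hedged sketch; verify against your vLLM version, and both placeholder values below are mine, not from the deployment above):

from vllm import LLM

# Pinning the KV cache block count skips the cudaMemGetInfo-based
# profiling pass, so the allocation no longer depends on how much
# evictable page cache happens to look "free" at startup.
llm = LLM(
    model="<your-model>",          # placeholder
    tensor_parallel_size=4,
    num_gpu_blocks_override=4096,  # placeholder; size to physical RAM
)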

This affects any inference framework using cudaMemGetInfo for memory decisions on unified memory — not just vLLM.

torch.compile + CUDAGraphs = 77% Speedup on ARM

On the GB10’s Grace ARM CPU, Python/CUDA kernel launch overhead is more impactful than on x86. Enabling torch.compile with CUDAGraph replay improved decode throughput from 21 to 37 tok/s on our MoE model. This requires swap headroom (~23GB) for the one-time graph capture phase, after which everything fits in physical RAM.
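
The effect is reproducible outside vLLM with plain PyTorch (a sketch; mode="reduce-overhead" is the documented torch.compile route to CUDAGraph capture and replay):

import torch

# "reduce-overhead" wraps the compiled region in CUDA Graphs, amortizing
# per-launch CPU overhead, which is proportionally larger on Grace's ARM cores.
@torch.compile(mode="reduce-overhead")
def decode_step(x, w):
    return torch.nn.functional.silu(x @ w)

x = torch.randn(8, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")
for _ in range(3):  # the first iterations pay one-time compile + graph capture
    decode_step(x, w)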

NCCL Fabric Results (4-node, 200GbE RoCE)

For multi-node context: we achieved 23.89 GB/s busbw on 4-node AllReduce (96% of theoretical line rate). Key finding: NCCL auto-negotiation outperforms manual tuning (Simple/Ring/Tree protocol overrides) inside vLLM by 8-15%, despite manual tuning being optimal in isolated nccl-tests.
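
For concreteness, these are the kinds of overrides we benchmarked and ultimately removed (NCCL_ALGO and NCCL_PROTO are standard NCCL environment variables; leaving them unset restores auto-negotiation):

import os

# Manual pinning: optimal in isolated nccl-tests on this fabric,
# but 8-15% slower than auto-negotiation once inside vLLM.
os.environ["NCCL_ALGO"] = "Ring"     # or "Tree"
os.environ["NCCL_PROTO"] = "Simple"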

To your questions for NVIDIA:

Is NVIDIA planning native NVML support for unified memory architectures?

Based on our experience, the issue goes deeper than NVML. cudaMemGetInfo, Docker cgroups, and the entire memory partitioning model assume discrete GPU memory. There’s currently no NVreg_ parameter, nvidia-smi setting, or environment variable to control GPU/CPU memory split on GB10 — allocation is fully dynamic through the Linux kernel memory manager.

How should GPU utilization be reported on unified memory?

From the inference side, what we actually need is: total physical memory, memory currently allocated by CUDA (distinct from “available to CUDA”), and memory reserved for system/OS (non-evictable). The current cudaMemGetInfo conflates “physically free” with “available to CUDA” which are very different concepts on UMA.
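
The closest approximations available today from PyTorch (a sketch; these are standard torch.cuda counters, and note that none of them reports a non-evictable system reservation):

import torch

allocated = torch.cuda.memory_allocated()  # bytes live in tensors right now
reserved = torch.cuda.memory_reserved()    # bytes held by the caching allocator
free, total = torch.cuda.mem_get_info()    # the conflated UMA numbers

print(f"CUDA-allocated:     {allocated >> 30} GiB")
print(f"allocator-reserved: {reserved >> 30} GiB")
print(f"cudaMemGetInfo:     free={free >> 30} GiB, total={total >> 30} GiB")
# Missing entirely: "memory reserved for system/OS (non-evictable)";
# nothing in the CUDA runtime reports it, which is the gap described above.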

We’ve observed exactly this on a 4-node DGX Spark cluster running Qwen3.5-397B (199GB model, TP=4 across nodes, vLLM).

Short answer: residency boundaries are not stable, migration doesn’t settle, and the behavior is highly sensitive to allocation order during startup.

What we’ve seen in practice

Memory residency is determined by profiling timing, not configuration. vLLM profiles available memory via cudaMemGetInfo during startup to decide KV cache size. On GB10, this reports the entire unified pool minus current allocations — including evictable page cache as “free.” The same recipe on the same hardware produces different KV cache allocations on different runs depending on system state at the exact moment of profiling. We’ve seen 41.99 GiB vs 46.81 GiB KV cache from identical configurations, simply because page cache pressure differed during the profiling window.

Migration under pressure is catastrophic, not gradual. When total CUDA allocations (model weights + KV cache + CUDAGraph buffers) exceed physical RAM, the system doesn’t gracefully degrade — it falls off a cliff. We measured:

  • 0 GiB in swap: 37 tok/s (full speed)
  • 6.7 GiB in swap: 21 tok/s (43% loss)
  • 13 GiB in swap: 1.8 tok/s (95% loss)

The non-linearity comes from AllReduce synchronization across TP ranks — the slowest node (the one swapping) gates the entire cluster. Even a small amount of swap on one node destroys throughput for all nodes.

Ownership does oscillate during pressure. The head node in our Ray cluster runs model weights + KV cache + CUDAGraph buffers + Ray scheduler + API server + idle actors. During CUDAGraph capture (a one-time startup phase), memory temporarily spikes ~1-2 GiB above steady state. On unified memory with no headroom, this spike pushes pages to swap. After capture completes, those pages should return to RAM — but the kernel doesn’t proactively swap them back. They only fault back in on access, creating ongoing latency spikes until you force reclamation with swapoff -a && swapon -a.

Docker cgroups don’t help. We tried --memory=105g on the container to create an artificial residency boundary. CUDA unified memory allocations bypass container cgroups entirely — the limit has no effect on CUDA-side allocation.

Diagnostic approach that worked for us

The most useful signals were:

  • free -h on each node during inference (not just startup) — watching the Swap used column
  • Per-node swap monitoring during AllReduce-heavy phases (every decode step)
  • Comparing nvidia-smi-reported memory vs free-reported memory — the delta reveals non-CUDA consumers competing for the same physical pool
  • CUDAGraph capture phase as a stress test — if the system survives capture without swapping, steady-state will be fine
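
The swap signal is trivial to automate; a minimal per-node watcher (my own sketch, reading /proc/meminfo directly rather than shelling out to free):

import time

def swap_used_gib():
    # /proc/meminfo reports values in kB.
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.split()[0])
    return (fields["SwapTotal"] - fields["SwapFree"]) / (1 << 20)

while True:
    # Any nonzero swap on any TP rank gates the whole cluster.
    print(f"swap used: {swap_used_gib():.2f} GiB", flush=True)
    time.sleep(5)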

The fundamental issue is that there’s no API to set a residency boundary on GB10. You can’t tell the system “reserve 20 GiB for the OS and don’t let CUDA touch it.” Everything is demand-paged through the Linux kernel, and CUDA has no awareness that it’s competing with other consumers for the same physical memory.

This is a thorough breakdown — you’ve documented exactly the failure modes that make unified memory diagnostics difficult on coherent UMA platforms.

Your framing of the core issue is precise:

“physically free” and “available to CUDA” are very different concepts on UMA

That distinction motivated a memory accounting fix I submitted as a community contribution to the nvml-unified-shim project — replacing MemTotal with MemAvailable + SwapFree for memory->total on the UMA fallback path, which is closer to what the allocator actually has available: https://github.com/parallelArchitect/nvml-unified-shim
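
In rough terms, the accounting change is (a Python rendering of the patch's logic, not the actual C source):

def uma_total_bytes():
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.split()[0]) * 1024  # kB -> bytes
    # MemAvailable + SwapFree approximates what an allocator can actually
    # obtain, unlike MemTotal, which also counts non-evictable kernel pages.
    return fields["MemAvailable"] + fields["SwapFree"]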

I’ve also been working on a diagnostic tool that approaches this from the CUDA runtime side:

CUDA Unified Memory Analyzer (https://github.com/parallelArchitect/cuda-unified-memory-analyzer). It measures fault onset ratio, residency ceiling, and migration pressure empirically on live hardware. The tool detects the platform type at runtime (FULL_HARDWARE_COHERENT vs FULL_EXPLICIT) and adjusts accordingly.

GB10 validation is pending — I don’t have Spark hardware. Based on what you’ve described, a coherent UMA run would be directly informative: specifically whether the residency ceiling aligns with MemAvailable rather than MemTotal — which is the boundary your allocator is navigating.

If you’re willing to run it on a GB10 node (ARM64 build required, instructions in the README), the results would be valuable for both of us.

@adi-sonusflow — building on your fork of the analyzer, I’ve published v8.3 of the CUDA Unified Memory Analyzer: https://github.com/parallelArchitect/cuda-unified-memory-analyzer (NVIDIA GPU Unified Memory diagnostic tool: architecture-aware, measurement-based, PCIe/coherent transport detection)

Your fork — CUPTI signal interpretation, execution model changes, and patches — directly influenced the debugging approach and runtime signal analysis in this version.

This update adds:

  • GB10 / SM 12.1 platform detection

  • per-ratio CUPTI migration tracking (HtoD / DtoH / fault activity)

  • a --cupti-debug mode that emits raw Unified Memory activity records (counter kind, values, timestamps)

On discrete PCIe systems, CUPTI produces dense and consistent activity signals across all passes.

What’s still unclear is how those signals behave on coherent UMA — whether the same activity records are emitted, and how they differ under memory pressure. The debug path is there to make that visible.

Development and testing have been done on discrete PCIe hardware. I don’t have access to GB10 hardware directly. Results and observations from coherent UMA systems would help validate the signal model and further refine the tool for the community.


Didn’t NVIDIA patch this already? If not, I’m sorry: I haven’t had time to get a complete working version released; I’ve been so busy. I will drop a write-up on my blog this weekend and the full version on GitHub as soon as I have time, assuming it’s still needed.