MPS Support and Telemetry on Grace Blackwell (GB10) with Unified Memory

Hello NVIDIA Community,

I am currently working with an NVIDIA GB10 (Grace Blackwell) system in a Kubernetes environment (v1.28+) using the NVIDIA GPU Operator and driver version 580.95.

My current setup uses GPU Time-Slicing, but I am facing significant limitations regarding telemetry. Specifically, dcgm-exporter provides “mirrored” metrics (identical utilization reported for every pod) and fails to report VRAM usage (0 MB or N/A), likely due to the Unified Memory architecture of the GB10.

Before attempting a migration, I would like to confirm:

  1. Official MPS Support: Is NVIDIA MPS officially supported on the GB10 architecture? I’ve noticed it is missing from some “Supported GPUs” lists in the documentation, even though it is a Compute Capability 12.1 device.

  2. Resource Isolation: Does MPS on GB10 allow for strict memory/compute limits per pod given the shared CPU-GPU memory pool (128GB)?

  3. Monitoring: Will switching to MPS solve the “mirrored metrics” issue in DCGM, or is the telemetry for GB10 still under development?
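For context on question 2: on discrete GPUs, per-client MPS limits are normally configured through environment variables read by the MPS control daemon. A minimal sketch of that standard setup follows; whether these limits are actually honored against the GB10’s shared 128GB pool is exactly what I am asking (the specific values here are illustrative, not a tested configuration):

```shell
# Hedged sketch: standard CUDA MPS per-client limits as documented for
# discrete GPUs. Their semantics on a unified-memory GB10 are unverified.

# Cap each MPS client at 25% of the SMs.
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25

# Cap per-client device allocations: device 0 limited to 8 GiB.
export CUDA_MPS_PINNED_DEVICE_MEM_LIMIT="0=8G"

# Start the MPS control daemon only if the binary is present on this node.
if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
    nvidia-cuda-mps-control -d
else
    echo "nvidia-cuda-mps-control not found; skipping daemon start"
fi
```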

System Details:

  • GPU: NVIDIA GB10 (Grace Blackwell)

  • Memory: 128GB Unified Memory

  • Driver: 580.95

  • CUDA: 12.x
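For anyone reproducing this, the details above can be gathered on the node with nvidia-smi (a sketch assuming a standard driver install; note the memory field is precisely where the reporting gap shows up):

```shell
# Hedged sketch: query name/driver/memory via nvidia-smi if present.
# On unified-memory systems like the GB10, memory.total may come back
# as [N/A], which is the reporting gap described in this thread.
gpu_info=$(nvidia-smi --query-gpu=name,driver_version,memory.total \
    --format=csv,noheader 2>/dev/null || echo "nvidia-smi not available")
echo "$gpu_info"
```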

Any guidance or roadmap regarding monitoring Blackwell-based systems in K8s would be greatly appreciated. Thanks!

Just confirming you are using at least CUDA 12.9.0, which introduced support for CC 12.1 (Spark/GB10).

There’s a forum for these here.

Thanks for the clarification. I can confirm that I am already running CUDA 13.0 (V13.0.88) with driver 580.95.

Despite being on the latest toolchain, the issue persists: dcgm-exporter still provides mirrored metrics and fails to report VRAM usage for the GB10.

I’ve been informed in another official channel that there are “no plans to support DCGM on Spark”. Since DCGM is the standard for Kubernetes telemetry, this leaves us in a difficult position.

Is there any other official NVIDIA path or a specific NVML-based exporter that supports memory attribution for the Grace Blackwell Unified Memory architecture? We need to differentiate utilization per Pod, and currently, even with the latest CUDA 13, the hardware remains a “black box” for monitoring.

The NVML memory reporting issue on GB10 is a known gap — nvmlDeviceGetMemoryInfo returns NVML_ERROR_NOT_SUPPORTED because there’s no discrete framebuffer.

There’s a community shim that intercepts NVML calls and falls back to CUDA runtime + /proc/meminfo for unified memory systems: https://github.com/CINOAdam/nvml-unified-shim

It works as an LD_PRELOAD drop-in — no application changes needed. It’s been tested with MAX Engine and nvtop on GB10.

More context on the unified memory reporting gap: https://forums.developer.nvidia.com/t/nvml-support-for-dgx-spark-grace-blackwell-unified-memory-community-solution/358869
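For anyone evaluating the approach, the fallback such a shim relies on can be sketched in a few lines: on a unified-memory system, the host’s /proc/meminfo is the closest available proxy for “device” memory. This is an illustration of the idea only; the field names and the total/used mapping below are my assumptions, not the shim’s exact implementation:

```shell
# Hedged sketch of a unified-memory fallback: approximate "GPU" memory
# totals from /proc/meminfo, since there is no discrete framebuffer.
mem_total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
mem_avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
mem_used_kb=$((mem_total_kb - mem_avail_kb))
echo "total_kb=${mem_total_kb} used_kb=${mem_used_kb} free_kb=${mem_avail_kb}"
```

An LD_PRELOAD interposer would return values like these from its intercepted nvmlDeviceGetMemoryInfo instead of propagating NVML_ERROR_NOT_SUPPORTED.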

Hi,

Thank you so much for the detailed explanation! Confirmed: we were indeed hitting the NVML_ERROR_NOT_SUPPORTED due to the lack of a discrete framebuffer on the GB10. The nvml-unified-shim sounds like exactly what we need to bridge the reporting gap for memory.

Quick follow-up question: regarding GPU utilization (SM occupancy/load) per Pod, since we are using Time-Slicing/MPS, we often see “mirrored” metrics or aggregated load across all containers. Is there a similar shim or a specific NVML field that can reliably report the actual compute load per context/process on Grace Blackwell systems?

Thanks again for the community-driven solutions!