MPS Support and Telemetry on Grace Blackwell (GB10) with Unified Memory

Hello NVIDIA Community,

I am currently working with an NVIDIA GB10 (Grace Blackwell) system in a Kubernetes environment (v1.28+) using the NVIDIA GPU Operator and driver version 580.95.

My current setup uses GPU Time-Slicing, but I am facing significant limitations regarding telemetry. Specifically, dcgm-exporter provides “mirrored” metrics (identical utilization reported for every pod) and fails to report VRAM usage (0 MB or N/A), likely due to the Unified Memory architecture of the GB10.

Before attempting a migration, I would like to confirm:

  1. Official MPS Support: Is NVIDIA MPS officially supported on the GB10 architecture? I’ve noticed it is missing from some “Supported GPUs” lists in the documentation, even though it is a Compute Capability 12.1 device.

  2. Resource Isolation: Does MPS on GB10 allow for strict memory/compute limits per pod given the shared CPU-GPU memory pool (128GB)?

  3. Monitoring: Will switching to MPS solve the “mirrored metrics” issue in DCGM, or is the telemetry for GB10 still under development?
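For context on question 2: on discrete GPUs, per-client MPS limits are normally configured through environment variables read by the MPS control daemon. A minimal sketch of that standard setup follows; whether these limits are actually honored against the GB10’s shared 128GB pool is exactly what I am asking (the specific values here are illustrative, not a tested configuration):

```shell
# Hedged sketch: standard CUDA MPS per-client limits as documented for
# discrete GPUs. Their semantics on a unified-memory GB10 are unverified.

# Cap each MPS client at 25% of the SMs.
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25

# Cap per-client device allocations: device 0 limited to 8 GiB.
export CUDA_MPS_PINNED_DEVICE_MEM_LIMIT="0=8G"

# Start the MPS control daemon only if the binary is present on this node.
if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
    nvidia-cuda-mps-control -d
else
    echo "nvidia-cuda-mps-control not found; skipping daemon start"
fi
```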

System Details:

  • GPU: NVIDIA GB10 (Grace Blackwell)

  • Memory: 128GB Unified Memory

  • Driver: 580.95

  • CUDA: 12.x
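For anyone reproducing this, the details above can be gathered on the node with nvidia-smi (a sketch assuming a standard driver install; note the memory field is precisely where the reporting gap shows up):

```shell
# Hedged sketch: query name/driver/memory via nvidia-smi if present.
# On unified-memory systems like the GB10, memory.total may come back
# as [N/A], which is the reporting gap described in this thread.
gpu_info=$(nvidia-smi --query-gpu=name,driver_version,memory.total \
    --format=csv,noheader 2>/dev/null || echo "nvidia-smi not available")
echo "$gpu_info"
```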

Any guidance or roadmap regarding monitoring Blackwell-based systems in K8s would be greatly appreciated. Thanks!

Just confirming you are using at least CUDA 12.9.0, which introduced support for CC 12.1 (Spark/GB10).

There’s a forum for these here.

Thanks for the clarification. I can confirm that I am already running CUDA 13.0 (V13.0.88) with driver 580.95.

Despite being on the latest toolchain, the issue persists: dcgm-exporter still provides mirrored metrics and fails to report VRAM usage for the GB10.

I’ve been informed in another official channel that there are “no plans to support DCGM on Spark”. Since DCGM is the standard for Kubernetes telemetry, this leaves us in a difficult position.

Is there any other official NVIDIA path or a specific NVML-based exporter that supports memory attribution for the Grace Blackwell Unified Memory architecture? We need to differentiate utilization per Pod, and currently, even with the latest CUDA 13, the hardware remains a “black box” for monitoring.

The NVML memory reporting issue on GB10 is a known gap — nvmlDeviceGetMemoryInfo returns NVML_ERROR_NOT_SUPPORTED because there’s no discrete framebuffer.

There’s a community shim that intercepts NVML calls and falls back to CUDA runtime + /proc/meminfo for unified memory systems: https://github.com/CINOAdam/nvml-unified-shim

It works as an LD_PRELOAD drop-in — no application changes needed. It’s been tested with MAX Engine and nvtop on GB10.

More context on the unified memory reporting gap: https://forums.developer.nvidia.com/t/nvml-support-for-dgx-spark-grace-blackwell-unified-memory-community-solution/358869
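For anyone evaluating the approach, the fallback such a shim relies on can be sketched in a few lines: on a unified-memory system, the host’s /proc/meminfo is the closest available proxy for “device” memory. This is an illustration of the idea only; the field names and the total/used mapping below are my assumptions, not the shim’s exact implementation:

```shell
# Hedged sketch of a unified-memory fallback: approximate "GPU" memory
# totals from /proc/meminfo, since there is no discrete framebuffer.
mem_total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
mem_avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
mem_used_kb=$((mem_total_kb - mem_avail_kb))
echo "total_kb=${mem_total_kb} used_kb=${mem_used_kb} free_kb=${mem_avail_kb}"
```

An LD_PRELOAD interposer would return values like these from its intercepted nvmlDeviceGetMemoryInfo instead of propagating NVML_ERROR_NOT_SUPPORTED.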

Hi,

Thank you so much for the detailed explanation! Confirmed: we were indeed hitting the NVML_ERROR_NOT_SUPPORTED due to the lack of a discrete framebuffer on the GB10. The nvml-unified-shim sounds like exactly what we need to bridge the reporting gap for memory.

Quick follow-up question: regarding GPU utilization (SM occupancy/load) per Pod, since we are using Time-Slicing/MPS, we often see “mirrored” metrics or aggregated load across all containers. Is there a similar shim or a specific NVML field that can reliably report the actual compute load per context/process on Grace Blackwell systems?

Thanks again for the community-driven solutions!