I am currently working with an NVIDIA GB10 (Grace Blackwell) system in a Kubernetes environment (v1.28+) using the NVIDIA GPU Operator and driver version 580.95.
My current setup uses GPU Time-Slicing, but telemetry is significantly limited. Specifically, dcgm-exporter provides “mirrored” metrics (identical utilization reported for every pod) and fails to report VRAM usage (0 MB or N/A), likely due to the GB10’s Unified Memory architecture.
Before attempting a migration, I would like to confirm:
Official MPS Support: Is NVIDIA MPS officially supported on the GB10 architecture? I’ve noticed it is missing from some “Supported GPUs” lists in the documentation, even though it is a Compute Capability 12.1 (sm_121) Blackwell device.
Resource Isolation: Does MPS on GB10 allow strict memory/compute limits per pod given the shared 128 GB CPU-GPU memory pool? (See the sketch after this list for the limit knobs we would expect to use.)
Monitoring: Will switching to MPS solve the “mirrored metrics” issue in DCGM, or is the telemetry for GB10 still under development?
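For context on the Resource Isolation question, here is a minimal Python sketch of the documented MPS client environment variables we would hope to rely on per pod. The values are illustrative, and whether GB10’s unified memory actually honors CUDA_MPS_PINNED_DEVICE_MEM_LIMIT is exactly what I’m asking above:

```python
import os

# Sketch: per-client MPS limits, set before the CUDA runtime initializes
# in the pod's process. Values are illustrative, not recommendations.
os.environ["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = "25"   # cap this client at ~25% of SMs
os.environ["CUDA_MPS_PINNED_DEVICE_MEM_LIMIT"] = "0=8G"  # cap device 0 allocations at 8 GiB

# ... launch the CUDA workload here; it is unclear whether the memory limit
# is enforceable against a 128 GB unified CPU-GPU pool.
```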
System Details:
GPU: NVIDIA GB10 (Grace Blackwell)
Memory: 128 GB Unified Memory
Driver: 580.95
CUDA: 12.x
Any guidance or roadmap regarding monitoring Blackwell-based systems in K8s would be greatly appreciated. Thanks!
Thanks for the clarification. I can confirm that I am already running CUDA 13.0 (V13.0.88) with driver 580.95.
Despite being on the latest toolchain, the issue persists: dcgm-exporter still provides mirrored metrics and fails to report VRAM usage for the GB10.
I’ve been informed in another official channel that there are “no plans to support DCGM on Spark”. Since DCGM is the standard for Kubernetes telemetry, this leaves us in a difficult position.
Is there any other official NVIDIA path or a specific NVML-based exporter that supports memory attribution for the Grace Blackwell Unified Memory architecture? We need to differentiate utilization per Pod, and currently, even with the latest CUDA 13, the hardware remains a “black box” for monitoring.
The NVML memory reporting issue on GB10 is a known gap: nvmlDeviceGetMemoryInfo returns NVML_ERROR_NOT_SUPPORTED because there is no discrete framebuffer to query.
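You can confirm this from user space with a short pynvml probe (a sketch; requires the nvidia-ml-py package):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"used={mem.used >> 20} MiB total={mem.total >> 20} MiB")
except pynvml.NVMLError as err:
    # On GB10 this is expected to raise NVML_ERROR_NOT_SUPPORTED:
    # there is no discrete framebuffer behind this query.
    print(f"nvmlDeviceGetMemoryInfo failed: {err}")
finally:
    pynvml.nvmlShutdown()
```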
Thank you so much for the detailed explanation! Confirmed: we were indeed hitting NVML_ERROR_NOT_SUPPORTED due to the lack of a discrete framebuffer on the GB10. The nvml-unified-shim sounds like exactly what we need to bridge the memory-reporting gap.
Quick follow-up question: regarding GPU utilization (SM occupancy/load) per Pod, since we are using Time-Slicing/MPS, we often see “mirrored” metrics or aggregated load across all containers. Is there a similar shim or a specific NVML field that can reliably report the actual compute load per context/process on Grace Blackwell systems?
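For reference, this is what we currently probe (a pynvml sketch). Our working assumption, which we would love to have confirmed, is that under MPS all work gets attributed to the MPS server process, which would explain the aggregation we see:

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    # Per-PID utilization samples since the given CPU timestamp (microseconds).
    since_us = int((time.time() - 10) * 1_000_000)  # look back ~10 s
    for s in pynvml.nvmlDeviceGetProcessUtilization(handle, since_us):
        print(f"pid={s.pid} sm={s.smUtil}% mem={s.memUtil}%")
except pynvml.NVMLError as err:
    # Unclear whether this call is supported on GB10; under Time-Slicing/MPS
    # the attribution may not match per-container expectations anyway.
    print(f"per-process utilization not available: {err}")
finally:
    pynvml.nvmlShutdown()
```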