# NVIDIA Device Plugin Compatibility Issue with Blackwell GB10 GPU
## Issue Summary
The NVIDIA device plugin (v0.18.0) fails to allocate GPU resources in Kubernetes for the Blackwell GB10 GPU, even though the GPU is functional at the host level. The plugin registers successfully but hits NVML "Not Supported" errors during device discovery and health checks, preventing proper GPU pod scheduling.
## Environment Details
### Hardware
- **GPU Model**: NVIDIA Blackwell GB10
- **Host System**: DGX Spark with Blackwell GPU
- **Node**: spark2 (10.1.10.202)
### Software Versions
- **NVIDIA Driver**: 580.95.05
- **NVIDIA Container Toolkit**: Latest (via GPU Operator)
- **Kubernetes**: K3s v1.30.x
- **NVIDIA Device Plugin**: v0.18.0 (both standalone and GPU Operator)
- **Operating System**: Ubuntu 22.04 LTS
- **Architecture**: x86_64
### Host GPU Status
```bash
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   00000000:0F:01.0 Off |                    0 |
| N/A   32C    P8             16W / 300W  |      0MiB / 49152MiB   |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
```
## Problem Description
### Expected Behavior
- NVIDIA device plugin should discover the Blackwell GPU
- Node should report GPU capacity and allow pod scheduling with `nvidia.com/gpu` resource requests (an example pod spec follows this list)
- GPU workloads should run successfully in Kubernetes pods
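A minimal pod spec of the kind expected to schedule here is shown below; the name and image tag are illustrative, the key part is the `nvidia.com/gpu` limit:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test            # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-test
    # Illustrative CUDA base image; any CUDA-enabled image works
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # schedules only if the device plugin advertises this resource
```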
### Actual Behavior
- Device plugin starts and registers with kubelet
- Node reports GPU capacity (1) and allocatable (1) (verification commands are shown after this list)
- GPU pods nevertheless fail to schedule with an "Insufficient nvidia.com/gpu" error
- Plugin logs show "Not Supported" errors for NVML memory queries
- Manual device mounting works, but standard Kubernetes GPU scheduling fails
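The capacity/allocatable figures and the plugin state above can be confirmed with standard commands, for example (the label selector depends on how the plugin was deployed):
```bash
# Node-level view of the advertised GPU resource (capacity vs. allocatable)
kubectl describe node spark2 | grep -i 'nvidia.com/gpu'

# Device plugin logs; the label below matches the upstream static DaemonSet
# and may differ for a Helm or GPU Operator deployment
kubectl logs -n kube-system -l name=nvidia-device-plugin-ds --tail=100
```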
### Error Messages
#### Device Plugin Logs
```
I1023 05:24:09.975062 1 main.go:360] Retrieving plugins.
W1023 05:24:11.111677 1 devices.go:77] Ignoring error getting device memory: Not Supported
I1023 05:24:11.114856 1 server.go:197] Starting GRPC server for 'nvidia.com/gpu'
I1023 05:24:11.115622 1 server.go:141] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1023 05:24:11.117390 1 server.go:148] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I1023 05:24:11.118537 1 health.go:64] Ignoring the following XIDs for health checks: map[13:true 31:true 43:true 45:true 68:true 109:true]
```
#### Kubelet Device Plugin Manager Logs
```
I1023 05:24:11.116348 601127 server.go:158] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
E1023 05:24:11.193566 601127 client.go:90] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="nvidia.com/gpu"
```
#### Pod Scheduling Error
```
Warning FailedScheduling 8s default-scheduler 0/3 nodes are available: 1 Insufficient nvidia.com/gpu, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 1 No preemption victims found for incoming pod, 2 Preemption is not helpful for scheduling.
```
## Steps Taken to Troubleshoot
### 1. Initial Setup
- Installed NVIDIA driver 580.95.05 on host
- Verified GPU functionality with `nvidia-smi`
- Installed the NVIDIA device plugin v0.18.0 as a DaemonSet (a typical Helm-based install is sketched after this list)
- Configured privileged access and proper mounts
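For reference, a typical Helm-based install of the standalone plugin looks roughly like this; the release name and namespace are arbitrary examples:
```bash
# Add the NVIDIA device plugin Helm repository
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

# Install/upgrade the plugin pinned to v0.18.0
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.18.0
```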
### 2. Configuration Attempts
- **Discovery Strategy**: Tried `nvml` (default), `auto`
- **Device List Strategy**: Tried `envvar`, `cdi-annotations`
- **Device ID Strategy**: Tried `uuid`, `index`
- **Mounts**: Added `/dev`, `/usr/lib/x86_64-linux-gnu`, `/etc/ld.so.cache`
- **Environment Variables**: Set `NVIDIA_DRIVER_ROOT=/driver-root` and `FAIL_ON_INIT_ERROR=false` (an illustrative DaemonSet excerpt follows this list)
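These options can be wired into the plugin DaemonSet as container environment variables. The excerpt below is an illustrative sketch, assuming the plugin's documented flag-to-environment-variable naming; the values show one of the combinations tried, not a known-good configuration:
```yaml
# Excerpt from the device plugin DaemonSet container spec (illustrative)
env:
- name: DEVICE_LIST_STRATEGY
  value: "envvar"          # also tried: cdi-annotations
- name: DEVICE_ID_STRATEGY
  value: "uuid"            # also tried: index
- name: FAIL_ON_INIT_ERROR
  value: "false"
- name: NVIDIA_DRIVER_ROOT
  value: "/driver-root"
```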
### 3. Alternative Solutions Tested
- **GPU Operator**: Installed NVIDIA GPU Operator v24.x; the same errors appeared (a representative install command is sketched after this list)
- **Manual Mounting**: Created privileged pods with direct device mounts; this works
- **Different Plugin Versions**: Tested both the standalone plugin and the GPU Operator-managed plugin
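For comparison, a representative GPU Operator install that reuses the preinstalled host driver looks like this; the release name and namespace are examples, and `driver.enabled=false` assumes the host driver (580.95.05 here) is kept:
```bash
# Add the NVIDIA Helm repository and install the GPU Operator,
# reusing the driver already installed on the host
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false
```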
### 4. Root Cause Analysis
- NVML library loads successfully
- Device enumeration works (capacity shows 1)
- Memory-related NVML calls fail with "Not Supported" (a host-level check for this is sketched after this list)
- Health checks may be failing due to memory query failures
- Device plugin registers but ListAndWatch stream terminates unexpectedly
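The memory-query failure can be narrowed down outside the plugin with standard `nvidia-smi` queries; the container check assumes Docker with the NVIDIA runtime is configured on the host, and the image tag is illustrative:
```bash
# Query memory fields through NVML-backed nvidia-smi;
# on an affected platform these may print "[N/A]" or "Not Supported"
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv

# Same query from inside a container, to rule out a
# container-toolkit-specific difference
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 \
  nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
```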
## Workaround Implemented
Since standard GPU scheduling fails, manual device mounting was implemented:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  nodeSelector:
    kubernetes.io/hostname: spark2
  containers:
  - name: gpu-container
    image: nvidia/cuda:12.0-base
    securityContext:
      privileged: true
    volumeMounts:
    - name: nvidia-dev
      mountPath: /dev/nvidia0
    - name: nvidia-libs
      mountPath: /usr/lib/x86_64-linux-gnu
    # … other mounts
  volumes:
  - name: nvidia-dev
    hostPath:
      path: /dev/nvidia0
  # … other volumes
```
This provides GPU access but bypasses Kubernetes resource management.
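Once such a pod is running, GPU visibility can be checked directly, assuming the elided mounts include the driver utilities (e.g. `nvidia-smi`):
```bash
# Confirm the manually mounted device is visible inside the pod
kubectl exec gpu-workload -- nvidia-smi -L
```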
## Requested Support
Please investigate and provide:
1. **Compatibility Status**: Is Blackwell GB10 officially supported in device plugin v0.18.0?
2. **Driver Requirements**: Are there specific driver versions required for Blackwell support?
3. **Fix Timeline**: When will Blackwell NVML memory queries be supported?
4. **Workaround Guidance**: Official recommendations for Blackwell GPU scheduling in Kubernetes
## Additional Information
- **Release Notes Reference**: NVIDIA Device Plugin v0.18.0 release notes mention “Added support for NVIDIA Blackwell GPUs”
- **Driver Notes**: Driver 580.95.05 is a recent release with Blackwell support
- **Impact**: Prevents production deployment of GPU workloads on Blackwell hardware in Kubernetes
## Log Files Attached
- Device plugin logs (full startup sequence)
- Kubelet device plugin manager logs
- Pod scheduling events
- Node capacity/allocatable status
- nvidia-smi output
- Driver installation logs
Please let me know if additional diagnostic information is needed.
nvidia-blackwell-support-files.tar.gz (571.4 KB)