Hi, I’m running Kubernetes with the latest `nvidia/k8s-device-plugin:v0.17.1` on a Jetson AGX Orin node (JetPack 6.2), and I’ve hit a persistent crash when enabling the MPS Control Daemon.

I know that JetPack 6.1 added official support for MPS on Jetson, and I can confirm that MPS works correctly on my Jetson AGX Orin outside of containers: for example, I can start the MPS control daemon manually on the host with `nvidia-cuda-mps-control -d` without any issues.
However, when I deploy `nvidia/k8s-device-plugin:v0.17.1` with MPS enabled and `replicas: 4` set for MPS sharing, I see something unexpected. When I check the available GPU resources via `kubectl describe node NODE | grep "nvidia.com/gpu"`, Kubernetes reports 8 GPUs, which matches neither my actual hardware nor the expected number of MPS partitions. I was expecting to see 4 logical GPUs corresponding to the `replicas: 4` setting.
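For context, the sharing section of my plugin config looks roughly like the following (a sketch; the surrounding ConfigMap wrapping and any other options I set are omitted):

```yaml
# Device-plugin config enabling MPS sharing (sketch).
# With replicas: 4 I expect the node to advertise 4 logical
# nvidia.com/gpu resources, not the 8 I actually observe.
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
```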
In addition, when I run `kubectl get pods -n nvidia-device-plugin -o wide`, I see that some of the `nvidia-device-plugin-mps-control-daemon` pods are in a CrashLoopBackOff state.
The detailed error message from the logs of the `mps-control-daemon-ctr` container is:
E0904 09:35:53.754750 312 main.go:84] error starting plugins: error getting daemons: error building device map: error building device map from config.resources: error building GPU device map: error visiting device: error building Device: error getting device paths: error getting GPU device minor number: Not Supported
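My working hypothesis is that the failure is in device enumeration: on discrete-GPU hosts each GPU appears as a `/dev/nvidiaN` character device whose minor number the plugin can read, while on Jetson the integrated GPU is exposed through different device nodes, so the minor-number lookup may have nothing to read. A quick check I ran on the host (my own diagnostic sketch, not something from the plugin):

```shell
# On an x86 host with discrete GPUs this lists /dev/nvidia0, /dev/nvidia1, ...
# On my Jetson AGX Orin these nodes are absent and the GPU is exposed via
# nodes like /dev/nvhost-* instead, which seems consistent with the
# "error getting GPU device minor number: Not Supported" message above.
ls -l /dev/nvidia* 2>/dev/null || echo "no /dev/nvidiaN nodes present"
```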
So I’d like to ask: is it currently feasible to share GPU resources via MPS on Jetson Orin devices in Kubernetes? This approach works well on traditional x86 servers with discrete GPUs (e.g., RTX or A100 series), but I’m unsure whether the same applies to Jetson-class integrated GPUs. If it isn’t supported, is there a recommended workaround or a roadmap for enabling it?
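In case MPS itself is the blocker, would time-slicing be the recommended fallback on Jetson? Something along these lines (a sketch based on the plugin’s time-slicing config format; I haven’t verified this on Orin yet):

```yaml
# Fallback sketch: time-slicing instead of MPS. No memory/SM partitioning,
# but also no dependency on the MPS control daemon that is crashing here.
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
```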