Hello NVIDIA Community,
I am using a MIG-enabled A100 GPU on a Kubernetes cluster with the NVIDIA device plugin for Kubernetes installed. However, I am encountering an issue where the Kubernetes scheduler doesn't honour the specific MIG instance I intend to run my application on. My challenge is ensuring that a specific MIG instance is selected for a pod. Outside of Kubernetes, I can use CUDA_VISIBLE_DEVICES to specify the MIG instance like this:
CUDA_VISIBLE_DEVICES=MIG-GPU-e88cb44c-6756-fd30-cd4a-1e6da3ca88b0 ./application
However, when I request a MIG instance using resource limits in the pod YAML file, such as:
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1
Even though I set the CUDA_VISIBLE_DEVICES environment variable:
env:
  - name: CUDA_VISIBLE_DEVICES
    value: "MIG-GPU-e88cb44c-6756-fd30-cd4a-1e6da3ca88b0"
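For reference, the full pod spec I am testing with looks roughly like the sketch below. The pod name, container name, and image are just placeholders from my setup (any CUDA-enabled image should reproduce it); the MIG UUID is the one from my node:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod                  # placeholder pod name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-app             # placeholder container name
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04  # any CUDA base image
      command: ["nvidia-smi", "-L"]   # just lists the devices the pod sees
      env:
        - name: CUDA_VISIBLE_DEVICES  # the MIG UUID I want the app pinned to
          value: "MIG-GPU-e88cb44c-6756-fd30-cd4a-1e6da3ca88b0"
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1    # MIG resource advertised by the device plugin
```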
The output of 'kubectl exec -it gpu-pod -- nvidia-smi -L' shows that the pod is still assigned the first available MIG instance, without respecting the specific MIG UUID I provided. I'm not sure whether this is an issue with the NVIDIA device plugin or with how the Kubernetes scheduler handles MIG instances.
Has anyone encountered a similar issue or found a solution to ensure that the correct MIG instance is assigned to a pod in Kubernetes when using CUDA_VISIBLE_DEVICES? Any suggestions or insights would be greatly appreciated!
Thanks for your help!