How to Select a Specific MIG Instance in a Kubernetes Pod

Hello NVIDIA Community,

I am using a MIG-enabled A100 GPU on a Kubernetes cluster with the NVIDIA device plugin for Kubernetes installed. However, the Kubernetes scheduler doesn't honour the specific MIG instance I want my application to run on, and I need a way to ensure that a specific MIG instance is selected for a pod. Outside of Kubernetes, I can use CUDA_VISIBLE_DEVICES to specify the MIG instance like this:

CUDA_VISIBLE_DEVICES=MIG-GPU-e88cb44c-6756-fd30-cd4a-1e6da3ca88b0 ./application

In Kubernetes, however, I request a MIG instance using resource limits in the pod YAML file, such as:

resources:
  limits:
    nvidia.com/mig-1g.5gb: 1

I also set the CUDA_VISIBLE_DEVICES environment variable:

env:
  - name: CUDA_VISIBLE_DEVICES
    value: "MIG-GPU-e88cb44c-6756-fd30-cd4a-1e6da3ca88b0"

The output of 'kubectl exec -it gpu-pod -- nvidia-smi -L' shows that the scheduler still assigns the pod to the first available MIG instance, without respecting the specific MIG ID I provided. I'm not sure if this is an issue with the NVIDIA device plugin or if it's related to how the Kubernetes scheduler handles MIG instances.

Has anyone encountered a similar issue or found a solution to ensure that the correct MIG instance is assigned to a pod in Kubernetes when using CUDA_VISIBLE_DEVICES? Any suggestions or insights would be greatly appreciated!

Thanks for your help!

Hi everyone,

After extensive searching and testing, I resolved the problem of selecting a specific MIG instance in my Kubernetes pod by using NVIDIA_VISIBLE_DEVICES instead of CUDA_VISIBLE_DEVICES.

Difference Between Variables

  • CUDA_VISIBLE_DEVICES: This environment variable is read by the CUDA runtime inside the application process and controls which of the GPUs already visible to that process CUDA will use. It does not change which GPUs or MIG instances are exposed to a container in Kubernetes, so it cannot select an instance the container was not given.

  • NVIDIA_VISIBLE_DEVICES: This variable is read by the NVIDIA Container Toolkit when the container starts and determines which GPUs or MIG instances are made available inside the container. It is designed for containerized environments like Kubernetes and lets you specify which MIG instances should be visible to your pod.

Solution
To ensure that a specific MIG instance is selected for my application, I updated my pod's YAML file to include the NVIDIA_VISIBLE_DEVICES environment variable. By doing this, I was able to control which MIG instance my pod used, resolving the initial issue where the Kubernetes scheduler assigned the pod to the first available MIG instance regardless of the specific instance ID I provided.
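
For reference, here is a minimal sketch of that change. It assumes the fix simply mirrors the env fragment from the question with the variable name swapped; the exact YAML is not shown in this thread, so the layout and the MIG ID (copied from the question) are illustrative only.

```yaml
# Sketch only: this fragment belongs under the container entry in the pod spec.
# The MIG ID is the one from the question; replace it with an ID reported
# by `nvidia-smi -L` on the node that hosts the pod.
env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "MIG-GPU-e88cb44c-6756-fd30-cd4a-1e6da3ca88b0"
```

Running 'kubectl exec -it gpu-pod -- nvidia-smi -L' again is a quick way to check which instance the pod actually sees. One caveat, as far as I understand it: a value set directly like this is not tracked by the device plugin, so the scheduler may still hand the same MIG instance to another pod, and whether the variable is honoured can depend on how the NVIDIA Container Toolkit is configured on the node.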

I hope this helps anyone facing a similar challenge!
