MIG-GPU Support in Kubernetes

Is it possible to modify the code to support MIG GPUs in a K8s deployment?

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file (if you have one, please share it here)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

May I know which code you are referring to?

We checked some of the code inside the container folder /opt/api. Only ‘nvidia.com/gpu’ is referenced. So, I wonder if we could manually change it to ‘nvidia.com/mig-xx.xxxx’ to support MIG GPUs.
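What we have in mind is roughly the following pod spec (a sketch only; 1g.5gb is an example MIG profile name, and the actual resource name depends on how the node’s GPUs are partitioned):

    # Minimal test pod requesting one MIG slice instead of a whole GPU
    apiVersion: v1
    kind: Pod
    metadata:
      name: mig-test
    spec:
      restartPolicy: Never
      containers:
      - name: cuda-test
        image: nvcr.io/nvidia/cuda:11.6.2-base-ubuntu20.04
        command: ["nvidia-smi", "-L"]
        resources:
          limits:
            nvidia.com/mig-1g.5gb: 1   # example profile; replaces nvidia.com/gpu: 1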

May I know which container it is?

Two containers (one is xxx-api-pod and one is xxx-workflow-pod) are deployed from the same image nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-beta-api by the Helm chart. We only found that /opt/api/job_utils/executor.py contains ‘nvidia.com/gpu’. However, even after we modified it to the MIG-GPU resource label, ‘nvidia.com/gpu’ is still declared in the job created by TAO.
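For context, the pod spec of the job created by TAO still contains the whole-GPU resource, i.e. something like this (illustrative fragment):

    resources:
      limits:
        nvidia.com/gpu: 1   # still the whole-GPU resource, not a MIG profile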

So, currently, you have already set up successfully according to the Setup — TAO Toolkit 3.22.05 documentation but failed at the step described in the Deployment — TAO Toolkit 3.22.05 documentation?

The tao-toolkit-api is successfully deployed, and the two pods and the service are running normally. However, the worker node is configured to run MIG, so we cannot create any job: the job pod stays Pending due to a resource limit (i.e., “Insufficient nvidia.com/gpu”).
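This is consistent with what a fully MIG-enabled node advertises under the mixed strategy: the whole-GPU resource drops to zero and per-profile resources appear instead. An illustrative fragment of kubectl get node <name> -o yaml (the values are examples):

    status:
      allocatable:
        nvidia.com/gpu: "0"            # no whole GPUs left to schedule
        nvidia.com/mig-1g.5gb: "7"     # MIG slices advertised instead

Since the TAO-created job requests nvidia.com/gpu, the scheduler reports “Insufficient nvidia.com/gpu” and the pod stays Pending.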

We found a temporary solution.

  1. Modify the files that contain “nvidia.com/gpu” to use the MIG-GPU resource label.
  2. Perform a docker commit to save the modified container as a new image.
  3. Modify the Helm template to pull the updated image (see the sketch after this list).
  4. Perform helm uninstall and then helm install again.
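For step 3, the change is just pointing the chart at the committed image, roughly like this (the actual value names depend on the tao-toolkit-api chart, so this is only illustrative):

    # Helm values override; field names are illustrative
    image:
      repository: <your-registry>/tao-toolkit-tf
      tag: v3.22.05-beta-api-mig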

Glad to know you have a solution for it. For more reference: GPU Operator with MIG — NVIDIA Cloud Native Technologies documentation
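For reference, the MIG strategy is selected through the GPU Operator chart values. A minimal values sketch:

    # GPU Operator Helm values
    mig:
      strategy: mixed   # exposes per-profile resources such as nvidia.com/mig-1g.5gb

Note that with strategy: single the device plugin exposes MIG devices under plain nvidia.com/gpu, which may avoid patching the image at all.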