MIG-GPU Support in Kubernetes

Is it possible to modify the code to support MIG GPUs in a K8s deployment?

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file (if you have one, please share it here)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

May I know which code you are referring to?

We checked some of the code inside the container folder /opt/api. Only ‘nvidia.com/gpu’ is referenced. So, I wonder if we could manually change it to ‘nvidia.com/mig-xx.xxxx’ to support MIG GPUs.
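What we have in mind is roughly the following pod spec (a sketch only; 1g.5gb is an example MIG profile name, and the actual resource name depends on how the node’s GPUs are partitioned):

    # Minimal test pod requesting one MIG slice instead of a whole GPU
    apiVersion: v1
    kind: Pod
    metadata:
      name: mig-test
    spec:
      restartPolicy: Never
      containers:
      - name: cuda-test
        image: nvcr.io/nvidia/cuda:11.6.2-base-ubuntu20.04
        command: ["nvidia-smi", "-L"]
        resources:
          limits:
            nvidia.com/mig-1g.5gb: 1   # example profile; replaces nvidia.com/gpu: 1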

May I know which container it is?

Two containers (one is xxx-api-pod and one is xxx-workflow-pod) are deployed from the same image nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-beta-api by the Helm chart. We only found that /opt/api/job_utils/executor.py contains ‘nvidia.com/gpu’. However, even after we modified it to the MIG-GPU resource label, ‘nvidia.com/gpu’ is still declared in the job created by TAO.
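For context, the pod spec of the job created by TAO still contains the whole-GPU resource, i.e. something like this (illustrative fragment):

    resources:
      limits:
        nvidia.com/gpu: 1   # still the whole-GPU resource, not a MIG profile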

So, currently, you have already set up successfully according to the Setup — TAO Toolkit 3.22.05 documentation but failed at the step described in the Deployment — TAO Toolkit 3.22.05 documentation?

The tao-toolkit-api is successfully deployed, and the two pods and the service are running normally. However, the worker node is configured to run MIG, so we cannot create any job: the job pod stays Pending due to a resource limit (i.e., “Insufficient nvidia.com/gpu”).
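This is consistent with what a fully MIG-enabled node advertises under the mixed strategy: the whole-GPU resource drops to zero and per-profile resources appear instead. An illustrative fragment of kubectl get node <name> -o yaml (the values are examples):

    status:
      allocatable:
        nvidia.com/gpu: "0"            # no whole GPUs left to schedule
        nvidia.com/mig-1g.5gb: "7"     # MIG slices advertised instead

Since the TAO-created job requests nvidia.com/gpu, the scheduler reports “Insufficient nvidia.com/gpu” and the pod stays Pending.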

We found a temporary solution.

  1. Modify the files that contain “nvidia.com/gpu” to use the MIG-GPU resource label.
  2. Perform a docker commit to save the modified container as a new image.
  3. Modify the Helm template to pull the updated image (see the sketch after this list).
  4. Perform helm uninstall and then helm install again.
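For step 3, the change is just pointing the chart at the committed image, roughly like this (the actual value names depend on the tao-toolkit-api chart, so this is only illustrative):

    # Helm values override; field names are illustrative
    image:
      repository: <your-registry>/tao-toolkit-tf
      tag: v3.22.05-beta-api-mig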

Glad to know you have a solution for it. For more reference: GPU Operator with MIG — NVIDIA Cloud Native Technologies documentation
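For reference, the MIG strategy is selected through the GPU Operator chart values. A minimal values sketch:

    # GPU Operator Helm values
    mig:
      strategy: mixed   # exposes per-profile resources such as nvidia.com/mig-1g.5gb

Note that with strategy: single the device plugin exposes MIG devices under plain nvidia.com/gpu, which may avoid patching the image at all.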