Is it possible to modify the code to support MIG GPUs in a K8s deployment?
Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file (if you have one, please share it here)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)
May I know which code you are referring to?
We checked some code inside the container folder /opt/api. Only ‘nvidia.com/gpu’ is referenced. So we wonder whether we could manually add ‘nvidia.com/mig-xx.xxxx’ to support MIG GPUs.
May I know which container it is?
Two containers (one xxx-api-pod and one xxx-workflow-pod) are deployed from the same image nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-beta-api by the Helm chart. We only found that /opt/api/job_utils/executor.py contains ‘nvidia.com/gpu’. However, even after we modified it to the MIG-GPU label, ‘nvidia.com/gpu’ is still declared in the job created by TAO.
The tao-toolkit-api is successfully deployed, and the two pods and the service are running normally. However, the worker node is configured to run MIG, so we couldn’t create any job: the pod stays pending due to a resource limit (i.e. “Insufficient nvidia.com/gpu”).
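For context, on a MIG-enabled node the NVIDIA device plugin (with the single or mixed MIG strategy) advertises per-profile resource names rather than `nvidia.com/gpu`, so a pod has to request the MIG resource explicitly. A minimal sketch of such a request, assuming a hypothetical 1g.5gb MIG profile is available on the node (the profile name and pod name are illustrative, not taken from the TAO chart):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-example          # illustrative name
spec:
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:11.6.2-base-ubuntu20.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        # On a MIG node the scheduler matches this per-profile
        # resource, not nvidia.com/gpu
        nvidia.com/mig-1g.5gb: 1
```

A job that still requests `nvidia.com/gpu` on such a node will remain Pending with exactly the “Insufficient nvidia.com/gpu” event described above.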
We found a temporary solution:
- Modify the files that contain “nvidia.com/gpu” to use the MIG-GPU label.
- Perform a docker commit.
- Modify the Helm template to pull the updated image.
- Perform helm uninstall & install.
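The steps above can be sketched as shell commands. The container name, MIG profile, target registry, and chart path below are all placeholders/assumptions; the exact files containing the string may differ between releases:

```shell
# 1. Patch the resource name inside the running API container
#    ("tao-api" and the 1g.5gb profile are hypothetical)
docker exec tao-api sh -c \
  "grep -rl 'nvidia.com/gpu' /opt/api | xargs sed -i 's|nvidia.com/gpu|nvidia.com/mig-1g.5gb|g'"

# 2. Commit the patched container as a new image
docker commit tao-api my-registry/tao-toolkit-tf:v3.22.05-beta-api-mig

# 3. Edit the Helm template/values to reference the patched image,
#    then redeploy the release
helm uninstall tao-toolkit-api
helm install tao-toolkit-api ./chart
```

Note that this patch lives only in the committed image: pulling a fresh upstream image or upgrading the chart will reintroduce `nvidia.com/gpu`, so the steps must be repeated after each update.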
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.