Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Classification
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) v4.0.2
• Training spec file (If have, please share here) classification_tf1 train
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
When running the AutoML notebook for the 4.0.2 TAO API, only one GPU is used for the job, even though I set numGpus to 3 in tao-toolkit-api/tao-toolkit-api/values.yaml (excerpt below). What actually happens is that 4 pods are running, and 2 of them are using 1 GPU each. The notebook shows the job ID for only one of them, but both appear to be training the same experiment, since the ETA per epoch has not decreased for the experiment.
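For reference, this is the relevant change in my chart values (excerpt only, unrelated keys omitted; my understanding is that numGpus should control how many GPUs the job gets):

# tao-toolkit-api/tao-toolkit-api/values.yaml (excerpt)
# Changed before redeploying the chart; everything else left as it was.
numGpus: 3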
kubectl logs -f tao-toolkit-api-workflow-pod-55b9bfc948-5qzbq
also shows the command for only the first job. I am adding the outputs of a few commands for reference; a rough sketch of those checks follows.
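In case it helps, this is roughly how I am checking the per-pod GPU allocation (standard kubectl; nvidia.com/gpu is the resource name exposed by the NVIDIA device plugin, and <training-pod-name> is a placeholder for one of the pods listed by the first command):

# list the running pods (this is where I see the 4 pods)
kubectl get pods
# show the GPU request/limit of one of the training pods
kubectl describe pod <training-pod-name> | grep -i nvidia.com/gpu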
My question is: how do I get 4 GPUs to run the same AutoML experiment for me?