Multi GPU AutoML

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): Classification
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): v4.0.2
• Training spec file (if you have one, please share it here): classification_tf1 train
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

When running the AutoML notebook for the 4.0.2 TAO API, even though I set numGpus to 3 in tao-toolkit-api/tao-toolkit-api/values.yaml, somehow only one GPU is used per job. What is actually happening is that there are 4 pods running, and 2 of them are using 1 GPU each. The notebook shows the job ID for only one of them, but both seem to be training the same experiment, since the ETA per epoch hasn’t decreased for the experiment.

kubectl logs -f tao-toolkit-api-workflow-pod-55b9bfc948-5qzbq also shows the command only for the first job. I am adding the outputs of a few commands for reference.
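For anyone hitting the same situation, a sketch of standard kubectl commands to see which pods are running and how many GPUs each node has handed out (pod and namespace names here are assumptions; adjust to your deployment):

```shell
# List the TAO API pods and their status (namespace assumed to be "default").
kubectl get pods -n default

# Show per-node resource allocation, including nvidia.com/gpu requests,
# to confirm how many GPUs each pod actually claimed.
kubectl describe nodes | grep -A 8 "Allocated resources"

# Inspect the GPU request of a specific job pod (pod name is a placeholder).
kubectl describe pod <job-pod-name> -n default | grep -i gpu
```

This only reads cluster state and changes nothing, so it is safe to run while a job is in progress.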

My question is, how do I get 4 GPUs to run the same AutoML experiment for me?

Please refer to the solution in TAO 4.0 Multi GPU Setup Question - #4 by Morganh

So every time I want to use a different number of GPUs, I’ll have to uninstall and reinstall with a different value?

# change tao-toolkit-api/values.yaml
numGpus: [enter new value every time]

For TAO API 4.0.2, yes: when deploying, you should change the GPU parameter in the chart’s values.yaml file. Whatever number you set there is the number of GPUs that will be used.
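A minimal sketch of the redeploy cycle, assuming the chart was installed with Helm under the release name tao-toolkit-api from the local chart directory (release name, chart path, and namespace are assumptions, not confirmed by this thread):

```shell
# 1. Edit the GPU count in the chart values:
#      tao-toolkit-api/values.yaml  ->  numGpus: 4

# 2. Remove the old release and redeploy so the new value takes effect
#    (the approach described above: uninstall, then reinstall).
helm uninstall tao-toolkit-api
helm install tao-toolkit-api tao-toolkit-api/ --namespace default

# Alternatively, helm upgrade re-renders the chart with an overridden value
# without a full uninstall; whether the API service picks this up cleanly
# may depend on the chart, so treat this as an untested shortcut.
helm upgrade tao-toolkit-api tao-toolkit-api/ --set numGpus=4
```

Either way, the GPU count is fixed at deploy time for this version, which is why a redeploy is needed per change.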

For TAO API 5.0, also yes, but in addition to the above, the user can use fewer GPUs than were deployed. For each model in the notebook, we list the parameter that controls the GPU count. For example, if you have 4 GPUs, you still have to change the values.yaml, but you can deploy with 4 GPUs and use only 2 for training.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.