How to use multi-GPU training in tao-toolkit-api (K8s)

• Hardware (RTX3080*2)
• Network Type (Classification)
• api-helm-version(tao-toolkit-api-4.0.0.tgz)

Does this toolkit support multi-GPU parallel training?
If so, how do I change the GPU settings in the spec JSON file, and are there any limitations on its use?

We are using tao-toolkit-api to build an AI training cluster, so this is a very important question for our team. I would really appreciate an answer. Thanks for your reply.

Yes, multi-GPU training is supported. You can change numGpus: 1 in values.yaml to the number of GPUs you want to use.
Refer to Deployment - NVIDIA Docs
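
As a minimal sketch of that change, assuming numGpus appears as a top-level "numGpus: <n>" line in the chart's values.yaml: the helper name set_num_gpus is hypothetical (it is not part of TAO), and the release name in the usage comment is an assumption, so adjust both to your deployment.

```shell
# Hypothetical helper (not part of TAO): rewrite the numGpus line in a
# Helm values file. Assumes the key is written as "numGpus: <number>" at
# the start of a line, as quoted in the reply above.
set_num_gpus() {
    values_file="$1"
    gpu_count="$2"
    tmp_file="$(mktemp)"
    # Replace only the numGpus line; every other line is copied unchanged.
    sed "s/^numGpus:.*/numGpus: ${gpu_count}/" "$values_file" > "$tmp_file" &&
        cat "$tmp_file" > "$values_file"
    rm -f "$tmp_file"
}

# Example usage (run by hand; the release name is an assumption):
#   set_num_gpus values.yaml 2
#   helm upgrade --install tao-toolkit-api tao-toolkit-api-4.0.0.tgz -f values.yaml
```

The edited file is passed back to Helm with -f so the running deployment picks up the new value; editing values.yaml alone does not change an already installed release.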

Hi, thanks for your response. I had already changed numGpus, but I still can’t use two GPUs for parallel training.
The attachments show the GPU usage I monitored while the training task was running and the kubectl pod description.
Is there any other config I should change to use multiple GPUs for a single training task?
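
One way to confirm how many GPUs a training pod was actually granted is to look for the GPU resource line in the pod description. The sketch below assumes the cluster exposes GPUs under the resource name nvidia.com/gpu (the name used by the NVIDIA device plugin); the helper name pod_gpu_count is hypothetical.

```shell
# Hypothetical helper: read `kubectl describe pod` output on stdin and
# print the first nvidia.com/gpu value found (normally under "Limits:").
# Prints 0 when the pod description contains no GPU resource line.
pod_gpu_count() {
    awk '$1 == "nvidia.com/gpu:" { print $2; found = 1; exit }
         END { if (!found) print 0 }'
}

# Example usage (run by hand against your own pod):
#   kubectl describe pod <training pod id> | pod_gpu_count
```

If this prints 1 after numGpus was raised to 2, the training pod itself is only requesting a single GPU, which points at the job launch rather than the chart setting.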

May I know which network you ran?

Hi, here is my train.json in the PVC. I’m running classification with ResNet.

Hi @iga00257,
Could you please share the pod logs from when you launch the training, using the commands below?

kubectl logs -f <training pod id>
kubectl describe pod <training pod id>
kubectl logs -f  tao-toolkit-api-workflow-pod-69b75
kubectl describe pod tao-toolkit-api-workflow-pod-69b75

The training pod ID will look like:

Are you running getting_started_v4.0.0/notebooks/tao_api_starter_kit/api/automl/classification.ipynb or getting_started_v4.0.0/notebooks/tao_api_starter_kit/api/end2end/multiclass_classification.ipynb?

Here is the pod description:
pod_descibe (4.3 KB)

I’m running

Hi @iga00257,
Currently the end2end classification notebook does not support multi-GPU training; this feature will be enabled in TAO 5.0 soon.
You can try the AutoML notebook, which can launch training tasks with multiple GPUs and produces a better-performing model with automatically selected hyperparameters.

Thanks for your response

I tried the automl/classification.ipynb notebook.
I’m confused about why, every time I run this cell to create a job,

it creates two pods, and one of the pods always goes into Error status and then disappears,

and the monitor cell always stays on this response.

Here is the log of the pod whose status is Running.
Thank you for your patience~
pod_descibe (3.7 KB)

Hi @iga00257,
Could you please share the pod logs, especially the logs of the Error pod? You can use commands such as:

kubectl logs -f <error pod id>
kubectl logs -f tao-toolkit-api-workflow-pod
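
Because the failing pod disappears quickly, it can help to pick out its name from the pod list as soon as it enters Error status. A minimal sketch, assuming the default `kubectl get pods` column layout (NAME, READY, STATUS, RESTARTS, AGE); the helper name error_pods is hypothetical.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on
# stdin and print the names of pods whose STATUS column is "Error".
error_pods() {
    awk '$3 == "Error" { print $1 }'
}

# Example usage (run by hand while the job is starting):
#   kubectl get pods --no-headers | error_pods
#   kubectl get pods --no-headers | error_pods | while read -r pod; do
#       kubectl logs "$pod" > "$pod.log"
#   done
```

Saving the logs to a file right away keeps whatever output exists before the pod is removed.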

Hi, there are no logs in the Error pod.

TAO-workflowpod_logs.txt (1.5 KB)

How did you install tao-toolkit-api? Did you follow the bare-metal setup or deploy directly using Helm?

BTW, are there any files in /mnt/nfs_share/default-tao-toolkit-api-pvc-pvc-<pvc-id>/users/<user-id>/models/<model-id>? Please upload these files so that we can get more information about this error.

