How to use multi GPU training in tao-toolkit-api(K8s)

iga00257 · April 24, 2023, 1:55am

Please provide the following information when requesting support.

• Hardware (RTX3080*2)
• Network Type (Classification)
• api-helm-version(tao-toolkit-api-4.0.0.tgz)

I want to know whether this toolkit supports multi-GPU parallel training.
If yes, how do I change the GPU settings in the spec JSON file, and are there any limits on its use?

We are using tao-toolkit-api to develop an AI training cluster. This is a very important question for our team and I am eager to know the answer. Thanks for your reply.

Morganh · April 24, 2023, 5:15pm

Yes, it is supported. You can change the numGpus: 1 setting in values.yaml.
Refer to Deployment - NVIDIA Docs
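
A minimal sketch of that change, assuming numGpus is a top-level key in the chart's values.yaml and that the Helm release is named tao-toolkit-api (both are assumptions, so check them against your own deployment). It edits a throwaway copy of the file and only prints the redeploy command rather than running it:

```shell
# Sketch only: work on a throwaway values.yaml so nothing real is modified.
workdir="$(mktemp -d)"
values="$workdir/values.yaml"
printf 'numGpus: 1\n' > "$values"   # stand-in for the chart's real values.yaml

# Request 2 GPUs per job (the setup in this thread has 2x RTX 3080).
sed -i.bak 's/^numGpus:.*/numGpus: 2/' "$values"
grep '^numGpus:' "$values"

# Assumed release and chart names; adjust them to your deployment.
release="tao-toolkit-api"
chart="tao-toolkit-api-4.0.0.tgz"
redeploy_cmd="helm upgrade --install $release $chart -f $values"
echo "To apply: $redeploy_cmd"
```

The new value only takes effect once the chart is redeployed with the edited file.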

iga00257 · April 25, 2023, 1:19am

Hi, thanks for your response, but I had already changed numGpus and I still can’t use two GPUs for parallel training.
The attachments are the GPU usage I monitored while the training task was running and the kubectl pod description.
Is there any other config I should change to use multiple GPUs for a single training task?

Morganh · April 25, 2023, 2:02am

May I know which network you ran?

iga00257 · April 25, 2023, 5:48am

Hi, here is my train.json in the PVC. I’m running classification with ResNet.

Bin_Zhao_NV · April 25, 2023, 9:19am

hi @iga00257,
could you please share the pod logs from when you launch the training, using the commands below?

kubectl logs -f <training pod id>
kubectl describe pod <training pod id>
kubectl logs -f tao-toolkit-api-workflow-pod-69b75
kubectl describe pod tao-toolkit-api-workflow-pod-69b75

the training pod id will be like :
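
When reading the describe output, one thing worth checking is how many GPUs the pod actually requested. A small sketch, assuming the GPUs are exposed under the usual nvidia.com/gpu resource name. The sample text is a made-up excerpt in the format that kubectl describe pod prints, not the real output from this thread, and gpu_limit is a hypothetical helper:

```shell
# Extract the first nvidia.com/gpu count from `kubectl describe pod` text on stdin.
gpu_limit() {
    awk '$1 == "nvidia.com/gpu:" { print $2; exit }'
}

# Made-up excerpt, for illustration only.
sample='    Limits:
      nvidia.com/gpu:  2
    Requests:
      nvidia.com/gpu:  2'

printf '%s\n' "$sample" | gpu_limit

# Against a live cluster (needs kubectl access):
#   kubectl describe pod <training pod id> | gpu_limit
```

Comparing this number with numGpus shows whether the setting reached the pod spec.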

Bin_Zhao_NV · April 25, 2023, 9:43am

are you running the getting_started_v4.0.0/notebooks/tao_api_starter_kit/api/automl/classification.ipynb or getting_started_v4.0.0/notebooks/tao_api_starter_kit/api/end2end/multiclass_classification.ipynb?

iga00257 · April 25, 2023, 9:46am

Here is the pod description
pod_descibe (4.3 KB)

iga00257 · April 25, 2023, 9:46am

Yes
I’m running
getting_started_v4.0.0/notebooks/tao_api_starter_kit/api/end2end/multiclass_classification.ipynb

Bin_Zhao_NV · April 26, 2023, 2:20am

hi @iga00257
Currently the end2end classification notebook does not support multi-GPU training; this feature will be enabled in TAO 5.0 soon.
You can try the AutoML notebook, which can launch training tasks with multiple GPUs and gives a better-performing model with automatically selected hyperparameters.

iga00257 · April 26, 2023, 5:42am

Thanks for your response

I tried the automl/classification.ipynb notebook.
I am confused about why, every time I run this cell to create a job,

it creates two pods, and one of the pods always gets an Error status and then disappears,

and the monitor cell always stays on this response.

Here is the pod log of the pod whose status is Running.
Thank you for your patience~
pod_descibe (3.7 KB)

Bin_Zhao_NV · May 8, 2023, 2:45am

hi @iga00257
could you please share the pod logs, especially the logs of the Error pod? You can use commands such as

kubectl logs -f <error pod id>
kubectl logs -f tao-toolkit-api-workflow-pod
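
If the Error pod disappears quickly, the following sketch may help catch more information. It uses standard kubectl options (--previous for the logs of a container's previous instance, and get events sorted by time). The run wrapper and the DRY_RUN variable are illustrative helpers, and the pod id is a placeholder:

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

pod="<error pod id>"   # placeholder, replace with the real pod name

# Watch pods as the job starts (Ctrl-C to stop when actually executed).
run kubectl get pods --watch
# Logs of the previous container instance, if the pod still exists.
run kubectl logs --previous "$pod"
# Recent cluster events, oldest first.
run kubectl get events --sort-by=.lastTimestamp
```

Once a pod has been deleted its logs are no longer retrievable this way, so the events output is often the only remaining trace.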

iga00257 · May 19, 2023, 3:02am

Hi, there are no logs in the Error pod

TAO-workflowpod_logs.txt (1.5 KB)

Bin_Zhao_NV · May 19, 2023, 6:08am

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

How did you install tao-toolkit-api? Did you follow the bare-metal-setup or deploy directly using Helm?

BTW, are there any files in /mnt/nfs_share/default-tao-toolkit-api-pvc-pvc-<pvc-id>/users/<user-id>/models/<model-id>? Please upload these files so that we can get more information about this error.

system · June 20, 2023, 2:34am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
