Please provide the following information when requesting support.
• Hardware (RTX 3080 × 2)
• Network Type (Classification)
• api-helm-version (tao-toolkit-api-4.0.0.tgz)
I want to know whether this toolkit supports multi-GPU parallel training or not.
If it does, how do I change the GPU settings in the spec JSON file, and are there any limitations on using it?
We are using tao-toolkit-api to develop an AI training cluster, so this is a very important question for our team. I really need to know the answer, thanks for your reply.
Hi, thanks for your response, but I had already changed numGpus and I still can't use two GPUs to run parallel training.
The attachment shows the GPU usage I monitored while executing the training task, along with the kubectl pod description.
Is there any other config I should change to use multiple GPUs for a single training task?
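For reference, in the API starter-kit notebooks the GPU count is usually set in the training spec dictionary before it is posted back to the server. A minimal sketch of what I changed, assuming the field name is num_gpus (the key name and the surrounding spec layout here are illustrative, not the exact spec my notebook uses):

```python
# Sketch: bump the GPU count in a TAO-style training spec before submitting it.
# The key "num_gpus" and the other fields are assumptions for illustration;
# check the spec returned by your own notebook/endpoint for the exact names.

train_spec = {
    "num_gpus": 1,            # default single-GPU setting
    "num_epochs": 80,
    "batch_size_per_gpu": 64,
}

train_spec["num_gpus"] = 2    # request both RTX 3080s for one training job

print(train_spec["num_gpus"])
```

Even with this change applied, only one GPU shows activity in my monitoring.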
hi @iga00257 ,
could you please share the pod logs when you launch the training, using the commands below?
kubectl logs -f <training pod id>
kubectl describe pod <training pod id>
kubectl logs -f tao-toolkit-api-workflow-pod-69b75
kubectl describe pod tao-toolkit-api-workflow-pod-69b75
Are you running getting_started_v4.0.0/notebooks/tao_api_starter_kit/api/automl/classification.ipynb or getting_started_v4.0.0/notebooks/tao_api_starter_kit/api/end2end/multiclass_classification.ipynb?
hi @iga00257
Currently the end2end classification notebook does not support multi-GPU; this feature will be enabled in TAO 5.0 soon.
You can try the AutoML notebook, which can launch training tasks with multiple GPUs and produces a better-performing model with automatically selected hyperparameters.
There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.
BTW, are there any files in /mnt/nfs_share/default-tao-toolkit-api-pvc-pvc-<pvc-id>/users/<user-id>/models/<model-id>? If so, please upload them so that we can get more information about this error.
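If you have shell access to the node mounting the NFS share, something like this sketch can check for and bundle those files (the pvc-id, user-id, and model-id placeholders must be replaced with your actual values before running):

```shell
# Sketch: inspect and bundle the model directory from the API's PVC mount.
# Replace <pvc-id>, <user-id>, <model-id> with your actual identifiers first.
MODEL_DIR="/mnt/nfs_share/default-tao-toolkit-api-pvc-pvc-<pvc-id>/users/<user-id>/models/<model-id>"

if [ -d "$MODEL_DIR" ]; then
    ls -la "$MODEL_DIR"                        # list any logs/checkpoints present
    tar czf model-files.tgz -C "$MODEL_DIR" .  # bundle them for uploading here
else
    echo "directory not found: $MODEL_DIR"     # placeholders not substituted yet
fi
```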