TAO 4.0 Multi GPU Setup Question

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
AMD64, Ubuntu 20.04, (2) RTX 3080 TIs

• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I’ve been through an AutoML training cycle (object detection / detectnet_v2) and everything worked. By default, it used just 1 of the 2 available (3080 TI) GPUs. I want to use both GPUs. Per the FAQ, my training set is large.

You already answered my multiple-GPU question elsewhere (thanks), but I need you to dumb it down one more level. (I’ve never used Helm.) I read:
https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_deployment.html

I’ve already uninstalled the API and cluster ($ bash setup.sh uninstall)
I want to re-install with numGpus=2

I don’t have a K8s cluster (I uninstalled it). The instructions:
https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_deployment.html
say:

  • if you already have a cluster (I don’t), follow this helm process

However, if I follow the Bare-Metal Setup, there is no config file that includes the numGpus value.
Am I supposed to install the cluster with the Bare-Metal Setup and then re-install with this helm command?

Could you run
$ kubectl get pods
then get the log from the workflow pod:
$ kubectl logs -f tao-toolkit-api-workflow-pod-xxxxx-xxxxx
then check the command line used when running AutoML, for example:

detectnet_v2 train --gpus $NUM_GPUS -e /shared/users/b70e4f00-446b-43c3-affe-1ae4d4b7c1de/models/9096d9b7-0c34-412e-8fde-43d1643e55a0/3f054bfe-0258-4630-beff-354347920436/recommendation_19.kitti -r /shared/users/b70e4f00-446b-43c3-affe-1ae4d4b7c1de/models/9096d9b7-0c34-412e-8fde-43d1643e55a0/3f054bfe-0258-4630-beff-354347920436/experiment_19 -k tlt_encode > /shared/users/b70e4f00-446b-43c3-affe-1ae4d4b7c1de/models/9096d9b7-0c34-412e-8fde-43d1643e55a0/3f054bfe-0258-4630-beff-354347920436/experiment_19/log.txt 2>&1 >> /shared/users/b70e4f00-446b-43c3-affe-1ae4d4b7c1de/models/9096d9b7-0c34-412e-8fde-43d1643e55a0/3f054bfe-0258-4630-beff-354347920436/experiment_19/log.txt

Then share the experiment_xx/log.txt with us.
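
A minimal sketch of the two checks above, in case it helps. It assumes the default namespace and the tao-toolkit-api-workflow pod name prefix from the reply; nothing here is TAO-specific beyond that name.

# find the workflow pod and follow its log
kubectl get pods | grep tao-toolkit-api-workflow
kubectl logs -f $(kubectl get pods -o name | grep tao-toolkit-api-workflow)
# in the log, look for the "detectnet_v2 train --gpus ..." line to see how many GPUs were requested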

Also, can you share the result of
$ kubectl describe nodes | grep -A 6 Capacity

My question isn’t “did it use multiple GPUs” - I know it used only 1 GPU at a time for each experiment. It was a default installation and I watched

nvidia-smi -l 1

during the training loop, and it was obvious it did not use both. My question is more about not understanding the API deployment instructions (cited above).
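
For reference, a slightly more targeted variant of the same check; these are standard nvidia-smi query flags, nothing TAO-specific:

nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1

Each GPU prints one CSV row per second, which makes it easy to see whether the second GPU ever gets any load.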

The api deployment instructions read:

The following is used to deploy the TAO Toolkit API service on an existing Kubernetes cluster. You do not need these steps if you followed the previous Bare-Metal Setup or AWS EKS Setup.

I do not have an existing cluster (I uninstalled it). If I don’t have an existing cluster, I am assuming the helm command will not work. However, it also says, "you do not need these steps if you followed the previous Bare-Metal …"

Put another way, do I:
a) use the Bare-Metal installation steps (bash setup.sh install) to install the cluster & API services, AND THEN (after editing values.yaml) also run the helm command
b) create the cluster using some other method, then run the helm command

After the Bare-Metal installation steps (bash setup.sh install), it will use the default helm values. If anything in the chart has to be changed, then please run the following commands.

helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.2.tgz --username='$oauthtoken' --password=<YOUR API KEY>
mkdir tao-toolkit-api && tar -zxvf tao-toolkit-api-4.0.2.tgz -C tao-toolkit-api
# uninstall old tao-api
helm ls
helm delete tao-toolkit-api

# change tao-toolkit-api/values.yaml
numGpus: 2

# re install tao-api
helm install tao-toolkit-api tao-toolkit-api/ --namespace default
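
If it helps, one way to confirm the value took effect after the re-install; the release and namespace names are the same ones used in the commands above, and helm get values is a standard Helm command:

# should list numGpus: 2 among the user-supplied values
helm get values tao-toolkit-api --namespace default
# the API pods should return to Running
kubectl get pods --namespace default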

Thanks!
That answers my question exactly - it makes far more sense now.
And, it appears to be working correctly.
