Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
AMD64, Ubuntu 20.04, (2) RTX 3080 TIs
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file (if you have one, please share it here)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)
I’ve been through an AutoML training cycle (object detection / detectnet_v2) and everything worked. By default, it used just 1 of my 2 available (3080 TI) GPUs, but I want to use both. Per the FAQ, my training set is large enough to warrant it.
You already answered my multiple-GPU question elsewhere (thanks), but I need you to dumb it down one more level. (I’ve never used Helm.) I read:
I’ve already uninstalled the API and cluster ($ bash setup.sh uninstall)
I want to re-install with numGpu=2
I don’t have a K8s cluster (I uninstalled it), but the instructions say:
- if you already have a cluster (I don’t), follow this helm process
However, if I follow the Bare-Metal Setup, there is no config file that includes a numGpu value.
Am I supposed to install the cluster with the Bare-Metal Setup and then re-install with this helm command?
Could you please run
$kubectl get pods
then get the log from the workflow pod,
$ kubectl logs -f tao-toolkit-api-workflow-pod-xxxxx-xxxxx
then check the command line used when AutoML runs, for example,
detectnet_v2 train --gpus $NUM_GPUS -e /shared/users/b70e4f00-446b-43c3-affe-1ae4d4b7c1de/models/9096d9b7-0c34-412e-8fde-43d1643e55a0/3f054bfe-0258-4630-beff-354347920436/recommendation_19.kitti -r /shared/users/b70e4f00-446b-43c3-affe-1ae4d4b7c1de/models/9096d9b7-0c34-412e-8fde-43d1643e55a0/3f054bfe-0258-4630-beff-354347920436/experiment_19 -k tlt_encode > /shared/users/b70e4f00-446b-43c3-affe-1ae4d4b7c1de/models/9096d9b7-0c34-412e-8fde-43d1643e55a0/3f054bfe-0258-4630-beff-354347920436/experiment_19/log.txt 2>&1 >> /shared/users/b70e4f00-446b-43c3-affe-1ae4d4b7c1de/models/9096d9b7-0c34-412e-8fde-43d1643e55a0/3f054bfe-0258-4630-beff-354347920436/experiment_19/log.txt
Then share experiment_xx/log.txt with us.
Also, can you share the result of
$ kubectl describe nodes | grep -A 6 Capacity
My question isn’t “did it use multiple GPUs” - I know it used only 1 GPU at a time for each experiment. It was a default installation, and I watched
nvidia-smi -l 1
during the training loop, and it was obvious it did not. My question is more about not understanding the API deployment instructions (cited above).
The api deployment instructions read:
The following is used to deploy the TAO Toolkit API service on an existing Kubernetes cluster. You do not need these steps if you followed the previous Bare-Metal Setup or AWS EKS Setup.
I do not have an existing cluster (I uninstalled it). If I don’t have an existing cluster, I am assuming the helm command will not work. However, the docs also say, “You do not need these steps if you followed the previous Bare-Metal …”
Put another way, do I:
a) use the Bare-Metal installation steps (bash setup.sh install) to install the cluster and API services, AND THEN (edit values.yaml and) run the helm command
b) create the cluster using some other method, then run the helm command
After the Bare-Metal installation steps (bash setup.sh install), the deployment will use the default Helm values. If anything in the chart has to be changed, please run the following commands.
helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.2.tgz --username='$oauthtoken' --password=<YOUR API KEY>
mkdir tao-toolkit-api && tar -zxvf tao-toolkit-api-4.0.2.tgz -C tao-toolkit-api
# uninstall old tao-api
helm delete tao-toolkit-api
# re install tao-api
helm install tao-toolkit-api tao-toolkit-api/ --namespace default
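For anyone following along, the fetch / edit / re-install sequence above can be sketched as one script. The helm commands are commented out because they need an NGC API key and a live cluster; the numGpu key is assumed from the values file discussed earlier in this thread, and the stand-in values.yaml below only exists to demonstrate the edit step (the real file comes from the unpacked chart).

```shell
#!/bin/bash
# Sketch of the re-install sequence, assuming the chart's values.yaml
# exposes a numGpu key (as referenced earlier in this thread).
set -e

# 1) Fetch and unpack the chart (commented out: needs an NGC API key)
# helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.2.tgz \
#     --username='$oauthtoken' --password="$NGC_API_KEY"
mkdir -p tao-toolkit-api
# tar -zxvf tao-toolkit-api-4.0.2.tgz -C tao-toolkit-api

# Stand-in values.yaml so the edit step is demonstrable;
# the real file is extracted from the chart above.
printf 'numGpu: 1\n' > tao-toolkit-api/values.yaml

# 2) Edit values.yaml to request 2 GPUs per training job
sed -i 's/^numGpu:.*/numGpu: 2/' tao-toolkit-api/values.yaml

# 3) Re-deploy the chart with the edited values (commented out: needs the cluster)
# helm delete tao-toolkit-api
# helm install tao-toolkit-api tao-toolkit-api/ --namespace default
```

After the reinstall, re-running the AutoML job should show `--gpus 2` in the workflow pod's command line.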
That answers my question exactly - it makes far more sense now.
And, it appears to be working correctly.
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.