I’ve been through an AutoML training cycle (object detection / detectnet_v2) and everything worked. By default, it used just 1 of my 2 available GPUs (3080 Ti). I want to use both GPUs because, per the FAQ, my training set is large.
The docs say that if you already have a cluster (I don’t), you follow this helm process.
However, if I follow the Bare Metal Setup, there is no config file that includes a numGpus value.
Am I supposed to install the cluster with the Bare Metal Setup and then reinstall with this helm command?
Could you run $ kubectl get pods
then get the log from the workflow pod: $ kubectl logs -f tao-toolkit-api-workflow-pod-xxxxx-xxxxx
Then check the command line that was used when running AutoML, for example,
My question isn’t “did it use multiple GPUs”. I know it used only 1 GPU at a time for each experiment. It was a default installation, and I watched
nvidia-smi -l 1
during the training loop; it was obvious that it did not use both. My question is about not understanding the API deployment instructions (cited above).
The api deployment instructions read:
The following is used to deploy the TAO Toolkit API service on an existing Kubernetes cluster. You do not need these steps if you followed the previous Bare-Metal Setup or AWS EKS Setup.
I do not have an existing cluster (I uninstalled it). If I don’t have an existing cluster, I assume the helm command will not work. However, the docs also say “You do not need these steps if you followed the previous Bare-Metal …”
Put another way, do I:
a) use the Bare-Metal installation steps (bash setup.sh install) to install the cluster and API services, and then also edit values.yaml and run the helm command, or
b) create the cluster using some other method, then run the helm command?
After the Bare-Metal installation steps (bash setup.sh install), the deployment uses the default helm values. If anything in the chart has to be changed, run the following commands.
helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.2.tgz --username='$oauthtoken' --password=<YOUR API KEY>
mkdir tao-toolkit-api && tar -zxvf tao-toolkit-api-4.0.2.tgz -C tao-toolkit-api
# uninstall old tao-api
helm ls
helm delete tao-toolkit-api
# edit tao-toolkit-api/values.yaml and set:
numGpus: 2
# reinstall tao-api
helm install tao-toolkit-api tao-toolkit-api/ --namespace default
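If you would rather script the values.yaml edit than open the file by hand, something like the following might work. It is only a sketch: it runs against a throwaway stand-in file, and it assumes numGpus appears as a top-level "numGpus: <integer>" line, which should be checked against the real tao-toolkit-api/values.yaml before pointing the sed command at it.

```shell
#!/bin/sh
# Sketch: set numGpus to 2 in a Helm values file without opening an editor.
# ASSUMPTION: the file below is a hypothetical stand-in for
# tao-toolkit-api/values.yaml; only the numGpus key comes from this thread.
set -eu

workdir=$(mktemp -d)
values="$workdir/values.yaml"

# Stand-in content: a comment plus the one key we care about.
printf '# stand-in for tao-toolkit-api/values.yaml\nnumGpus: 1\n' > "$values"

# Rewrite the numGpus line, tolerating missing or extra whitespace after the colon.
# Writing to a new file and moving it avoids GNU/BSD differences in "sed -i".
sed 's/^numGpus:[[:space:]]*[0-9][0-9]*/numGpus: 2/' "$values" > "$values.new"
mv "$values.new" "$values"

# Print the resulting line so the change can be checked by eye.
grep '^numGpus:' "$values"
```

After the reinstall, running helm get values tao-toolkit-api --all --namespace default is one way to confirm which value the release actually picked up.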