After correctly starting the TAO TRAINING POD with ONE (1) GPU and confirming in the logs that the training runs correctly, I started a new training using TWO (2) GPUs.
The TAO TRAINING POD with multiple GPUs starts correctly, sets up and loads the datasets, and interprets the specs correctly, but it does not perform the training. It gets stuck in the first steps of the training:
I attach the full log: 16f1496a-7eee-44f2-9eb4-1c93e4f9720c.txt (125.8 KB)
It has been stuck at this step for at least 30 minutes or more, so I believe the process is similar to my old issue with TAO 4: [TAO API - Detectnet_v2 - Multi GPU Stuck](https://TAO API - Detectnet_v2 - Multi GPU Stuck)
Suggestions?
Since I tried to follow all the “official” steps, the drivers are only inside the Kubernetes cluster, so I don't have the possibility to monitor the GPU load (nvtop), but I suppose the behavior is the same as before.
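In case it is useful, this is a minimal sketch of how I could still check the GPU load from inside the cluster (assuming kubectl access to the namespace and that the training image ships nvidia-smi, which I have not verified):
# find the TAO training pod (the name below is a placeholder)
kubectl get pods -n default | grep -i train
# run nvidia-smi inside that pod
kubectl exec -it <tao-training-pod> -n default -- nvidia-smi
# poll it every 2 seconds as a rough nvtop substitute
watch -n 2 kubectl exec <tao-training-pod> -n default -- nvidia-smi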
Did you configure the GPUs?
Please follow the steps below.
After the Bare-Metal installation steps (bash setup.sh install), the default helm values will be used. If anything in the chart has to be changed, please run the following commands.
helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-5.0.0.tgz --username='$oauthtoken' --password=<YOUR API KEY>
mkdir tao-toolkit-api && tar -zxvf tao-toolkit-api-5.0.0.tgz -C tao-toolkit-api
# uninstall old tao-api
helm ls
helm delete tao-toolkit-api
#change tao-toolkit-api/values.yaml
maxNumGpuPerNode: 2 (Please set this to the maximum number of GPUs in your machine.)
# re install tao-api
helm install tao-toolkit-api tao-toolkit-api/ --namespace default
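If needed, you can double-check that the change was picked up after re-installation (this is only a hint, assuming kubectl access; the value name comes from values.yaml above):
# confirm the release is running with the edited values
helm get values tao-toolkit-api --all | grep -i maxNumGpuPerNode
# confirm the node advertises the expected number of GPUs to the scheduler
kubectl describe nodes | grep -i nvidia.com/gpu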
From the above, there is currently no nvidia-smi pod.
I still suspect that something is mismatched when you uninstall and install TAO-API.
When you have time, please re-install and share with us the full log.
After installation, there should be a pod named nvidia-smi-xxxx, and you can run “nvidia-smi” in it.
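For example (assuming the default namespace; the pod name suffix is a placeholder):
kubectl get pods | grep nvidia-smi
kubectl exec -it nvidia-smi-xxxx -- nvidia-smi
# if the pod has already completed, read its output instead
kubectl logs nvidia-smi-xxxx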
c4ee8702-784f-470d-bb7b-65158f76a1c9/experiment_0/
INFO:tensorflow:Graph was finalized.
2023-08-03 13:03:32,760 [TAO Toolkit] [INFO] tensorflow 240: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-08-03 13:03:35,407 [TAO Toolkit] [INFO] tensorflow 500: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-08-03 13:03:36,031 [TAO Toolkit] [INFO] tensorflow 502: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-08-03 13:03:46,873 [TAO Toolkit] [INFO] tensorflow 81: Saving checkpoints for step-0.
[2023-08-03 13:05:40.195038: W /tmp/pip-install-gz1q68mo/horovod_94237439d5f64637a082acc92487fc68/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [DistributedAdamOptimizer_Allreduce/cond/HorovodAllreduce_mul_191_0, DistributedAdamOptimizer_Allreduce/cond_1/HorovodAllreduce_mul_192_0, DistributedAdamOptimizer_Allreduce/cond_10/HorovodAllreduce_mul_201_0, DistributedAdamOptimizer_Allreduce/cond_100/HorovodAllreduce_mul_291_0, DistributedAdamOptimizer_Allreduce/cond_101/HorovodAllreduce_mul_292_0, DistributedAdamOptimizer_Allreduce/cond_102/HorovodAllreduce_mul_293_0 ...]
1: [DistributedAdamOptimizer_Allreduce/cond_134/HorovodAllreduce_mul_389_0, DistributedAdamOptimizer_Allreduce/cond_135/HorovodAllreduce_mul_390_0, DistributedAdamOptimizer_Allreduce/cond_136/HorovodAllreduce_mul_391_0, DistributedAdamOptimizer_Allreduce/cond_137/HorovodAllreduce_mul_392_0, DistributedAdamOptimizer_Allreduce/cond_138/HorovodAllreduce_mul_393_0, DistributedAdamOptimizer_Allreduce/cond_139/HorovodAllreduce_mul_394_0 ...]
[2023-08-03 13:06:40.195721: W /tmp/pip-install-gz1q68mo/horovod_94237439d5f64637a082acc92487fc68/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [DistributedAdamOptimizer_Allreduce/cond/HorovodAllreduce_mul_191_0, DistributedAdamOptimizer_Allreduce/cond_1/HorovodAllreduce_mul_192_0, DistributedAdamOptimizer_Allreduce/cond_10/HorovodAllreduce_mul_201_0, DistributedAdamOptimizer_Allreduce/cond_100/HorovodAllreduce_mul_291_0, DistributedAdamOptimizer_Allreduce/cond_101/HorovodAllreduce_mul_292_0, DistributedAdamOptimizer_Allreduce/cond_102/HorovodAllreduce_mul_293_0 ...]
1: [DistributedAdamOptimizer_Allreduce/cond/HorovodAllreduce_mul_255_0, DistributedAdamOptimizer_Allreduce/cond_1/HorovodAllreduce_mul_256_0, DistributedAdamOptimizer_Allreduce/cond_10/HorovodAllreduce_mul_265_0, DistributedAdamOptimizer_Allreduce/cond_100/HorovodAllreduce_mul_355_0, DistributedAdamOptimizer_Allreduce/cond_101/HorovodAllreduce_mul_356_0, DistributedAdamOptimizer_Allreduce/cond_102/HorovodAllreduce_mul_357_0 ...]
I have the same result with AutoML and without, and with use_amp and without.
Using multiple GPUs, it always gets stuck at the same point, at the beginning of the training process.
With only one (1) GPU, training works in both situations.
To clarify: I am using the TAO API, deployed in the Kubernetes cluster using setup.sh (Ansible) with all the suggested steps, re-installed from scratch, and modified to include multi-GPU.
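In case it helps, this is how I plan to double-check that the multi-GPU pod really requests two GPUs and that both Horovod ranks start (the pod name is a placeholder):
kubectl describe pod <tao-training-pod> | grep -i nvidia.com/gpu
kubectl logs <tao-training-pod> | grep -Ei "horovod|rank"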