TAO5 - Detectnet_v2 - MultiGPU TAO-API Dead at train start

Please provide the following information when requesting support.

• Hardware: 2x RTXA6000ADA
• Network Type: Detectnet_v2
• TLT Version: 5.0.0

After deploying, logging in, and launching the new TAO 5, I start the process with multi-GPU.

I am working with the same dataset used with the previous version, TAO 4.

Login correct - Get specs to convert datasets correctly - Convert datasets correctly - Good tfrecord generation - Create a new model_id correctly - Get specs to train correctly - Add my labels and include personal specs correctly - GOOD appearance of the train.json (better than in TAO 4, alleluia) - Launch TRAIN!!! NOTHING HAPPENS

$ kubectl logs -n gpu-operator tao-toolkit-api-app-pod-5cf97f4dc4-mt9dn

Adding trusted user: aca5e8b5-9d4c-52e0-a612-563bd387f382
172.16.1.2 - - [25/Jul/2023:12:46:46 +0000] "GET /api/v1/login/amdsZmo2YXV1dWhnaDgyYWlhc3Jkb252NWg6YmUxZmI4MTQtNGMwZi00NDk1LWJhMTUtYmM4Nzk4YjNlNWQz HTTP/1.1" 200 1167 "-" "python-requests/2.28.2"
172.16.1.2 - - [25/Jul/2023:12:48:02 +0000] "GET /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/dataset/5052fb99-fde5-4871-aabe-0f5f3b128503/specs/convert/schema HTTP/1.1" 200 3348 "-" "python-requests/2.28.2"
172.16.1.2 - - [25/Jul/2023:12:48:05 +0000] "GET /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/dataset/8d0886c8-82df-4fc0-99f8-963f964abfaa/specs/convert/schema HTTP/1.1" 200 3348 "-" "python-requests/2.28.2"
172.16.1.2 - - [25/Jul/2023:12:48:11 +0000] "POST /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/dataset/5052fb99-fde5-4871-aabe-0f5f3b128503/job HTTP/1.1" 201 117 "-" "python-requests/2.28.2"
172.16.1.2 - - [25/Jul/2023:12:49:35 +0000] "POST /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/dataset/8d0886c8-82df-4fc0-99f8-963f964abfaa/job HTTP/1.1" 201 117 "-" "python-requests/2.28.2"
172.16.1.2 - - [25/Jul/2023:12:56:25 +0000] "POST /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/model HTTP/1.1" 201 800 "-" "python-requests/2.28.2"
172.16.1.2 - - [25/Jul/2023:12:58:52 +0000] "GET /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/model/b37aba2c-aadc-43cf-a1fd-21c54f8437f3/specs/train/schema HTTP/1.1" 200 45210 "-" "python-requests/2.28.2"
172.16.1.2 - - [25/Jul/2023:13:08:58 +0000] "POST /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/model/b37aba2c-aadc-43cf-a1fd-21c54f8437f3/job HTTP/1.1" 201 117 "-" "python-requests/2.28.2"
$ kubectl logs -n gpu-operator tao-toolkit-api-workflow-pod-679984675f-v8k9g

NGC CLI 3.23.0
detectnet_v2 dataset_convert --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/afdf3fee-58a3-4bf5-8628-c78960eadf10/ --output_filename=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/tfrecords/tfrecords --verbose --dataset_export_spec=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/specs/afdf3fee-58a3-4bf5-8628-c78960eadf10.protobuf  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/logs/afdf3fee-58a3-4bf5-8628-c78960eadf10.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/logs/afdf3fee-58a3-4bf5-8628-c78960eadf10.txt; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/afdf3fee-58a3-4bf5-8628-c78960eadf10/ -type d | xargs chmod 777; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/afdf3fee-58a3-4bf5-8628-c78960eadf10/ -type f | xargs chmod 666 /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/afdf3fee-58a3-4bf5-8628-c78960eadf10/status.json
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
Job created afdf3fee-58a3-4bf5-8628-c78960eadf10
Post running
Toolkit status for afdf3fee-58a3-4bf5-8628-c78960eadf10 is SUCCESS
Job Done: afdf3fee-58a3-4bf5-8628-c78960eadf10 Final status: Done
detectnet_v2 dataset_convert --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/8d0886c8-82df-4fc0-99f8-963f964abfaa/379207ef-d69a-469f-8e9d-e3963f645f04/ --output_filename=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/8d0886c8-82df-4fc0-99f8-963f964abfaa/tfrecords/tfrecords --verbose --dataset_export_spec=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/8d0886c8-82df-4fc0-99f8-963f964abfaa/specs/379207ef-d69a-469f-8e9d-e3963f645f04.protobuf  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/8d0886c8-82df-4fc0-99f8-963f964abfaa/logs/379207ef-d69a-469f-8e9d-e3963f645f04.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/8d0886c8-82df-4fc0-99f8-963f964abfaa/logs/379207ef-d69a-469f-8e9d-e3963f645f04.txt; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/8d0886c8-82df-4fc0-99f8-963f964abfaa/379207ef-d69a-469f-8e9d-e3963f645f04/ -type d | xargs chmod 777; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/8d0886c8-82df-4fc0-99f8-963f964abfaa/379207ef-d69a-469f-8e9d-e3963f645f04/ -type f | xargs chmod 666 /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/8d0886c8-82df-4fc0-99f8-963f964abfaa/379207ef-d69a-469f-8e9d-e3963f645f04/status.json
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
Job created 379207ef-d69a-469f-8e9d-e3963f645f04
Post running
Toolkit status for 379207ef-d69a-469f-8e9d-e3963f645f04 is SUCCESS
Job Done: 379207ef-d69a-469f-8e9d-e3963f645f04 Final status: Done

Everything always looks normal. After launching the train process from the notebook, I get the response with the expected UUID of the train job:

tao-client detectnet-v2 model-train --id b37aba2c-aadc-43cf-a1fd-21c54f8437f3
c77fa0b5-971e-4774-b591-0f787c51373b
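For reference, the job launched above can also be addressed directly over the REST API visible in the access logs. The sketch below only builds the job URL from the IDs reported in this thread; the base URL and the per-job GET path (`.../model/<model_id>/job/<job_id>`) are assumptions for illustration (the logs only show the POST to `.../job`):

```python
# Minimal sketch: build and fetch a TAO API job-status URL. The path layout
# is an assumption modeled on the access log above; base_url is illustrative.
import json
import urllib.request

def job_url(base_url: str, user_id: str, model_id: str, job_id: str) -> str:
    """Build the (assumed) status URL for a model train job."""
    return f"{base_url}/api/v1/user/{user_id}/model/{model_id}/job/{job_id}"

def get_job_status(base_url, user_id, model_id, job_id, timeout=10):
    """GET the job resource and return its parsed JSON body (not run here)."""
    with urllib.request.urlopen(
        job_url(base_url, user_id, model_id, job_id), timeout=timeout
    ) as resp:
        return json.loads(resp.read())

# URL for the train job reported in this thread:
url = job_url(
    "http://10.1.1.10:31951",
    "aca5e8b5-9d4c-52e0-a612-563bd387f382",
    "b37aba2c-aadc-43cf-a1fd-21c54f8437f3",
    "c77fa0b5-971e-4774-b591-0f787c51373b",
)
```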

Nothing happens. The api-app-pod registers the POST correctly:

172.16.1.2 - - [25/Jul/2023:13:08:58 +0000] "POST /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/model/b37aba2c-aadc-43cf-a1fd-21c54f8437f3/job HTTP/1.1" 201 117 "-" "python-requests/2.28.2"

But at this point the whole TAO cluster is dead.
The TAO API stops responding to POST/GET requests.
And the pod loses its READY status:
gpu-operator tao-toolkit-api-app-pod-5cf97f4dc4-mt9dn 0/1 Running

Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 

Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  37m                     default-scheduler  Successfully assigned gpu-operator/tao-toolkit-api-app-pod-5cf97f4dc4-mt9dn to azken
  Normal   Pulling    37m                     kubelet            Pulling image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api"
  Normal   Pulled     37m                     kubelet            Successfully pulled image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api" in 3.010865548s (6.798282193s including waiting)
  Normal   Created    37m                     kubelet            Created container tao-toolkit-api-app
  Normal   Started    37m                     kubelet            Started container tao-toolkit-api-app
  Warning  Unhealthy  37m (x2 over 37m)       kubelet            Readiness probe failed: Get "http://192.168.99.71:8000/api/v1/health/readiness": dial tcp 192.168.99.71:8000: connect: connection refused
  Warning  Unhealthy  37m (x2 over 37m)       kubelet            Liveness probe failed: Get "http://192.168.99.71:8000/api/v1/health/liveness": dial tcp 192.168.99.71:8000: connect: connection refused
  Warning  Unhealthy  2m10s (x49 over 9m10s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 400

To get out of this state I need to helm uninstall and reinstall again.
If I try to get more information from the tao-toolkit-api, it responds with the following:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='10.1.1.10', port=31951): Max retries exceeded with url: /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/model/b37aba2c-aadc-43cf-a1fd-21c54f8437f3/job (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2adb6af7c0>: Failed to establish a new connection: [Errno 111] Connection refused'))

Testing with 1 GPU is still pending on my side.

Could you check $ kubectl get services ingress-nginx-controller?

What do you want to check?

$ kubectl get services -A
NAMESPACE          NAME                                         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
...
gpu-operator       tao-toolkit-api-jupyterlab-service           NodePort    10.96.10.67      <none>        8888:31952/TCP               17h
gpu-operator       tao-toolkit-api-service                      NodePort    10.108.13.110    <none>        8000:31951/TCP               17h
ingress-nginx      ingress-nginx-controller                     NodePort    10.101.69.80     <none>        80:32080/TCP,443:32443/TCP   28d
...
$ kubectl get pods -A
NAMESPACE          NAME                                                         READY   STATUS    RESTARTS       AGE
...
gpu-operator       tao-toolkit-api-app-pod-5cf97f4dc4-mt9dn                     0/1     Running   0              17h
gpu-operator       tao-toolkit-api-jupyterlab-pod-c54875474-mw45f               1/1     Running   0              17h
gpu-operator       tao-toolkit-api-workflow-pod-679984675f-v8k9g                1/1     Running   0              17h
ingress-nginx      ingress-nginx-controller-b4c5cd875-4lm6p                     1/1     Running   0              17h
...

Could you check the node_addr and node_port is set correctly?

# Define the node_addr and port number
node_addr = "<ip_address>" # FIXME2 example: 10.137.149.22
node_port = "<port_number>" # FIXME3 example: 32334
# In host machine, node ip_address and port number can be obtained as follows,
# ip_address: hostname -i
# port_number: kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'
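The same port lookup can be done by parsing `kubectl get service ingress-nginx-controller -o json` instead of jsonpath; a minimal sketch, assuming the standard Kubernetes Service JSON shape (the sample dict below mirrors the ingress-nginx-controller service listed in this thread, 80:32080/TCP,443:32443/TCP):

```python
# Extract the first nodePort from a Kubernetes Service object, i.e. the value
# that `kubectl get service ... -o jsonpath='{.spec.ports[0].nodePort}'` prints.
def first_node_port(service: dict) -> int:
    """Return spec.ports[0].nodePort of a Service object."""
    return service["spec"]["ports"][0]["nodePort"]

# Illustrative sample shaped like the ingress-nginx-controller service above.
sample_service = {
    "spec": {
        "type": "NodePort",
        "ports": [
            {"port": 80, "nodePort": 32080, "protocol": "TCP"},
            {"port": 443, "nodePort": 32443, "protocol": "TCP"},
        ],
    }
}

node_port = first_node_port(sample_service)  # 32080
```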

Are you really asking me this?

Did you read the post?

LOGIN - GENERATE TFRECORDS - CREATE MODELS - LAUNCH THE TRAIN → ALL CORRECT

I already sent you the POSTs/GETs received by the pod…

I will run the notebook below with TAO API 5.0 to check whether there is a similar issue.
tao_api_starter_kit/client/object_detection.ipynb

I’m using the same notebook as in the other post (sent by PM):

OK, will use the same notebook you shared. (detectnet_v2_tao5_clear.html)

Same behaviour with 1 GPU configured. Am I missing something? Can you share the doc draft by PM?

values.yaml
maxNumGpuPerNode: 1

train.json
gpus: 1

After restarting the pods, I tried again…

I sent the dataset-convert action 3 consecutive times without any response. After that, the liveness probe died.

$ kubectl logs -n gpu-operator tao-toolkit-api-app-pod-786dcbc8d5-fslvk
172.16.1.2 - - [26/Jul/2023:09:46:20 +0000] "GET /api/v1/login/amdsZmo2YXV1dWhnaDgyYWlhc3Jkb252NWg6YmUxZmI4MTQtNGMwZi00NDk1LWJhMTUtYmM4Nzk4YjNlNWQz HTTP/1.1" 200 1167 "-" "python-requests/2.28.2"
172.16.1.2 - - [26/Jul/2023:09:46:25 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:46:25 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:46:35 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:46:35 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:46:39 +0000] "POST /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/dataset/5052fb99-fde5-4871-aabe-0f5f3b128503/job HTTP/1.1" 201 117 "-" "python-requests/2.28.2"
172.16.1.2 - - [26/Jul/2023:09:46:45 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:46:45 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:46:55 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:46:55 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:47:05 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:47:05 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:47:15 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:47:15 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:47:19 +0000] "POST /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/dataset/5052fb99-fde5-4871-aabe-0f5f3b128503/job HTTP/1.1" 201 117 "-" "python-requests/2.28.2"
172.16.1.2 - - [26/Jul/2023:09:47:25 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:47:25 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:47:35 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:47:35 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:47:45 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:47:45 +0000] "GET /api/v1/health/readiness HTTP/1.1" 400 87 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:47:51 +0000] "POST /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/dataset/5052fb99-fde5-4871-aabe-0f5f3b128503/job HTTP/1.1" 201 117 "-" "python-requests/2.28.2"
172.16.1.2 - - [26/Jul/2023:09:47:55 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:47:55 +0000] "GET /api/v1/health/readiness HTTP/1.1" 400 87 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:48:05 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:48:05 +0000] "GET /api/v1/health/readiness HTTP/1.1" 400 87 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:48:05 +0000] "GET /api/v1/health/readiness HTTP/1.1" 400 87 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:48:15 +0000] "GET /api/v1/health/liveness HTTP/1.1" 201 80 "-" "kube-probe/1.27"
172.16.1.2 - - [26/Jul/2023:09:48:15 +0000] "GET /api/v1/health/readiness HTTP/1.1" 400 87 "-" "kube-probe/1.27"
$ kubectl logs -n gpu-operator tao-toolkit-api-workflow-pod-7796b4d89b-t2s84
NGC CLI 3.23.0
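An access-log excerpt like the one above can be scanned programmatically to pinpoint the moment the readiness probe first started answering 400. A minimal sketch, assuming the combined-log format shown (the two sample lines are copied from the log above; the regex is illustrative):

```python
import re

# Match the timestamp, request line, and status code of a combined-format
# access-log entry, as emitted by the tao-toolkit-api app pod.
LOG_RE = re.compile(r'\[([^\]]+)\] "(\w+) (\S+) [^"]*" (\d{3})')

def first_readiness_failure(lines):
    """Return (timestamp, status) of the first readiness probe answered >= 400."""
    for line in lines:
        m = LOG_RE.search(line)
        if not m:
            continue
        ts, _method, path, status = m.group(1), m.group(2), m.group(3), int(m.group(4))
        if path.endswith("/health/readiness") and status >= 400:
            return ts, status
    return None

sample = [
    '172.16.1.2 - - [26/Jul/2023:09:46:55 +0000] "GET /api/v1/health/readiness HTTP/1.1" 201 80 "-" "kube-probe/1.27"',
    '172.16.1.2 - - [26/Jul/2023:09:47:45 +0000] "GET /api/v1/health/readiness HTTP/1.1" 400 87 "-" "kube-probe/1.27"',
]

result = first_readiness_failure(sample)  # ('26/Jul/2023:09:47:45 +0000', 400)
```

Run against the full log, this confirms the readiness endpoint degrades about a minute after the first repeated dataset-convert POST, while liveness keeps returning 201.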

Before installing TAO 5.0, did you use the command below to uninstall 4.0?
$ bash setup.sh uninstall

Morning Morganh, I installed TAO using the Helm chart.
With Ansible I always have trouble during the installation process.
I do have residual containers on the server, but in theory that is the advantage of container images… they should not conflict with each other.

Hi,
I cannot reproduce the failure at train start. I just installed TAO 5.0 and launched the same notebook (tao_api_starter_kit/client/object_detection.ipynb). It is currently running experiment_0 without error.

Could you check if you can run the nvidia-smi pod successfully?
For example, $ kubectl exec nvidia-smi-x11-0011 -- nvidia-smi

Also, we recommend running $ bash setup.sh uninstall to clean up the old cluster and avoid unexpected conflicts.

Thanks Morganh,

$ kubectl exec -n gpu-operator nvidia-smi-azken -- nvidia-smi
Thu Jul 27 10:42:29 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  Off  | 00000000:21:00.0 Off |                  Off |
| 30%   32C    P8    23W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 6000...  Off  | 00000000:22:00.0 Off |                  Off |
| 30%   36C    P8    28W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


I will try to do a clean uninstall and test again…

Also, please check with
$ kubectl get nodes
$ kubectl describe node <the_node_name>

Well, I tried the uninstall from the TAO bare-metal installer… After the chaos it generated on the server (everything NVIDIA-related was uninstalled), I gave the Ansible installer a second chance. To my surprise, it worked and installed without extra problems.

But the behavior is exactly the same.

I don’t understand anything.

Please check with the commands below to find some hints.
$ kubectl get nodes
$ kubectl describe node <the_node_name>

$ kubectl get nodes
NAME    STATUS   ROLES                  AGE    VERSION
azken   Ready    control-plane,master   133m   v1.23.5
$ kubectl describe node azken 
Name:               azken
Roles:              control-plane,master
Labels:             accelerator=
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.CETSS=true
                    feature.node.kubernetes.io/cpu-cpuid.CLZERO=true
                    feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
                    feature.node.kubernetes.io/cpu-cpuid.CPBOOST=true
                    feature.node.kubernetes.io/cpu-cpuid.EFER_LMSLE_UNS=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.FP256=true
                    feature.node.kubernetes.io/cpu-cpuid.FSRM=true
                    feature.node.kubernetes.io/cpu-cpuid.FXSR=true
                    feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBPB=true
                    feature.node.kubernetes.io/cpu-cpuid.IBRS=true
                    feature.node.kubernetes.io/cpu-cpuid.IBRS_PREFERRED=true
                    feature.node.kubernetes.io/cpu-cpuid.IBRS_PROVIDES_SMP=true
                    feature.node.kubernetes.io/cpu-cpuid.IBS=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSFFV=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK=true
                    feature.node.kubernetes.io/cpu-cpuid.IBS_FETCH_CTLX=true
                    feature.node.kubernetes.io/cpu-cpuid.IBS_OPFUSE=true
                    feature.node.kubernetes.io/cpu-cpuid.IBS_PREVENTHOST=true
                    feature.node.kubernetes.io/cpu-cpuid.INT_WBINVD=true
                    feature.node.kubernetes.io/cpu-cpuid.INVLPGB=true
                    feature.node.kubernetes.io/cpu-cpuid.LAHF=true
                    feature.node.kubernetes.io/cpu-cpuid.LBRVIRT=true
                    feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW=true
                    feature.node.kubernetes.io/cpu-cpuid.MCOMMIT=true
                    feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
                    feature.node.kubernetes.io/cpu-cpuid.MOVU=true
                    feature.node.kubernetes.io/cpu-cpuid.MSRIRC=true
                    feature.node.kubernetes.io/cpu-cpuid.MSR_PAGEFLUSH=true
                    feature.node.kubernetes.io/cpu-cpuid.NRIPS=true
                    feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
                    feature.node.kubernetes.io/cpu-cpuid.PPIN=true
                    feature.node.kubernetes.io/cpu-cpuid.PSFD=true
                    feature.node.kubernetes.io/cpu-cpuid.RDPRU=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_64BIT=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_ALTERNATIVE=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_DEBUGSWAP=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_ES=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_RESTRICTED=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_SNP=true
                    feature.node.kubernetes.io/cpu-cpuid.SHA=true
                    feature.node.kubernetes.io/cpu-cpuid.SME=true
                    feature.node.kubernetes.io/cpu-cpuid.SME_COHERENT=true
                    feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD=true
                    feature.node.kubernetes.io/cpu-cpuid.SSE4A=true
                    feature.node.kubernetes.io/cpu-cpuid.STIBP=true
                    feature.node.kubernetes.io/cpu-cpuid.STIBP_ALWAYSON=true
                    feature.node.kubernetes.io/cpu-cpuid.SUCCOR=true
                    feature.node.kubernetes.io/cpu-cpuid.SVM=true
                    feature.node.kubernetes.io/cpu-cpuid.SVMDA=true
                    feature.node.kubernetes.io/cpu-cpuid.SVMFBASID=true
                    feature.node.kubernetes.io/cpu-cpuid.SVML=true
                    feature.node.kubernetes.io/cpu-cpuid.SVMNP=true
                    feature.node.kubernetes.io/cpu-cpuid.SVMPF=true
                    feature.node.kubernetes.io/cpu-cpuid.SVMPFT=true
                    feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
                    feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
                    feature.node.kubernetes.io/cpu-cpuid.TLB_FLUSH_NESTED=true
                    feature.node.kubernetes.io/cpu-cpuid.TOPEXT=true
                    feature.node.kubernetes.io/cpu-cpuid.TSCRATEMSR=true
                    feature.node.kubernetes.io/cpu-cpuid.VAES=true
                    feature.node.kubernetes.io/cpu-cpuid.VMCBCLEAN=true
                    feature.node.kubernetes.io/cpu-cpuid.VMPL=true
                    feature.node.kubernetes.io/cpu-cpuid.VMSA_REGPROT=true
                    feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ=true
                    feature.node.kubernetes.io/cpu-cpuid.VTE=true
                    feature.node.kubernetes.io/cpu-cpuid.WBNOINVD=true
                    feature.node.kubernetes.io/cpu-cpuid.X87=true
                    feature.node.kubernetes.io/cpu-cpuid.XGETBV1=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/cpu-model.family=25
                    feature.node.kubernetes.io/cpu-model.id=8
                    feature.node.kubernetes.io/cpu-model.vendor_id=AMD
                    feature.node.kubernetes.io/cpu-rdt.RDTCMT=true
                    feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMBM=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMON=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
                    feature.node.kubernetes.io/kernel-version.full=5.4.0-155-generic
                    feature.node.kubernetes.io/kernel-version.major=5
                    feature.node.kubernetes.io/kernel-version.minor=4
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/network-sriov.capable=true
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-1a03.present=true
                    feature.node.kubernetes.io/pci-8086.present=true
                    feature.node.kubernetes.io/pci-8086.sriov.capable=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=ubuntu
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=20.04
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=20
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
                    feature.node.kubernetes.io/usb-ef_0414_f000.present=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=azken
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node-role.kubernetes.io/master=
                    node.kubernetes.io/exclude-from-external-load-balancers=
                    nvidia.com/cuda.driver.major=535
                    nvidia.com/cuda.driver.minor=54
                    nvidia.com/cuda.driver.rev=03
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=2
                    nvidia.com/gfd.timestamp=1690539195
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=9
                    nvidia.com/gpu.count=2
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=MC62-G40-00
                    nvidia.com/gpu.memory=49140
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-RTX-6000-Ada-Generation
                    nvidia.com/gpu.replicas=1
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock
                    nfd.node.kubernetes.io/extended-resources: 
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.CETSS,cpu-cpuid.CLZERO,cpu-cpuid.CMPXCHG8,cpu-cpuid.CPBOOST,cpu-cpuid...
                    nfd.node.kubernetes.io/master.version: v0.12.1
                    nfd.node.kubernetes.io/worker.version: v0.12.1
                    node.alpha.kubernetes.io/ttl: 0
                    nvidia.com/gpu-driver-upgrade-enabled: true
                    projectcalico.org/IPv4Address: 172.16.1.2/22
                    projectcalico.org/IPv4IPIPTunnelAddr: 192.168.35.64
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 28 Jul 2023 10:13:30 +0200
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  azken
  AcquireTime:     <unset>
  RenewTime:       Fri, 28 Jul 2023 12:27:05 +0200
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Fri, 28 Jul 2023 12:12:42 +0200   Fri, 28 Jul 2023 12:12:42 +0200   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Fri, 28 Jul 2023 12:23:29 +0200   Fri, 28 Jul 2023 10:13:29 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Fri, 28 Jul 2023 12:23:29 +0200   Fri, 28 Jul 2023 10:13:29 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Fri, 28 Jul 2023 12:23:29 +0200   Fri, 28 Jul 2023 10:13:29 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Fri, 28 Jul 2023 12:23:29 +0200   Fri, 28 Jul 2023 10:13:33 +0200   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.1.1.10
  Hostname:    azken
Capacity:
  cpu:                64
  ephemeral-storage:  1919479120Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263810844Ki
  nvidia.com/gpu:     2
  pods:               110
Allocatable:
  cpu:                64
  ephemeral-storage:  1768991954064
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263708444Ki
  nvidia.com/gpu:     2
  pods:               110
System Info:
  Machine ID:                 52565aa5cd9c4701ad72a87b4661ce44
  System UUID:                61df0000-9855-11ed-8000-74563c0fe5fe
  Boot ID:                    e8ebacdf-3360-4e1f-acfb-32cb18103a99
  Kernel Version:             5.4.0-155-generic
  OS Image:                   Ubuntu 20.04.6 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.2
  Kubelet Version:            v1.23.5
  Kube-Proxy Version:         v1.23.5
PodCIDR:                      192.168.32.0/24
PodCIDRs:                     192.168.32.0/24
Non-terminated Pods:          (22 in total)
  Namespace                   Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                               ------------  ----------  ---------------  -------------  ---
  default                     ingress-nginx-controller-5ff6555d5d-knttq                          100m (0%)     0 (0%)      90Mi (0%)        0 (0%)         38m
  default                     nfs-subdir-external-provisioner-5886b45866-6jhj2                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         38m
  default                     tao-toolkit-api-app-pod-55c5d88d86-b5xfs                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         38m
  default                     tao-toolkit-api-jupyterlab-pod-5db94dd6cc-jvlsb                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         38m
  default                     tao-toolkit-api-workflow-pod-55db5b9bf9-b567x                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         38m
  kube-system                 calico-kube-controllers-7f76d48f74-pp9pl                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         38m
  kube-system                 calico-node-vpnm4                                                  250m (0%)     0 (0%)      0 (0%)           0 (0%)         133m
  kube-system                 coredns-64897985d-fjx2d                                            100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     38m
  kube-system                 coredns-64897985d-w97g6                                            100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     38m
  kube-system                 etcd-azken                                                         100m (0%)     0 (0%)      100Mi (0%)       0 (0%)         133m
  kube-system                 kube-apiserver-azken                                               250m (0%)     0 (0%)      0 (0%)           0 (0%)         133m
  kube-system                 kube-controller-manager-azken                                      200m (0%)     0 (0%)      0 (0%)           0 (0%)         133m
  kube-system                 kube-proxy-txqxm                                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         133m
  kube-system                 kube-scheduler-azken                                               100m (0%)     0 (0%)      0 (0%)           0 (0%)         133m
  nvidia-gpu-operator         gpu-feature-discovery-bfsp9                                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         132m
  nvidia-gpu-operator         gpu-operator-1690532037-node-feature-discovery-master-7f8brg27g    0 (0%)        0 (0%)      0 (0%)           0 (0%)         38m
  nvidia-gpu-operator         gpu-operator-1690532037-node-feature-discovery-worker-g9qbp        0 (0%)        0 (0%)      0 (0%)           0 (0%)         132m
  nvidia-gpu-operator         gpu-operator-5669df6dd6-g6cdb                                      200m (0%)     500m (0%)   100Mi (0%)       350Mi (0%)     38m
  nvidia-gpu-operator         nvidia-container-toolkit-daemonset-ghdfd                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         132m
  nvidia-gpu-operator         nvidia-dcgm-exporter-js2gz                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         132m
  nvidia-gpu-operator         nvidia-device-plugin-daemonset-9hs2m                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         132m
  nvidia-gpu-operator         nvidia-operator-validator-vxwpz                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         132m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                1400m (2%)  500m (0%)
  memory             430Mi (0%)  690Mi (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  nvidia.com/gpu     0           0
Events:
  Type     Reason                   Age                From        Message
  ----     ------                   ----               ----        -------
  Normal   Starting                 14m                kube-proxy  
  Normal   NodeNotSchedulable       38m                kubelet     Node azken status is now: NodeNotSchedulable
  Normal   NodeSchedulable          36m                kubelet     Node azken status is now: NodeSchedulable
  Warning  InvalidDiskCapacity      14m                kubelet     invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  14m (x8 over 14m)  kubelet     Node azken status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    14m (x7 over 14m)  kubelet     Node azken status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     14m (x7 over 14m)  kubelet     Node azken status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  14m                kubelet     Updated Node Allocatable limit across pods
  Normal   Starting                 14m                kubelet     Starting kubelet.

With the Ansible deploy, after trying to launch the train process, the pod never comes back to life.

  Warning  Unhealthy         27m                    kubelet            Liveness probe failed: Get "http://192.168.35.97:8000/api/v1/health/liveness": dial tcp 192.168.35.97:8000: connect: connection refused
  Warning  Unhealthy         27m (x2 over 27m)      kubelet            Readiness probe failed: Get "http://192.168.35.97:8000/api/v1/health/readiness": dial tcp 192.168.35.97:8000: connect: connection refused
  Warning  Unhealthy         2m19s (x159 over 25m)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 400
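
The two failure modes above mean different things: "connection refused" means nothing is listening on port 8000 yet, while "statuscode: 400" means the API process is up but actively failing its own health check, so its application log is the next place to look. As a small sketch, the failing status code can be pulled straight out of the kubelet event text (canned line below; on a live cluster you would pipe `kubectl describe pod <api-pod>` instead, and follow up with `kubectl logs` on that pod):

```shell
# Canned kubelet event line, copied from the output above; on a real
# cluster, replace this with the output of `kubectl describe pod <api-pod>`.
events='Warning  Unhealthy  2m19s (x159 over 25m)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 400'

# Extract the HTTP status code reported by the readiness probe.
code=$(printf '%s\n' "$events" | sed -n 's/.*statuscode: \([0-9]*\).*/\1/p')
echo "$code"   # 400 -> app is listening but rejecting the probe;
               # "connection refused" -> app is not listening at all
```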

Anything new??

Could you check why there is a 535 driver on your side? Did you change the version when you ran `bash setup.sh install`?

I cannot reproduce your issue. After checking my node, I find that mine is:

                nvidia.com/cuda.driver.major=525
                nvidia.com/cuda.driver.minor=85
                nvidia.com/cuda.driver.rev=12
                nvidia.com/cuda.runtime.major=12
                nvidia.com/cuda.runtime.minor=0
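
As a sketch of the check itself: gpu-feature-discovery publishes the driver version as `nvidia.com/cuda.driver.*` node labels, so the full version string can be assembled from them. Canned label text below; on a live cluster the labels would come from something like `kubectl get node azken --show-labels` (node name taken from the output above):

```shell
# Canned label lines, as published by gpu-feature-discovery; on a live
# cluster, filter `kubectl get node <name> --show-labels` instead.
labels='nvidia.com/cuda.driver.major=525
nvidia.com/cuda.driver.minor=85
nvidia.com/cuda.driver.rev=12'

# Join major.minor.rev into one version string.
driver=$(printf '%s\n' "$labels" \
  | awk -F= '/cuda.driver.(major|minor|rev)/ {print $2}' \
  | paste -sd. -)
echo "$driver"   # 525.85.12
```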

Yes, I need to have the drivers installed on the system. I'm not the only user, and they are necessary for other projects that can't be moved into the cluster.

I think the first deploy had the 525.125.06 version. I will try again with that version, if you are more comfortable with it.
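
For reference, if the host already has the driver installed (as in this setup), the GPU Operator chart can be told not to deploy its own driver at all, or to pin a specific version instead of following the chart default. A hypothetical values fragment for the `gpu-operator` helm chart (the `driver.enabled` and `driver.version` keys are the chart's documented options; the exact version string is the one mentioned above):

```yaml
# values.yaml fragment for the nvidia/gpu-operator chart
driver:
  enabled: false          # host already provides the driver
  # or, to let the operator manage a pinned driver version instead:
  # enabled: true
  # version: "525.125.06"
```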