Please provide the following information when requesting support.
• Hardware: 2x RTXA6000ADA
• Network Type: Detectnet_v2
• TLT Version: 5.0.0
After deploying, I log in and launch the new TAO 5 and start the process with multiple GPUs. I work with the same dataset used in the previous version, TAO 4.
• Login works correctly
• Get the specs to convert the datasets correctly
• Convert the datasets correctly
• Good tfrecord generation
• Create a new model_id correctly
• Get the specs to train correctly
• Add my labels and include my personal specs correctly
• The train.json looks good (better than in TAO 4, hallelujah)
• Launch TRAIN !!! NOTHING HAPPENS (see the quick check sketched below)
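A quick way to see whether launching TRAIN spawns anything at all is to watch the namespace while the job is submitted; a minimal sketch with plain kubectl, assuming the API creates the training workload as a pod/Job in the same gpu-operator namespace as the rest of the deployment:

# watch for a new training pod appearing (or not) right after model-train is issued
$ kubectl get pods -n gpu-operator -w
# check whether any Kubernetes Job object was created for the train action
$ kubectl get jobs -n gpu-operator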
$ kubectl logs -n gpu-operator tao-toolkit-api-app-pod-5cf97f4dc4-mt9dn
Adding trusted user: aca5e8b5-9d4c-52e0-a612-563bd387f382
172.16.1.2 - - [25/Jul/2023:12:46:46 +0000] "GET /api/v1/login/amdsZmo2YXV1dWhnaDgyYWlhc3Jkb252NWg6YmUxZmI4MTQtNGMwZi00NDk1LWJhMTUtYmM4Nzk4YjNlNWQz HTTP/1.1" 200 1167 "-" "python-requests/2.28.2"
172.16.1.2 - - [25/Jul/2023:12:48:02 +0000] "GET /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/dataset/5052fb99-fde5-4871-aabe-0f5f3b128503/specs/convert/schema HTTP/1.1" 200 3348 "-" "python-requests/2.28.2"
172.16.1.2 - - [25/Jul/2023:12:48:05 +0000] "GET /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/dataset/8d0886c8-82df-4fc0-99f8-963f964abfaa/specs/convert/schema HTTP/1.1" 200 3348 "-" "python-requests/2.28.2"
172.16.1.2 - - [25/Jul/2023:12:48:11 +0000] "POST /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/dataset/5052fb99-fde5-4871-aabe-0f5f3b128503/job HTTP/1.1" 201 117 "-" "python-requests/2.28.2"
172.16.1.2 - - [25/Jul/2023:12:49:35 +0000] "POST /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/dataset/8d0886c8-82df-4fc0-99f8-963f964abfaa/job HTTP/1.1" 201 117 "-" "python-requests/2.28.2"
172.16.1.2 - - [25/Jul/2023:12:56:25 +0000] "POST /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/model HTTP/1.1" 201 800 "-" "python-requests/2.28.2"
172.16.1.2 - - [25/Jul/2023:12:58:52 +0000] "GET /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/model/b37aba2c-aadc-43cf-a1fd-21c54f8437f3/specs/train/schema HTTP/1.1" 200 45210 "-" "python-requests/2.28.2"
172.16.1.2 - - [25/Jul/2023:13:08:58 +0000] "POST /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/model/b37aba2c-aadc-43cf-a1fd-21c54f8437f3/job HTTP/1.1" 201 117 "-" "python-requests/2.28.2"
$ kubectl logs -n gpu-operator tao-toolkit-api-workflow-pod-679984675f-v8k9g
NGC CLI 3.23.0
detectnet_v2 dataset_convert --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/afdf3fee-58a3-4bf5-8628-c78960eadf10/ --output_filename=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/tfrecords/tfrecords --verbose --dataset_export_spec=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/specs/afdf3fee-58a3-4bf5-8628-c78960eadf10.protobuf > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/logs/afdf3fee-58a3-4bf5-8628-c78960eadf10.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/logs/afdf3fee-58a3-4bf5-8628-c78960eadf10.txt; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/afdf3fee-58a3-4bf5-8628-c78960eadf10/ -type d | xargs chmod 777; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/afdf3fee-58a3-4bf5-8628-c78960eadf10/ -type f | xargs chmod 666 /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/afdf3fee-58a3-4bf5-8628-c78960eadf10/status.json
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
Job created afdf3fee-58a3-4bf5-8628-c78960eadf10
Post running
Toolkit status for afdf3fee-58a3-4bf5-8628-c78960eadf10 is SUCCESS
Job Done: afdf3fee-58a3-4bf5-8628-c78960eadf10 Final status: Done
detectnet_v2 dataset_convert --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/8d0886c8-82df-4fc0-99f8-963f964abfaa/379207ef-d69a-469f-8e9d-e3963f645f04/ --output_filename=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/8d0886c8-82df-4fc0-99f8-963f964abfaa/tfrecords/tfrecords --verbose --dataset_export_spec=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/8d0886c8-82df-4fc0-99f8-963f964abfaa/specs/379207ef-d69a-469f-8e9d-e3963f645f04.protobuf > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/8d0886c8-82df-4fc0-99f8-963f964abfaa/logs/379207ef-d69a-469f-8e9d-e3963f645f04.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/8d0886c8-82df-4fc0-99f8-963f964abfaa/logs/379207ef-d69a-469f-8e9d-e3963f645f04.txt; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/8d0886c8-82df-4fc0-99f8-963f964abfaa/379207ef-d69a-469f-8e9d-e3963f645f04/ -type d | xargs chmod 777; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/8d0886c8-82df-4fc0-99f8-963f964abfaa/379207ef-d69a-469f-8e9d-e3963f645f04/ -type f | xargs chmod 666 /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/8d0886c8-82df-4fc0-99f8-963f964abfaa/379207ef-d69a-469f-8e9d-e3963f645f04/status.json
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
Job created 379207ef-d69a-469f-8e9d-e3963f645f04
Post running
Toolkit status for 379207ef-d69a-469f-8e9d-e3963f645f04 is SUCCESS
Job Done: 379207ef-d69a-469f-8e9d-e3963f645f04 Final status: Done
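For completeness, the tfrecords that these converts produced can be listed from inside the API pod; a rough sketch, assuming the shared volume is mounted at /shared in the tao-toolkit-api-app pod, as the log paths above suggest:

# list the generated tfrecords for the first dataset (same path as in the workflow log)
$ kubectl exec -n gpu-operator tao-toolkit-api-app-pod-5cf97f4dc4-mt9dn -- \
    ls -lh /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/5052fb99-fde5-4871-aabe-0f5f3b128503/tfrecords/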
Everything always looks normal. After launching the train process from the notebook, I get the response with the UUID of the train job:
tao-client detectnet-v2 model-train --id b37aba2c-aadc-43cf-a1fd-21c54f8437f3
c77fa0b5-971e-4774-b591-0f787c51373b
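That UUID should in principle be queryable against the job endpoint that appears in the api-app log; a hedged curl sketch, assuming a GET on the same path with the job id appended and an Authorization header carrying the token returned by the login call (header scheme and response format are my assumptions, not confirmed):

# token as returned by the GET /api/v1/login/<NGC key> call
$ TOKEN=<token from the login response>
# poll the train job created under the model
$ curl -s -H "Authorization: Bearer $TOKEN" \
    http://10.1.1.10:31951/api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/model/b37aba2c-aadc-43cf-a1fd-21c54f8437f3/job/c77fa0b5-971e-4774-b591-0f787c51373b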
Nothing happens. The api-app-pod registers the POST correctly:
172.16.1.2 - - [25/Jul/2023:13:08:58 +0000] "POST /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/model/b37aba2c-aadc-43cf-a1fd-21c54f8437f3/job HTTP/1.1" 201 117 "-" "python-requests/2.28.2"
But at this point the whole TAO cluster is dead. The TAO API stops responding to POST/GET requests, and the pod loses its READY status (extra diagnostics are sketched after the events below):
gpu-operator tao-toolkit-api-app-pod-5cf97f4dc4-mt9dn 0/1 Running
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 37m default-scheduler Successfully assigned gpu-operator/tao-toolkit-api-app-pod-5cf97f4dc4-mt9dn to azken
Normal Pulling 37m kubelet Pulling image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api"
Normal Pulled 37m kubelet Successfully pulled image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api" in 3.010865548s (6.798282193s including waiting)
Normal Created 37m kubelet Created container tao-toolkit-api-app
Normal Started 37m kubelet Started container tao-toolkit-api-app
Warning Unhealthy 37m (x2 over 37m) kubelet Readiness probe failed: Get "http://192.168.99.71:8000/api/v1/health/readiness": dial tcp 192.168.99.71:8000: connect: connection refused
Warning Unhealthy 37m (x2 over 37m) kubelet Liveness probe failed: Get "http://192.168.99.71:8000/api/v1/health/liveness": dial tcp 192.168.99.71:8000: connect: connection refused
Warning Unhealthy 2m10s (x49 over 9m10s) kubelet Readiness probe failed: HTTP probe failed with statuscode: 400
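Before tearing the deployment down, these are the extra diagnostics I can still collect with standard kubectl (pod name taken from above):

# logs of the previous container instance, if it has restarted
$ kubectl logs -n gpu-operator tao-toolkit-api-app-pod-5cf97f4dc4-mt9dn --previous
# cluster events around the time the probes started failing
$ kubectl get events -n gpu-operator --sort-by=.lastTimestamp
# overall state of the TAO API pods
$ kubectl get pods -n gpu-operator -o wide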
To get out of this state I have to helm uninstall and reinstall again.
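The recovery is roughly the following; the release name and chart reference are placeholders from my setup, and the values file is the one used for the original install:

$ helm uninstall tao-toolkit-api -n gpu-operator
$ helm install tao-toolkit-api <tao-toolkit-api chart> -n gpu-operator -f <values.yaml from the original install>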
If I try to get more information from the tao-toolkit-api, it responds with the following:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='10.1.1.10', port=31951): Max retries exceeded with url: /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/model/b37aba2c-aadc-43cf-a1fd-21c54f8437f3/job (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2adb6af7c0>: Failed to establish a new connection: [Errno 111] Connection refused'))
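In that state even the health endpoints that the kubelet probes hit (see the events above) are unreachable; a quick check through the NodePort from the traceback, assuming it fronts the same api-app pod:

$ curl -v http://10.1.1.10:31951/api/v1/health/liveness
$ curl -v http://10.1.1.10:31951/api/v1/health/readiness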
Testing with 1 GPU is still pending on my side.