• Hardware: 2x RTXA6000ADA
• Network Type: Detectnet_v2 - PeopleNet as transfer learning
• TLT Version: 5.0.0 API Kubernetes
After finally getting multi-GPU training to work, I started the AutoML process with the Hyperband configuration.
Everything looked good overnight. After 26 experiments, AutoML started a “new” experiment with a seemingly random number, in this case 11, like the main character of Stranger Things. I found this out from the write dates of the experiment files on the server. The training process then froze with an error in experiment 11.
In summary, I don’t know WHY, after finishing experiment 11 and reaching experiment 26, it started experiment 11 again. I also don’t know WHY the spec file generated the second time experiment 11 started didn’t have pretrained_model_file: configured.
Attached Hyperband configuration:
"automl_enabled": true,
"automl_algorithm": "Hyperband",
"metric": "loss",
"automl_add_hyperparameters": "[]",
"automl_remove_hyperparameters": "[]",
"epoch_multiplier": 10,
"automl_R": 27,
"automl_nu": 3
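For reference, here is a rough sketch of what the textbook Hyperband schedule looks like for these values. It assumes that automl_R is the maximum budget R, that automl_nu is the reduction factor (usually written eta), and that epoch_multiplier converts budget units into epochs. The function names are mine, and this is not TAO's code, so the real controller may differ.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def hyperband_schedule(R, eta):
    """Return one list of (num_configs, budget) rungs per Hyperband bracket."""
    # s_max = floor(log_eta(R)), computed with integers to avoid float rounding.
    s_max = 0
    while eta ** (s_max + 1) <= R:
        s_max += 1

    brackets = []
    for s in range(s_max, -1, -1):
        n = ceil_div((s_max + 1) * eta ** s, s + 1)  # initial number of configs
        r = R / eta ** s                             # initial budget per config
        # Successive halving: keep the best 1/eta configs and give each
        # survivor eta times more budget on the next rung.
        rungs = [(n // eta ** i, r * eta ** i) for i in range(s + 1)]
        brackets.append(rungs)
    return brackets


if __name__ == "__main__":
    EPOCH_MULTIPLIER = 10  # "epoch_multiplier" from the configuration above
    for rungs in hyperband_schedule(R=27, eta=3):
        # Print (number of configs, epochs per config) for every rung.
        print([(n, int(r * EPOCH_MULTIPLIER)) for n, r in rungs])
```

Under these assumptions the first bracket trains 27 configurations at the smallest budget (10 epochs each) and then continues only the best 9 with a larger budget. If TAO follows that scheme, an already finished experiment being started again after experiment 26 could be a promotion to the next rung rather than a random pick.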
Attached logs from the workflow pod:
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/d58662ca-7674-4b59-b1d8-d1f1a9a5bba6/recommendation_0.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/d58662ca-7674-4b59-b1d8-d1f1a9a5bba6/experiment_0/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/d58662ca-7674-4b59-b1d8-d1f1a9a5bba6/experiment_0/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/d58662ca-7674-4b59-b1d8-d1f1a9a5bba6/experiment_0/log.txt
AutoML recommendation with experiment id 0 and job id d2d2dbd5-7195-4dfc-b980-492c71746053 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/d421078d-bd13-49bd-87f6-8636e5bb6f0c/recommendation_0.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/d421078d-bd13-49bd-87f6-8636e5bb6f0c/experiment_0/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/d421078d-bd13-49bd-87f6-8636e5bb6f0c/experiment_0/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/d421078d-bd13-49bd-87f6-8636e5bb6f0c/experiment_0/log.txt
AutoML recommendation with experiment id 0 and job id 09db6c35-4797-4e08-ace9-8873641cf6ee submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_0.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_0/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_0/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_0/log.txt
AutoML recommendation with experiment id 0 and job id 3498bde3-f282-4973-b44c-02cb401f5538 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_1.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_1/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_1/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_1/log.txt
AutoML recommendation with experiment id 1 and job id 92aa8359-ecdd-47b0-be1b-e470b718fb51 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_2.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_2/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_2/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_2/log.txt
AutoML recommendation with experiment id 2 and job id 9266c02b-1605-41c9-9074-ae0616aadd6d submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_3.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_3/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_3/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_3/log.txt
AutoML recommendation with experiment id 3 and job id b5cedd49-607b-482f-bf13-e4aa1d109720 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_4.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_4/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_4/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_4/log.txt
AutoML recommendation with experiment id 4 and job id bfde6b5b-93ed-422f-87e7-6e04cb4a6d58 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_5.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_5/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_5/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_5/log.txt
AutoML recommendation with experiment id 5 and job id 7ecd264f-0ec1-4cf2-9e90-cf1c30458dee submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_6.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_6/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_6/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_6/log.txt
AutoML recommendation with experiment id 6 and job id 1d750111-2d6d-44c6-bdd1-3da031e4db6e submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_7.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_7/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_7/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_7/log.txt
AutoML recommendation with experiment id 7 and job id 7f35195b-c351-4db0-b852-fe12a2b2bcfc submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_8.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_8/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_8/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_8/log.txt
AutoML recommendation with experiment id 8 and job id 59bf549e-5d9e-49b6-ba9b-e092c6258ca0 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_9.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_9/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_9/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_9/log.txt
AutoML recommendation with experiment id 9 and job id a253d250-31a5-4c36-8d46-13199ac7fd03 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_10.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_10/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_10/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_10/log.txt
AutoML recommendation with experiment id 10 and job id 8f7854a7-1eeb-4e55-bdaa-fa14e25f5c99 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_11.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_11/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_11/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_11/log.txt
AutoML recommendation with experiment id 11 and job id c63797b5-5c1c-45b9-b706-ea1dec3eb6ee submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_12.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_12/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_12/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_12/log.txt
AutoML recommendation with experiment id 12 and job id 238ccf56-4fdc-4439-bbcf-52b46867c088 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_13.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_13/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_13/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_13/log.txt
AutoML recommendation with experiment id 13 and job id a05c9b1a-b011-4886-a391-119c46e0cfaa submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_14.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_14/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_14/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_14/log.txt
AutoML recommendation with experiment id 14 and job id aac9dc43-e23f-48cb-9b95-148b0fcd6ace submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_15.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_15/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_15/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_15/log.txt
AutoML recommendation with experiment id 15 and job id b500a649-c550-47e6-99de-509c98b2bb7d submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_16.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_16/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_16/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_16/log.txt
AutoML recommendation with experiment id 16 and job id f97cea44-f58f-4830-832c-bb825deab0f1 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_17.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_17/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_17/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_17/log.txt
AutoML recommendation with experiment id 17 and job id db4c0f98-c3df-447a-9244-ac09060ae8cc submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_18.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_18/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_18/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_18/log.txt
AutoML recommendation with experiment id 18 and job id d137729e-373c-4a34-ad3d-e90e976215ac submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_19.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_19/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_19/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_19/log.txt
AutoML recommendation with experiment id 19 and job id 3cdf44af-c0ef-4c08-99e5-acdd34d89c34 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_20.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_20/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_20/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_20/log.txt
AutoML recommendation with experiment id 20 and job id 3f3b3fc7-e32d-4964-8e26-27f79748d258 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_21.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_21/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_21/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_21/log.txt
AutoML recommendation with experiment id 21 and job id aa22a938-a967-4cef-9512-f1598496bbf4 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_22.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_22/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_22/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_22/log.txt
AutoML recommendation with experiment id 22 and job id 0eaf28de-ca90-4637-a885-03bc7afb30b1 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_23.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_23/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_23/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_23/log.txt
AutoML recommendation with experiment id 23 and job id 1c14fb12-43d3-4387-830a-9abfb7823a9f submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_24.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_24/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_24/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_24/log.txt
AutoML recommendation with experiment id 24 and job id e6269f25-b6b9-4f0c-ae93-50774ae2e8e3 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_25.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_25/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_25/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_25/log.txt
AutoML recommendation with experiment id 25 and job id a310808b-83ff-447d-8927-be69dccf3979 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_26.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_26/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_26/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_26/log.txt
AutoML recommendation with experiment id 26 and job id 59ac2817-fda1-45c0-b49b-7e723fa87fc2 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_11.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_11/ --key=tlt_encode --gpus=2 --use_amp > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_11/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_11/log.txt
AutoML recommendation with experiment id 11 and job id c63797b5-5c1c-45b9-b706-ea1dec3eb6ee submitted
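To spot this kind of repeat without reading the whole log, a small script along these lines could help. It is only a sketch that relies on the "AutoML recommendation with experiment id ... submitted" line format shown above; find_resubmissions is a name I made up, not part of the TAO API.

```python
import re
from collections import defaultdict

# Matches the submission lines printed by the workflow pod.
SUBMIT_RE = re.compile(
    r"AutoML recommendation with experiment id (\d+) and job id ([0-9a-f-]+) submitted"
)


def find_resubmissions(log_text):
    """Return {(experiment_id, job_id): count} for pairs submitted more than once."""
    counts = defaultdict(int)
    for exp_id, job_id in SUBMIT_RE.findall(log_text):
        counts[(int(exp_id), job_id)] += 1
    return {key: n for key, n in counts.items() if n > 1}
```

Keying on both the experiment id and the job id keeps the three separate "experiment id 0" submissions (different job ids) apart, while the second submission of experiment 11 with the same job id shows up as a repeat.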
Log generated by the AutoML job pod:
kubectl logs a0fc3edc-28fb-4e7f-95ce-e2338602f12f-8rr5q
NGC CLI 3.23.0
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.0003961567999795079 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T11:00:43Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.00016320364375133067 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T11:39:17Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.011115627363324165 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T12:12:01Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.012782877311110497 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T12:37:09Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.012704013846814632 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T13:15:44Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.012742005288600922 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T13:54:36Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.012899148277938366 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T14:33:38Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.0005173633107915521 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T15:11:25Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.0004279972636140883 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T15:49:14Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.00520962942391634 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T16:28:02Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.00028941070195287466 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T17:06:23Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.00015078166325110942 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T17:45:25Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.004880583845078945 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T18:06:14Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.012657305225729942 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T18:44:02Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.004696693271398544 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T19:23:21Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.012759639881551266 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T19:59:51Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.004692853428423405 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T20:38:11Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.0050573041662573814 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T21:15:13Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.012663132511079311 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T21:54:17Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.00020110807963646948 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T22:33:20Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.012810888700187206 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T23:10:07Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.00042211421532556415 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T23:48:12Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.005165629554539919 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-10T00:27:18Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.00039220356848090887 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-10T01:06:23Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.004880324471741915 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-10T01:45:13Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.00017470787861384451 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-10T02:23:15Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.005112297832965851 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-10T02:59:33Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
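The metrics in this log can be pulled out with a few lines of Python to see which run scored best. This is a sketch that assumes the metrics are printed in the same order as the experiments were submitted, so that the position in the list equals the experiment id. The log itself does not state that, and the helper names are mine.

```python
import re

# Matches lines such as "Metric returned is 0.00039 at epoch/iter 10".
METRIC_RE = re.compile(r"Metric returned is ([0-9.eE+-]+) at epoch/iter (\d+)")


def parse_metrics(log_text):
    """Return a list of (metric_value, epoch) tuples in log order."""
    return [(float(value), int(epoch)) for value, epoch in METRIC_RE.findall(log_text)]


def best_index(metrics):
    """Return the position of the lowest metric (the configured metric is the loss)."""
    values = [value for value, _ in metrics]
    return values.index(min(values))
```

On the log above, the lowest of the 27 reported values is 0.00015078166325110942, which is the twelfth entry (index 11). If the ordering assumption holds, experiment 11 was the best run of the first round, which would fit Hyperband picking it to continue.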
Attached is the data generated inside experiment 11:
log_exp11.txt (9.0 KB)
recommendation_11.protobuf (10.3 KB)
All the experiment .protobuf spec files include the pretrained_model_file: entry, except the one for experiment 11.
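For anyone who wants to run the same check on their own results folder, this is roughly how it can be automated. It is a sketch that assumes the recommendation_*.protobuf files are plain-text specs sitting in one directory; specs_missing_key is a hypothetical helper name.

```python
from pathlib import Path


def specs_missing_key(spec_dir, key="pretrained_model_file"):
    """Return the names of recommendation_*.protobuf files that do not mention `key`."""
    missing = []
    for path in sorted(Path(spec_dir).glob("recommendation_*.protobuf")):
        # The specs are read as text; undecodable bytes are replaced, not fatal.
        if key not in path.read_text(errors="replace"):
            missing.append(path.name)
    return missing
```

Calling it with the AutoML job directory (the folder that holds recommendation_0.protobuf and the others) should list only the spec files without the entry.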
After this discovery, I sent a job/cancel request to the API. The job terminated correctly, and I tried to restart the AutoML training.
The new training pod started in an error state with this message:
tkeic@azken:~$ kubectl logs a0fc3edc-28fb-4e7f-95ce-e2338602f12f-xd4hd
Traceback (most recent call last):
  File "/opt/api/automl_start.py", line 123, in <module>
    automl_start(
  File "/opt/api/automl_start.py", line 35, in automl_start
    controller.start()
  File "/opt/api/automl/controller.py", line 130, in start
    self._execute_loop()
  File "/opt/api/automl/controller.py", line 203, in _execute_loop
    self.run_experiments()
  File "/opt/api/automl/controller.py", line 216, in run_experiments
    recommended_specs = self.brain.generate_recommendations(history)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/api/automl/hyperband.py", line 269, in generate_recommendations
    if history[self.track_id].status not in [JobStates.success, JobStates.failure]:
               ^^^^^^^^^^^^^
AttributeError: 'HyperBand' object has no attribute 'track_id'
Am I the only one struggling to reach any goal with this software?