TAO5 - Detectnet_v2 - MultiGPU AutoML StrangerThings

• Hardware: 2x RTXA6000ADA
• Network Type: Detectnet_v2 - PeopleNet as transfer learning
• TLT Version: 5.0.0 API Kubernetes

After finally getting multi-GPU training to work, I started the AutoML process with a Hyperband configuration.

Everything looked good overnight. After reaching experiment 26, AutoML started a “new” experiment with a seemingly random number, in this case 11, like the main character of Stranger Things. I discovered this from the write dates of the experiment files on the server. The training process then froze with an error in experiment 11.

In summary, I don’t know why, after finishing experiment 11 and reaching experiment 26, AutoML started experiment 11 again. I also don’t know why the spec file generated for the second run of experiment 11 did not have pretrained_model_file: configured.

Here is the Hyperband configuration:

    "automl_enabled": true,
    "automl_algorithm": "Hyperband",
    "metric": "loss",
    "automl_add_hyperparameters": "[]",
    "automl_remove_hyperparameters": "[]",
    "epoch_multiplier": 10,
    "automl_R": 27,
    "automl_nu": 3

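For reference, the textbook Hyperband schedule for these values can be sketched in Python as follows. This assumes that automl_R and automl_nu play the roles of R and η in the published Hyperband algorithm and that epoch_multiplier converts each rung's resource into epochs. Those mappings are assumptions about the TAO implementation, and hyperband_schedule is a hypothetical helper, not part of the TAO API.

```python
def hyperband_schedule(R, eta, epoch_multiplier=1):
    """Return textbook Hyperband brackets as lists of (n_configs, epochs) rungs.

    Assumption: R, eta and epoch_multiplier correspond to automl_R, automl_nu
    and epoch_multiplier in the configuration; TAO may schedule differently.
    """
    # s_max = floor(log_eta(R)), computed with integers to avoid float rounding.
    s_max = 0
    while eta ** (s_max + 1) <= R:
        s_max += 1

    brackets = []
    for s in range(s_max, -1, -1):
        # Initial number of configurations: ceil((s_max + 1) * eta^s / (s + 1)).
        numerator = (s_max + 1) * eta ** s
        n = -(-numerator // (s + 1))
        rungs = []
        for i in range(s + 1):
            n_i = n // eta ** i              # configurations kept at rung i
            r_i = R * eta ** i // eta ** s   # resource per configuration at rung i
            rungs.append((n_i, r_i * epoch_multiplier))
        brackets.append(rungs)
    return brackets


if __name__ == "__main__":
    for rungs in hyperband_schedule(R=27, eta=3, epoch_multiplier=10):
        print(rungs)
```

Under these assumptions the first bracket starts with 27 configurations at 10 epochs each, and its next rung re-runs the best 9 of them for 30 epochs. If TAO follows this scheme, an earlier experiment id being launched again after experiment 26 could be a promotion to the next rung rather than a random restart.
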
Logs from the workflow pod:

Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/d58662ca-7674-4b59-b1d8-d1f1a9a5bba6/recommendation_0.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/d58662ca-7674-4b59-b1d8-d1f1a9a5bba6/experiment_0/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/d58662ca-7674-4b59-b1d8-d1f1a9a5bba6/experiment_0/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/d58662ca-7674-4b59-b1d8-d1f1a9a5bba6/experiment_0/log.txt
AutoML recommendation with experiment id 0 and job id d2d2dbd5-7195-4dfc-b980-492c71746053 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/d421078d-bd13-49bd-87f6-8636e5bb6f0c/recommendation_0.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/d421078d-bd13-49bd-87f6-8636e5bb6f0c/experiment_0/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/d421078d-bd13-49bd-87f6-8636e5bb6f0c/experiment_0/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/d421078d-bd13-49bd-87f6-8636e5bb6f0c/experiment_0/log.txt
AutoML recommendation with experiment id 0 and job id 09db6c35-4797-4e08-ace9-8873641cf6ee submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_0.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_0/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_0/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_0/log.txt
AutoML recommendation with experiment id 0 and job id 3498bde3-f282-4973-b44c-02cb401f5538 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_1.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_1/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_1/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_1/log.txt
AutoML recommendation with experiment id 1 and job id 92aa8359-ecdd-47b0-be1b-e470b718fb51 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_2.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_2/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_2/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_2/log.txt
AutoML recommendation with experiment id 2 and job id 9266c02b-1605-41c9-9074-ae0616aadd6d submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_3.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_3/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_3/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_3/log.txt
AutoML recommendation with experiment id 3 and job id b5cedd49-607b-482f-bf13-e4aa1d109720 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_4.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_4/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_4/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_4/log.txt
AutoML recommendation with experiment id 4 and job id bfde6b5b-93ed-422f-87e7-6e04cb4a6d58 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_5.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_5/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_5/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_5/log.txt
AutoML recommendation with experiment id 5 and job id 7ecd264f-0ec1-4cf2-9e90-cf1c30458dee submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_6.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_6/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_6/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_6/log.txt
AutoML recommendation with experiment id 6 and job id 1d750111-2d6d-44c6-bdd1-3da031e4db6e submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_7.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_7/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_7/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_7/log.txt
AutoML recommendation with experiment id 7 and job id 7f35195b-c351-4db0-b852-fe12a2b2bcfc submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_8.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_8/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_8/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_8/log.txt
AutoML recommendation with experiment id 8 and job id 59bf549e-5d9e-49b6-ba9b-e092c6258ca0 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_9.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_9/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_9/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_9/log.txt
AutoML recommendation with experiment id 9 and job id a253d250-31a5-4c36-8d46-13199ac7fd03 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_10.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_10/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_10/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_10/log.txt
AutoML recommendation with experiment id 10 and job id 8f7854a7-1eeb-4e55-bdaa-fa14e25f5c99 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_11.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_11/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_11/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_11/log.txt
AutoML recommendation with experiment id 11 and job id c63797b5-5c1c-45b9-b706-ea1dec3eb6ee submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_12.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_12/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_12/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_12/log.txt
AutoML recommendation with experiment id 12 and job id 238ccf56-4fdc-4439-bbcf-52b46867c088 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_13.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_13/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_13/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_13/log.txt
AutoML recommendation with experiment id 13 and job id a05c9b1a-b011-4886-a391-119c46e0cfaa submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_14.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_14/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_14/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_14/log.txt
AutoML recommendation with experiment id 14 and job id aac9dc43-e23f-48cb-9b95-148b0fcd6ace submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_15.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_15/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_15/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_15/log.txt
AutoML recommendation with experiment id 15 and job id b500a649-c550-47e6-99de-509c98b2bb7d submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_16.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_16/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_16/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_16/log.txt
AutoML recommendation with experiment id 16 and job id f97cea44-f58f-4830-832c-bb825deab0f1 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_17.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_17/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_17/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_17/log.txt
AutoML recommendation with experiment id 17 and job id db4c0f98-c3df-447a-9244-ac09060ae8cc submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_18.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_18/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_18/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_18/log.txt
AutoML recommendation with experiment id 18 and job id d137729e-373c-4a34-ad3d-e90e976215ac submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_19.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_19/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_19/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_19/log.txt
AutoML recommendation with experiment id 19 and job id 3cdf44af-c0ef-4c08-99e5-acdd34d89c34 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_20.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_20/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_20/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_20/log.txt
AutoML recommendation with experiment id 20 and job id 3f3b3fc7-e32d-4964-8e26-27f79748d258 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_21.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_21/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_21/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_21/log.txt
AutoML recommendation with experiment id 21 and job id aa22a938-a967-4cef-9512-f1598496bbf4 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_22.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_22/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_22/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_22/log.txt
AutoML recommendation with experiment id 22 and job id 0eaf28de-ca90-4637-a885-03bc7afb30b1 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_23.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_23/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_23/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_23/log.txt
AutoML recommendation with experiment id 23 and job id 1c14fb12-43d3-4387-830a-9abfb7823a9f submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_24.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_24/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_24/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_24/log.txt
AutoML recommendation with experiment id 24 and job id e6269f25-b6b9-4f0c-ae93-50774ae2e8e3 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_25.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_25/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_25/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_25/log.txt
AutoML recommendation with experiment id 25 and job id a310808b-83ff-447d-8927-be69dccf3979 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_26.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_26/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_26/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_26/log.txt
AutoML recommendation with experiment id 26 and job id 59ac2817-fda1-45c0-b49b-7e723fa87fc2 submitted
Loaded AutoML specs
detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/recommendation_11.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_11/ --key=tlt_encode --gpus=2 --use_amp  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_11/log.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/d066c14d-cdc1-47b8-8ecf-c4c6abb0429c/a0fc3edc-28fb-4e7f-95ce-e2338602f12f/experiment_11/log.txt
AutoML recommendation with experiment id 11 and job id c63797b5-5c1c-45b9-b706-ea1dec3eb6ee submitted
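To spot repeated submissions like this without reading the whole log, a small Python sketch along these lines could help. It only relies on the "AutoML recommendation with experiment id ... submitted" line format shown above; repeated_submissions is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Matches the submission lines printed by the workflow pod in the log above.
SUBMITTED = re.compile(
    r"AutoML recommendation with experiment id (\d+) and job id ([0-9a-f-]+) submitted"
)


def repeated_submissions(log_text):
    """Map experiment id -> job ids, keeping only ids submitted more than once."""
    seen = defaultdict(list)
    for match in SUBMITTED.finditer(log_text):
        seen[int(match.group(1))].append(match.group(2))
    return {exp: jobs for exp, jobs in seen.items() if len(jobs) > 1}


if __name__ == "__main__":
    import sys

    # Usage: kubectl logs <workflow-pod> | python this_script.py
    for exp, jobs in sorted(repeated_submissions(sys.stdin.read()).items()):
        print(exp, jobs)
```

On the log above this would report experiment 0 (three submissions with different job ids, from three different job directories) and experiment 11, whose second submission reuses the same job id c63797b5-5c1c-45b9-b706-ea1dec3eb6ee.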

Log from the generated AutoML job pod:

kubectl logs a0fc3edc-28fb-4e7f-95ce-e2338602f12f-8rr5q
NGC CLI 3.23.0
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.0003961567999795079 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T11:00:43Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.00016320364375133067 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T11:39:17Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.011115627363324165 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T12:12:01Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.012782877311110497 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T12:37:09Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.012704013846814632 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T13:15:44Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.012742005288600922 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T13:54:36Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.012899148277938366 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T14:33:38Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.0005173633107915521 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T15:11:25Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.0004279972636140883 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T15:49:14Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.00520962942391634 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T16:28:02Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.00028941070195287466 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T17:06:23Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.00015078166325110942 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T17:45:25Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.004880583845078945 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T18:06:14Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.012657305225729942 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T18:44:02Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.004696693271398544 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T19:23:21Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.012759639881551266 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T19:59:51Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.004692853428423405 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T20:38:11Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.0050573041662573814 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T21:15:13Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.012663132511079311 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T21:54:17Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.00020110807963646948 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T22:33:20Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.012810888700187206 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T23:10:07Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.00042211421532556415 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-09T23:48:12Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.005165629554539919 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-10T00:27:18Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.00039220356848090887 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-10T01:06:23Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.004880324471741915 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-10T01:45:13Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.00017470787861384451 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-10T02:23:15Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow
Metric returned is 0.005112297832965851 at epoch/iter 10
Job deleted. status='{'startTime': '2023-08-10T02:59:33Z', 'active': 1, 'uncountedTerminatedPods': {}}'
Recommendation gotten
Recommendation submitted to workflow

Attached is the data generated inside experiment 11.
log_exp11.txt (9.0 KB)
recommendation_11.protobuf (10.3 KB)

Every experiment’s .protobuf spec includes pretrained_model_file: except the one for experiment 11.
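A check like this can be automated rather than done by opening each file. The following is a minimal sketch, assuming the specs are the text-format recommendation_N.protobuf files that sit in the AutoML job directory (as in the workflow log above); the function name `specs_missing_key` is hypothetical and not part of TAO.

```python
import re
from pathlib import Path


def specs_missing_key(job_dir, key="pretrained_model_file"):
    """Return the experiment ids whose recommendation_<id>.protobuf never sets `key`.

    Assumption: specs are plain-text protobuf, so a set field appears as
    `key: value` at the start of a line (after optional indentation).
    """
    pattern = re.compile(rf"^\s*{re.escape(key)}\s*:", re.MULTILINE)
    missing = []
    for spec in Path(job_dir).glob("recommendation_*.protobuf"):
        match = re.fullmatch(r"recommendation_(\d+)", spec.stem)
        if match is None:
            continue  # skip files that do not follow the expected naming
        if not pattern.search(spec.read_text(errors="replace")):
            missing.append(int(match.group(1)))
    return sorted(missing)


# Usage (the path is a placeholder for the real job directory):
# print(specs_missing_key("path/to/automl_job_dir"))
```

The ids are sorted numerically, so experiment 11 is listed after experiment 2 rather than before it.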

After this discovery, I sent a job/cancel message to the API. The job terminated correctly, and I tried to restart the AutoML training.

The new training pod fails at startup with this message:

tkeic@azken:~$ kubectl logs a0fc3edc-28fb-4e7f-95ce-e2338602f12f-xd4hd
Traceback (most recent call last):
  File "/opt/api/automl_start.py", line 123, in <module>
    automl_start(
  File "/opt/api/automl_start.py", line 35, in automl_start
    controller.start()
  File "/opt/api/automl/controller.py", line 130, in start
    self._execute_loop()
  File "/opt/api/automl/controller.py", line 203, in _execute_loop
    self.run_experiments()
  File "/opt/api/automl/controller.py", line 216, in run_experiments
    recommended_specs = self.brain.generate_recommendations(history)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/api/automl/hyperband.py", line 269, in generate_recommendations
    if history[self.track_id].status not in [JobStates.success, JobStates.failure]:
               ^^^^^^^^^^^^^
AttributeError: 'HyperBand' object has no attribute 'track_id'
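In Python, an AttributeError of this kind usually means the attribute is only assigned on one code path (for example, when the first recommendation is generated) and was never assigned on the path taken after a restart. Whether that is what happens inside /opt/api/automl/hyperband.py cannot be confirmed from this thread. The sketch below uses a hypothetical `ToyBrain` class, not the TAO `HyperBand` class, purely to illustrate the failure mode and a defensive lookup.

```python
class ToyBrain:
    """Hypothetical stand-in used for illustration; NOT the TAO HyperBand class."""

    def first_recommendation(self):
        # track_id only comes into existence on this code path.
        self.track_id = 0
        return self.track_id

    def next_recommendation_unsafe(self, history):
        # Raises AttributeError if first_recommendation() never ran,
        # e.g. when the object is re-created after a cancel/restart.
        return history[self.track_id]

    def next_recommendation_safe(self, history):
        # Defensive variant: treat a missing attribute as "no state restored".
        track_id = getattr(self, "track_id", None)
        if track_id is None:
            return None  # the caller can decide to start from a fresh state
        return history[track_id]
```

With a freshly created `ToyBrain()`, `next_recommendation_unsafe` raises the same kind of error as the traceback, while `next_recommendation_safe` returns `None`.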

Am I the only one who can’t reach any goal with this software?

Could you please also share the files below?
├── automl_metadata.json
├── controller.json
├── controller.log
└── notebook

Can you delete the previous pods and double check? Also, I suggest setting the number of epochs to a smaller value (for example, 3) in the training spec files to quickly check whether Hyperband works. You can also run with the Bayesian algorithm.

automl_metadata.json (253 Bytes)
brain.json (609 Bytes)
controller.json (18.9 KB)
controller.log (100 Bytes)
current_rec.json (2 Bytes)

The notebook is the example API notebook. For now I’m not able to invent things.

I use 10 because more time is spent validating than training; I don’t know which setting is better.

If possible, please share the notebook with us. I finished running notebooks/tao_api_starter_kit/api/object_detection.ipynb with Hyperband, and the training works well. Could you please trigger a new AutoML training?

I ran the Bayesian method successfully.

I’ll launch a new Hyperband test and send you the notebook with the results by PM. But I don’t think I have made any extra modifications.

Maybe I need more time. I stopped the AutoML process to run the other tests, and now I can’t resume it (new failure).

Also note that I’m using the --use_amp parameter. Maybe you can reproduce the same behaviour with that.

If you are training on multiple GPUs, as a workaround please disable AMP or disable visualization during detectnet_v2 training, as mentioned in another topic.

One question: why does Hyperband evaluate after the first epoch, rather than when the configured num_epochs has finished?
Why not reuse the information generated in the first evaluation for the rest of the AutoML evaluations?
Currently more time is spent evaluating than training…

I need to check if I can reproduce. Could you please share the notebook?

That’s expected for the Hyperband method in AutoML training. More info can be found at
https://docs.nvidia.com/tao/tao-toolkit/text/automl/automl.html#automl-algorithm-explanation
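As a rough illustration of why short early evaluations and repeated experiment ids are expected, here is a sketch of the bracket schedule from the standard Hyperband description, applied to automl_R = 27 and automl_nu = 3. It assumes TAO follows that standard schedule and that each resource unit corresponds to epoch_multiplier epochs; neither assumption is confirmed in this thread, and the function name `hyperband_schedule` is made up for the example.

```python
def hyperband_schedule(R, eta):
    """Return, per bracket, a list of (num_configs, resource_units) rungs.

    Follows the standard Hyperband formulas with integer arithmetic:
      s_max = floor(log_eta(R))
      n     = ceil((s_max + 1) / (s + 1) * eta**s)
      n_i   = floor(n / eta**i),  r_i = R * eta**(i - s)
    """
    # Largest s_max with eta**s_max <= R (avoids floating-point log).
    s_max = 0
    while eta ** (s_max + 1) <= R:
        s_max += 1

    brackets = []
    for s in range(s_max, -1, -1):
        n = -(-(s_max + 1) * eta ** s // (s + 1))  # ceiling division
        rungs = []
        for i in range(s + 1):
            n_i = n // eta ** i            # configurations kept at this rung
            r_i = R * eta ** i // eta ** s  # resource units given to each one
            rungs.append((n_i, r_i))
        brackets.append(rungs)
    return brackets


for rungs in hyperband_schedule(27, 3):
    print(rungs)
```

For R = 27 and nu = 3, the first bracket is 27 configurations at 1 unit, then 9 at 3 units, 3 at 9 units, and 1 at 27 units. Under this reading, after experiments 0 to 26 finish their short runs, the best 9 are trained again for longer, which would explain an already finished experiment id such as 11 starting a second time.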

Well, I retried the experiment without --use_amp and got the same result.

The main job gets stuck.

The training jobs stop at number 11, with a failure in the training.
Attached is the log from experiment 11:
log.txt (10.3 KB)

And you can check the dates from the sequence:

Hi,
Could you please share the notebook as well? In particular, I want to check the spec parameters and which AutoML parameters you have added or removed.

It has been in your PMs since last week.

#"epoch_multiplier": 10, # Will be considered for Hyperband only and auto calculated

Could you please enable epoch_multiplier as below? From https://github.com/NVIDIA/tao_tutorials/blob/main/notebooks/tao_api_starter_kit/api/object_detection.ipynb

    automl_information = {"automl_enabled":automl_enabled,
                          "automl_algorithm":automl_algorithm,
                          "epoch_multiplier": 1, # Will be considered for Hyperband only
                          "metric":metric,
                          "automl_add_hyperparameters":str(additional_automl_parameters),
                          "automl_remove_hyperparameters":str(remove_default_automl_parameters)
                         }

I will also run something similar to your setting to check what will happen.

I tried this setting, but I needed to stop in the middle to do other things.
After stopping AutoML, I can’t resume it and need to start from 0…

Can you replicate it?

I am still working on that. I will update you if there is any progress. Thanks.

With your settings as below, I can reproduce the “AssertionError: Freeze blocks is only possible if a pretrained model file is provided” when the Hyperband trainings run during the 2nd SuccessiveHalving iteration.

    automl_information = {"automl_enabled":automl_enabled,
                          "automl_algorithm":automl_algorithm,
                          "automl_R": 27,
                          "automl_nu": 3,
                          "epoch_multiplier": 1, # Will be considered for Hyperband only
                          "metric":metric,
                          "automl_add_hyperparameters":str(additional_automl_parameters),
                          "automl_remove_hyperparameters":str(remove_default_automl_parameters)
                         }
    # Example for detectnet_v2 (for each network the parameter key might be different)
    specs["training_config"]["num_epochs"] = 3 # num_epochs is the parameter name for all object detection networks
    specs["gpus"] = 1
    specs["model_config"]["num_layers"] = 34
    specs["model_config"]["freeze_blocks"] = [0]
    specs["training_config"]["checkpoint_interval"] = 10
    specs["evaluation_config"]["first_validation_epoch"] = 10
    specs["evaluation_config"]["validation_period_during_training"] = 10

But I find that there is no error in the following iterations, such as:
3rd iteration: experiment_11, experiment_25, experiment_1
4th iteration: experiment_1

I am afraid it is related to the settings. Actually, you can continue to run the training.
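Since the assertion fires only when freeze_blocks is set and no pretrained model file reaches the spec, a pre-flight check on the spec dictionary can catch the combination before a job is submitted. This is a minimal sketch, assuming the notebook's `specs` dictionary keeps both fields under `model_config` (as in the detectnet_v2 spec file); the helper name `check_freeze_blocks` is hypothetical and not part of the TAO API.

```python
def check_freeze_blocks(specs):
    """Raise ValueError if freeze_blocks is set without a pretrained model file.

    Assumption: both fields live under specs["model_config"].
    """
    model_config = specs.get("model_config", {})
    freeze_blocks = model_config.get("freeze_blocks") or []
    pretrained = model_config.get("pretrained_model_file")
    if freeze_blocks and not pretrained:
        raise ValueError(
            "freeze_blocks is set but model_config.pretrained_model_file is empty"
        )


# Usage: call check_freeze_blocks(specs) right before posting the specs.
```

Such a check only validates the specs sent from the notebook. It cannot detect a spec that AutoML regenerates on the server without the field, which is the case reported here for the second run of experiment 11.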

I don’t understand.
How can I continue the 2nd SuccessiveHalving iteration if the training process broke and can’t resume?

I’m now using 2 GPUs.
(PS: I have a pending post about pausing/resuming AutoML.)

You can just let the training continue. From my experiment, I find that there are no such issues in the following iterations. It can resume.

The error should not be related to the number of GPUs, since I can reproduce the same error with one GPU.

When the test I currently have running finishes, I’ll try to resume some of the other AutoML tests.

If you don’t use freeze_blocks, does the error not happen?