AutoML experiment aborted due to error but metadata thinks it is still "Running"

Please provide the following information when requesting support.

• Hardware: Nvidia DGX A100
• Network Type: SSD
• TAO 4.0.1 (4.0.2 Helm chart)

  • NVIDIA-SMI 525.105.17, Driver Version: 525.105.17, CUDA Version: 12.0

Hi, I had been running an AutoML job for a few hours until one experiment hit an error and training was aborted (no GPU activity).

However, according to the jobs_metadata/job-id.json file (on the Kubernetes PV at {k8-pv-root}/users/user-name/models/model-name/jobs_metadata/job-id.json):

{
    "id": "8d04c165-9c00-4a42-87b6-b99978e314e4",
    "parent_id": null,
    "action": "train",
    "created_on": "2023-06-28T17:24:32.019439",
    "last_modified": "2023-06-28T17:24:32.028681",
    "status": "Running",
    "result": {}
}

the workflow still seems to think the job is running, even though the failed experiment log (log.txt, 93.0 KB, attached) says it has failed:

Current pipeline object is no longer valid.
	 [[{{node Dali}}]]
0 successful operations.
0 derived errors ignored.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[43892,1],3]
  Exit code:    1
--------------------------------------------------------------------------
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

Note: “Per user-direction” doesn’t make sense to me because I did not abort anything (this happened during the night while I was away). Is this referring to the main process that supervises the training flow?

However, I was able to “stop the job” (with the aim of resuming it) with

endpoint = f"{base_url}/model/{model_ID}/job/{job_id}/cancel"

and resume it with

endpoint = f"{base_url}/model/{model_ID}/job/{job_id}/resume"
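For reference, the full calls look roughly like the sketch below. This is only a sketch: the placeholder values stand in for my real base_url, token, CA bundle and IDs, and I am using POST for both actions (which is what I remember from the default TAO API notebook; it may differ between versions).

import requests

# Placeholders standing in for my real values (the IDs are the ones from this thread).
base_url = "https://<tao-api-host>/api/v1/user/<user-id>"   # hypothetical placeholder
headers = {"Authorization": "Bearer <ngc-token>"}            # auth header, as in the default notebook
rootca = "/path/to/rootca.pem"                               # my CA bundle for TLS verification
model_ID = "1badce1d-11ce-406a-a0ed-ae8567b176f2"
job_id = "8d04c165-9c00-4a42-87b6-b99978e314e4"

# Stop ("cancel") the running AutoML parent job.
endpoint = f"{base_url}/model/{model_ID}/job/{job_id}/cancel"
response = requests.post(endpoint, headers=headers, verify=rootca)
print(response.status_code, response.json())   # I get 200 and {}

# Resume it afterwards.
endpoint = f"{base_url}/model/{model_ID}/job/{job_id}/resume"
response = requests.post(endpoint, headers=headers, verify=rootca)
print(response.status_code, response.json())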

After that I can see that a new experiment has been started.

Before that (before I stopped and resumed the job via the API calls) it looked like this:

and my ClearML dashboard also confirms this (GPU activity ramping up):


**Note: the “_experiment_5” in the ClearML screenshot is a number I chose myself; it does not refer to “experiment_0” to “experiment_4” in the TAO Toolkit PV screenshots in the other pictures.**

My question is:
Is this going to affect the AutoML job (given that there is an “experiment_3” folder in the job directory that did not complete successfully)? I have previously seen something similar when consecutive jobs start (e.g. rather than being queued, the second job starts and the first job halts, but its job metadata stays stuck in the “Running” state); even in this case, after the ephemeral pods were removed following the stop command, the metadata said the job was still running and API calls returned the status as “Running”.
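For reference, this is roughly how I poll the status via the API (a sketch, using the same placeholder variables as in the earlier sketch; it is the same job-list call I show further down in this thread):

import requests

# List the jobs for the model and check the "status" field of the AutoML parent job.
# base_url, headers, rootca and model_ID are the same placeholders as in the sketch above.
endpoint = f"{base_url}/model/{model_ID}/job"
response = requests.get(endpoint, headers=headers, verify=rootca)
for job in response.json():
    print(job["id"], job["status"])   # the stuck AutoML job keeps showing "Running"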

Second question: if what I have done (stopping and resuming via API calls) is fine, is that the recommended way to tackle a problem like this?

UPDATE:
The error happened again at experiment 6


This time the ClearML status is still “Running” (maybe it will automatically revert to “Aborted” after a while with no telemetry).

Experiment log file:
log.txt (92.8 KB)
Pods before stopping and restarting the task (it looks like the pod corresponding to experiment_6 crashed, as shown in the log file attached above, without any recovery behaviour on the part of the parent AutoML task):

NAME                                            READY   STATUS    RESTARTS   AGE
8d04c165-9c00-4a42-87b6-b99978e314e4-m6th5      1/1     Running   0          4h28m
ingress-nginx-controller-5cdbcc9966-lwqn5       1/1     Running   0          20h
tao-toolkit-api-app-pod-6bf85c898-pz8nc         1/1     Running   0          20h
tao-toolkit-api-workflow-pod-5576cfbc4f-j5h78   1/1     Running   0          20h

status.json for the experiment

{"date": "6/29/2023", "time": "13:10:17", "status": "STARTED", "verbosity": "INFO", "message": "Starting Training Loop."}

The jobs_metadata/automl-job-id.json output still says it is running:

{
    "id": "8d04c165-9c00-4a42-87b6-b99978e314e4",
    "parent_id": null,
    "action": "train",
    "created_on": "2023-06-29T09:03:37.642173",
    "last_modified": "2023-06-29T09:03:37.674708",
    "status": "Running",
    "result": {}
}

Afterwards I stopped the job via the REST API call as above (the ephemeral job pods got terminated and then cleaned up):

NAME                                            READY   STATUS    RESTARTS   AGE
ingress-nginx-controller-5cdbcc9966-lwqn5       1/1     Running   0          20h
tao-toolkit-api-app-pod-6bf85c898-pz8nc         1/1     Running   0          20h
tao-toolkit-api-workflow-pod-5576cfbc4f-j5h78   1/1     Running   0          20h

After resuming, new pods spun up (8d04c165-9c00-4a42-87b6-b99978e314e4-49shh for the parent AutoML job and 516a3573-4530-4b28-93e6-bbfefbeebd78-xhvgx for what is now “experiment_7”):

NAME                                            READY   STATUS    RESTARTS   AGE
516a3573-4530-4b28-93e6-bbfefbeebd78-xhvgx      1/1     Running   0          85s
8d04c165-9c00-4a42-87b6-b99978e314e4-49shh      1/1     Running   0          92s
ingress-nginx-controller-5cdbcc9966-lwqn5       1/1     Running   0          20h
tao-toolkit-api-app-pod-6bf85c898-pz8nc         1/1     Running   0          20h
tao-toolkit-api-workflow-pod-5576cfbc4f-j5h78   1/1     Running   0          20h

The new experiment started and the ClearML job list updated accordingly, but the non-working one is still in the “Running” state (I later manually changed its status in ClearML to “Aborted”). This time the status.json of the crashed experiment also stayed the same ({"date": "6/29/2023", "time": "13:10:17", "status": "STARTED", "verbosity": "INFO", "message": "Starting Training Loop."}).

Again, is this fine? Is it OK to keep stopping and resuming to get the AutoML job out of the crashed/stuck state? Will the results at the end still be useful, or are they affected because not all experiments ran (or is it perhaps a good thing they failed, if numerical instability from the sampled hyperparameters caused the crash)?

Suggestion: would deleting the folder and the metadata of the failed experiment help the job continue properly rather than skipping it?

Firstly, let’s check the error in the training.

2023-06-28 22:52:37,955 [INFO] root: Starting Training Loop.
Epoch 1/85
DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing CPU operator random__Uniform, instance name: "__Uniform_5", encountered:
[/opt/dali/dali/operators/random/uniform_distribution.h:148] Assert on "end > start" failed: Invalid range [0, 0).
Stacktrace (10 entries):
[frame 0]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x413ace) [0x7fe94418dace]
[frame 1]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x246a26f) [0x7fe9461e426f]
[frame 2]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali_operators.so(+0x246a620) [0x7fe9461e4620]
[frame 3]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(void dali::Executor<dali::AOT_WS_Policy<dali::UniformQueuePolicy>, dali::UniformQueuePolicy>::RunHelper<dali::HostWorkspace>(dali::OpNode&, dali::HostWorkspace&)+0x7c5) [0x7feae6586765]
[frame 4]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(dali::Executor<dali::AOT_WS_Policy<dali::UniformQueuePolicy>, dali::UniformQueuePolicy>::RunCPU()+0x354) [0x7feae658a0e4]
[frame 5]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0xbc4bd) [0x7feae65434bd]
[frame 6]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x138f7c) [0x7feae65bff7c]
[frame 7]: /usr/local/lib/python3.6/dist-packages/nvidia/dali/libdali.so(+0x8a213f) [0x7feae6d2913f]
[frame 8]: /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7feb129ba609]
[frame 9]: /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7feb12af4133]

For SSD, please change the tfrecord format to the sequence format, i.e., use the format below:

 data_sources: {
    label_directory_path: "/workspace/tao-experiments/Data/kitti_data/training/label"
    image_directory_path: "/workspace/tao-experiments/Data/kitti_data/training/image"
 }

Can this change be done through a REST API call?

In my Kubernetes PV directory ({PV_ROOT}/users/f2d3c55a-f3dd-5dff-badc-851e27460122/models/1badce1d-11ce-406a-a0ed-ae8567b176f2/specs)

I have these files:

-rwxrwxrwx 1 root     root     2065 Jun 28 10:43 1d629677-b01f-4e32-a498-e554cc5776b9.yaml
-rwxrwxrwx 1 root     root     2224 Jun 28 11:42 29e8e57c-1167-4b1d-99df-46f690053f26.yaml
-rwxrwxrwx 1 root     root     2065 Jun 28 09:58 4151a81f-d9f5-440a-937c-e073fd3a943e.yaml
-rwxrwxrwx 1 root     root     2225 Jun 28 12:14 7a0e6372-d823-452a-bac0-68c4a3f30d01.yaml
-rwxrwxrwx 1 root     root     2065 Jun 28 11:06 b2d2bbe2-574c-420f-ac11-a552378dd74e.yaml
-rwxrwxrwx 1 root     root     2065 Jun 27 17:32 c8e17b52-59ce-4199-b713-34f411768d37.yaml
-rwxrwxrwx 1 www-data www-data 2666 Jun 28 18:24 train.json

When I upload the spec via the API call (with endpoint = f"{base_url}/model/{model_ID}/specs/train") I can see that it goes into the train.json file in the specs directory for the model, and it does not have the “data_sources” field you mentioned in your response.

train.json:

{
    "version": "1",
    "random_seed": 42,
    "dataset_config": {
        "target_class_mapping": [
          ...
        ],
        "include_difficult_in_training": true
    },
    "training_config": {
        "batch_size_per_gpu": 10,
        "num_epochs": 85,
        "enable_qat": false,
        "learning_rate": {
            "soft_start_annealing_schedule": {
                "min_learning_rate": 5e-05,
                "max_learning_rate": 0.009,
                "soft_start": 0.1,
                "annealing": 0.8
            }
        },
        "regularizer": {
            "type": "__L1__",
            "weight": 3e-05
        },
        "checkpoint_interval": 10,
        "max_queue_size": 16,
        "n_workers": 8,
        "visualizer": {
            "num_images": 3,
            "enabled": true,
            "clearml_config": {
                "project": "nozzlenet_0_3_1",
                "tags": [
                    "training",
                    "tao_toolkit"
                ],
                "task": "training_x_0.3.1_experiment_5_automl_enabled"
            }
        }
    },
    "eval_config": {
        "average_precision_mode": "__SAMPLE__",
        "validation_period_during_training": 20,
        "batch_size": 16,
        "matching_iou_threshold": 0.5
    },
    "nms_config": {
        "confidence_threshold": 0.01,
        "clustering_iou_threshold": 0.6,
        "top_k": 200
    },
    "augmentation_config": {
        "output_width": 960,
        "output_height": 544,
        "output_channel": 3,
        "random_crop_min_scale": 0.3,
        "random_crop_max_scale": 1.0,
        "random_crop_min_ar": 0.5,
        "random_crop_max_ar": 2.0,
        "zoom_out_min_scale": 1.0,
        "zoom_out_max_scale": 4.0,
        "brightness": 32,
        "contrast": 0.5,
        "saturation": 0.5,
        "hue": 18
    },
    "ssd_config": {
        "aspect_ratios_global": "[1.0,2.0,0.5,3.0,1.0/3.0]",
        "two_boxes_for_ar1": true,
        "clip_boxes": false,
        "variances": "[0.1,0.1,0.2,0.2]",
        "scales": "[0.05,0.1,0.25,0.4,0.55,0.7,0.85]",
        "arch": "resnet",
        "nlayers": 18,
        "freeze_bn": false,
        "freeze_blocks": [
            0
        ]
    }
}

However, I also see the other files below, but I can't see them being referenced anywhere (I did a grep -rn search):

1d629677-b01f-4e32-a498-e554cc5776b9.yaml
4151a81f-d9f5-440a-937c-e073fd3a943e.yaml
b2d2bbe2-574c-420f-ac11-a552378dd74e.yaml 
29e8e57c-1167-4b1d-99df-46f690053f26.yaml 
7a0e6372-d823-452a-bac0-68c4a3f30d01.yaml
c8e17b52-59ce-4199-b713-34f411768d37.yaml

If I pick one of them, say c8e17b52-59ce-4199-b713-34f411768d37.yaml:

random_seed: 42
dataset_config {
  target_class_mapping {
  ...
  }
  target_class_mapping {
   ...
  }

 ...

  target_class_mapping {
    ...
  }
  include_difficult_in_training: True
  data_sources {
    tfrecords_path: "/shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/d0d476bb-20a0-41d3-997d-5781c57903e9/tfrecords/tfrecords-*"
  }
  validation_data_sources {
    label_directory_path: "/shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/33e4a372-ea20-4ac6-9cde-968a2675a472/labels"
    image_directory_path: "/shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/33e4a372-ea20-4ac6-9cde-968a2675a472/images"
  }
}
training_config {
  batch_size_per_gpu: 10
  num_epochs: 10
  enable_qat: False
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-05
      max_learning_rate: 0.009
      soft_start: 0.1
      annealing: 0.8
    }
  }
  regularizer {
    type: L1
    weight: 3e-05
  }
  checkpoint_interval: 10
  max_queue_size: 16
  n_workers: 8
  visualizer {
    num_images: 3
    enabled: False
  }
}
eval_config {
  average_precision_mode: SAMPLE
  validation_period_during_training: 5
  batch_size: 16
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.01
  clustering_iou_threshold: 0.6
  top_k: 200
}
augmentation_config {
  output_width: 960
  output_height: 544
  output_channel: 3
  random_crop_min_scale: 0.3
  random_crop_max_scale: 1.0
  random_crop_min_ar: 0.5
  random_crop_max_ar: 2.0
  zoom_out_min_scale: 1.0
  zoom_out_max_scale: 4.0
  brightness: 32
  contrast: 0.5
  saturation: 0.5
  hue: 18
}
ssd_config {
  aspect_ratios_global: "[1.0,2.0,0.5,3.0,1.0/3.0]"
  two_boxes_for_ar1: True
  clip_boxes: False
  variances: "[0.1,0.1,0.2,0.2]"
  scales: "[0.05,0.1,0.25,0.4,0.55,0.7,0.85]"
  arch: "resnet"
  nlayers: 18
  freeze_bn: False
  freeze_blocks: 0
}

I can see that it has the "data_sources" field you mention:

 data_sources {
    tfrecords_path: "/shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/d0d476bb-20a0-41d3-997d-5781c57903e9/tfrecords/tfrecords-*"
  }

I can also see this appearing in Action Specs - NVIDIA Docs.

So where exactly do I make the change you suggest? Also, may I ask why you think this may be crashing the AutoML job from time to time?

Cheers,
Ganindu.

For SSD and DSSD, it is expected to use the sequence format instead of tfrecord files.
Similar info is posted in DSSD resume error - #30 by Morganh

For changing the spec file, you can find a similar cell in the notebook:

`# Customize train model specs`

Then apply the changes.

Hi, thanks a lot for the quick reply.

I changed the training spec with these lines in my Python code:

specs["dataset_config"]["data_sources"] = {}
specs["dataset_config"]["data_sources"]["image_directory_path"] = "/shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/d0d476bb-20a0-41d3-997d-5781c57903e9/images"
specs["dataset_config"]["data_sources"]["label_directory_path"] = "/shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/d0d476bb-20a0-41d3-997d-5781c57903e9/labels"

When I do the POST call I can see that the train.json file in the TAO PV (../models/model-id/specs/train.json) gets the updated data_sources field:

{
    "version": "1",
    "random_seed": 42,
    "dataset_config": {
        "target_class_mapping": [
                ...
        ],
        "include_difficult_in_training": true,
        "data_sources": {
            "image_directory_path": "/shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/d0d476bb-20a0-41d3-997d-5781c57903e9/images",
            "label_directory_path": "/shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/d0d476bb-20a0-41d3-997d-5781c57903e9/labels"
        }
    },
    "training_config": {
        "batch_size_per_gpu": 10,
        "num_epochs": 85,
        "enable_qat": false,
        "learning_rate": {
            "soft_start_annealing_schedule": {
                "min_learning_rate": 5e-05,
                "max_learning_rate": 0.009,
                "soft_start": 0.1,
                "annealing": 0.8
            }
        },
...

I then stopped and resumed the parent training job (I'm not sure whether I need to delete the parent AutoML job and start a new one for the changes to propagate from train.json into the experiments).

A new pod for the new AutoML sub-task (experiment_10) then started.

I then checked ‘recommendation_10.kitti’ (because it corresponds to the currently active experiment run) and it still had tfrecords as the data source:

random_seed: 42
dataset_config {
  target_class_mapping {
     ...
  }
  include_difficult_in_training: True
  data_sources {
    tfrecords_path: "/shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/d0d476bb-20a0-41d3-997d-5781c57903e9/tfrecords/tfrecords-*"
  }
  validation_data_sources {
    label_directory_path: "/shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/33e4a372-ea20-4ac6-9cde-968a2675a472/labels"
    image_directory_path: "/shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/33e4a372-ea20-4ac6-9cde-968a2675a472/images"
  }
}
training_config {
  batch_size_per_gpu: 10
  num_epochs: 85
  enable_qat: False
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 2.522826753345558e-05
      max_learning_rate: 0.009
      soft_start: 0.0012354747837804941
      annealing: 0.8
    }
  }
  regularizer {
    type: L2
    weight: 1.2143754237704272e-05
  }
  checkpoint_interval: 10
  max_queue_size: 16
  n_workers: 8
  visualizer {
    num_images: 3
    enabled: True
    clearml_config {
      project: "nozzlenet_0_3_1"
      tags: "training"
      tags: "tao_toolkit"
      task: "training_nozzlenet_0.3.1_experiment_5_automl_enabled"
    }
  }
}
eval_config {
  average_precision_mode: SAMPLE
  validation_period_during_training: 20
  batch_size: 16
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.01
  clustering_iou_threshold: 0.6
  top_k: 200
}
augmentation_config {
  output_width: 960
  output_height: 544
  output_channel: 3
  random_crop_min_scale: 0.3367314893420588
  random_crop_max_scale: 0.6026062939329949
  random_crop_min_ar: 0.5
  random_crop_max_ar: 9.815633479815645
  zoom_out_min_scale: 1.0
  zoom_out_max_scale: 2.994642451458644
  brightness: 32
  contrast: 0.5
  saturation: 0.5
  hue: 18
}
ssd_config {
  aspect_ratios_global: "[1.0,2.0,0.5,3.0,1.0/3.0]"
  two_boxes_for_ar1: True
  clip_boxes: False
  variances: "[0.1,0.1,0.2,0.2]"
  scales: "[0.05,0.1,0.25,0.4,0.55,0.7,0.85]"
  arch: "resnet"
  nlayers: 18
  freeze_bn: False
  freeze_blocks: 0
}

Am I doing this wrong (posting the updated specs in an incorrect way)? Do I need to restart the parent AutoML job (not just stop and resume it) to refresh the experiment pods? Do I need to do something completely different? Or should I perhaps change the dataset config instead of the model config?

Apologies if I am following the instructions incorrectly; if so, please spoonfeed me a little because I may be a bit lost here.

Cheers,
Ganindu.

P.S.

Also, I noticed that the parent job (the main AutoML job) cannot be deleted (I think because the AutoML job is always in the “Running” state).

Cancel (endpoint = f"{base_url}/model/{model_ID}/job/{job_id}/cancel") works and returns:

<Response [200]>
{}

Delete (shown below) fails:

endpoint = f"{base_url}/model/{model_ID}/job/{job_id}"
response = requests.delete(endpoint, headers=headers, verify=rootca)

returns

<Response [400]>
{
    "error_code": 400,
    "error_desc": "job cannot be deleted"
}

And even after the delete attempt, the job list always shows it as “Running”:

endpoint = f"{base_url}/model/{model_ID}/job"
response = requests.get(endpoint, headers=headers, verify=rootca)

and returns:

<Response [200]>
[
    {
        "action": "train",
        "created_on": "2023-06-30T11:16:13.859260",
        "id": "8d04c165-9c00-4a42-87b6-b99978e314e4",
        "last_modified": "2023-06-30T11:16:13.901460",
        "parent_id": null,
        "result": {},
        "status": "Running"
    }
]

UPDATE 2:

Even if I create a new model and a new AutoML job, the data source in recommendation_0.kitti is:

data_sources {
    tfrecords_path: "/shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/d0d476bb-20a0-41d3-997d-5781c57903e9/tfrecords/tfrecords-*"
  }

Maybe the change has to be made at the dataset level, not at the model stage?

UPDATE 2.1:

I had a look at the model directory and didn’t have much luck finding a location where I could override the tfrecords.

Could you please share the latest notebook? Thanks a lot. I will try to reproduce on my side.

Will do, in a private message (in case I get sloppy and don’t strip out everything that isn’t useful here).

Update: The training stage can use tfrecord files. Only the validation dataset needs to use the sequence format. In other words, SSD and DSSD currently do not support running evaluation against tfrecord files during training.

I can run the default notebook (GPU-optimized AI, Machine Learning, & HPC Software | NVIDIA NGC) successfully. Please use it to run your dataset again, to check whether the error log below still appears.

Epoch 1/85
DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing CPU operator random__Uniform, instance name: "__Uniform_5", encountered:
[/opt/dali/dali/operators/random/uniform_distribution.h:148] Assert on "end > start" failed: Invalid range [0, 0).

By evaluation do you mean the AP calculations every n epochs that we see in the logs/terminal and use in the telemetry plots (in ClearML)? Does that mean those AP calculations are incorrect? 🧐

Cheers,
Ganindu.

No, I just mean that for SSD or DSSD, validation_data_sources can only support the format below. This is the sequence format; the tfrecord format is not supported.

  validation_data_sources {
    label_directory_path: "/shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/33e4a372-ea20-4ac6-9cde-968a2675a472/labels"
    image_directory_path: "/shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/datasets/33e4a372-ea20-4ac6-9cde-968a2675a472/images"
  }

For the above error log (Invalid range [0, 0)), there is a known issue which we will fix in the next release. Please refer to the AutoML notebook below:
GPU-optimized AI, Machine Learning, & HPC Software | NVIDIA NGC

Refer to these hyperlinks to see the parameters supported by each network, and add more parameters if necessary in addition to the default AutoML-enabled parameters:
[SSD](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_action_specs.html#id62) 

When you reach the cell below, please put augmentation_config.hue into remove_default_automl_parameters:

remove_default_automl_parameters = [] #Remove any hyperparameters that are enabled by default for AutoML

The workaround is to disable this parameter for AutoML.

OK, that is good news. So the

specs["dataset_config"]["data_sources"] = "..."

change not propagating to the child job specs of the AutoML job did not create any material harm (I have now removed those lines).

I believe

specs["dataset_config"]["validation_data_sources"] = "..."

is not needed, as I can see that it has already been populated in the recommendation_n.kitti files with the correct values.

So, TL;DR:

we don’t need to meddle with the data sources in the context of the problem I am currently facing.

In this case, I can see that the hue delta (if I understood correctly) in the default config under “augmentation_config” is 18:

"augmentation_config": {
            "automl_default_parameters": [
                "augmentation_config.random_crop_min_scale"
            ],
            "default": {
                "brightness": 32,
                "contrast": 0.5,
                "hue": 18,
                "output_channel": 3,
                "output_height": 544,
                "output_width": 960,
                "random_crop_max_ar": 2.0,
                "random_crop_max_scale": 1.0,
                "random_crop_min_ar": 0.5,
                "random_crop_min_scale": 0.3,
                "saturation": 0.5,
                "zoom_out_max_scale": 4.0,
                "zoom_out_min_scale": 1.0

and within the hue section under the same heading it is constrained to the range 0 to 180:

                "hue": {
                    "default": 18,
                    "description": "Hue delta in color jittering augmentation",
                    "maximum": 180,
                    "minimum": 0,
                    "title": "Hue",
                    "type": "integer"

If I go to the logs of the failed experiment I can see that the hue delta is still 18, i.e. the value used in the failed experiment (experiment_3) was 18:

  augmentation_config {
  output_width: 960
  output_height: 544
  output_channel: 3
  random_crop_min_scale: 0.1287167832704614
  random_crop_max_scale: 0.8258323310869657
  random_crop_min_ar: 0.5
  random_crop_max_ar: 1.940118528255783
  zoom_out_min_scale: 1.0
  zoom_out_max_scale: 1.9771026407043837
  brightness: 32
  contrast: 0.5
  saturation: 0.5
  hue: 18
}

So is it breaking because the hue delta is hitting 0? (Just curious.)

Also, in controller.json we have:

        "id":3,
        "job_id":"736c0d0e-fe4d-45a0-bc7a-9eca0f03c18a",
        "result":0.0,
        "specs":{
            "augmentation_config.random_crop_max_ar":1.940118528255783,
            "augmentation_config.random_crop_max_scale":0.8258323310869657,
            "augmentation_config.random_crop_min_scale":0.1287167832704614,
            "augmentation_config.zoom_out_max_scale":1.9771026407043837,
            "training_config.learning_rate.soft_start_annealing_schedule.min_learning_rate":1.4812466900605683e-05,
            "training_config.learning_rate.soft_start_annealing_schedule.soft_start":0.08723594214120801,
            "training_config.regularizer.type":"__L1__",
            "training_config.regularizer.weight":5.348097631693618e-06
        },
        "status":"started"
    }

It does not look like the hue delta in augmentation_config was touched (other parameters were changed while hue was left unchanged), so there was effectively no delta (hue delta == 0)?

Anyway, I will add the line

remove_default_automl_parameters = ['augmentation_config.hue']

to my config; the changes now look as shown below:

additional_automl_parameters = ['augmentation_config.random_crop_max_scale', 'augmentation_config.random_crop_max_ar', 'augmentation_config.zoom_out_max_scale'] #Refer to parameter list mentioned in the above links and add any extra parameter in addition to the default enabled ones
remove_default_automl_parameters = ['augmentation_config.hue']

I will do another run and let you know!

(Hopefully the comments I made about the changes I applied per your suggestions are clear; please let me know if they aren’t.)

Cheers,
(attached log files)
automl_job_and_failed_experiment.zip (17.7 KB)

Also, what does “recommendation_n.kitti” do? It seems it is populated before the run (does it update as the training run goes on?).

# recommendation_0.kitti (completed)

augmentation_config {
  output_width: 960
  output_height: 544
  output_channel: 3
  random_crop_min_scale: 0.3367314893420588
  random_crop_max_scale: 0.6026062939329949
  random_crop_min_ar: 0.5
  random_crop_max_ar: 9.815633479815645
  zoom_out_min_scale: 1.0
  zoom_out_max_scale: 2.994642451458644
  brightness: 32
  contrast: 0.5
  saturation: 0.5
  hue: 18
}

# recommendation_1.kitti (completed)

augmentation_config {
  output_width: 960
  output_height: 544
  output_channel: 3
  random_crop_min_scale: 0.14269537759734444
  random_crop_max_scale: 0.27145220896612043
  random_crop_min_ar: 0.5
  random_crop_max_ar: 2.320516477270044
  zoom_out_min_scale: 1.0
  zoom_out_max_scale: 2.160298015452229
  brightness: 32
  contrast: 0.5
  saturation: 0.5
  hue: 18
}

# recommendation_2.kitti (completed)

augmentation_config {
  output_width: 960
  output_height: 544
  output_channel: 3
  random_crop_min_scale: 0.08529945631107984
  random_crop_max_scale: 0.25564709031113175
  random_crop_min_ar: 0.5
  random_crop_max_ar: 1.3770861201923306
  zoom_out_min_scale: 1.0
  zoom_out_max_scale: 3.9457875544654337
  brightness: 32
  contrast: 0.5
  saturation: 0.5
  hue: 18
}

# recommendation_3.kitti (failed)

augmentation_config {
  output_width: 960
  output_height: 544
  output_channel: 3
  random_crop_min_scale: 0.1287167832704614
  random_crop_max_scale: 0.8258323310869657
  random_crop_min_ar: 0.5
  random_crop_max_ar: 1.940118528255783
  zoom_out_min_scale: 1.0
  zoom_out_max_scale: 1.9771026407043837
  brightness: 32
  contrast: 0.5
  saturation: 0.5
  hue: 18
}

UPDATE:

  1. Opened an issue in the NVIDIA DALI repo. I assume @Morganh is still on the case.

  2. I am now trying to replicate the issue with the FLIR dataset (used in this notebook), in case the issue has to do with our dataset (maybe its size, type, or some label attribute). It is still not clear why the issue is intermittent (it happens only in certain runs).

Update: I am running with the FLIR dataset in this notebook and have not reproduced the error yet. I did not even set anything in remove_default_automl_parameters.

Please put augmentation_config.random_crop_min_scale into remove_default_automl_parameters.
Update: The issue is gone, according to the offline feedback, when running your own dataset.

Yes! I was able to complete a run with that, and afterwards with

additional_automl_parameters = [] # Refer to parameter list mentioned in the above links and add any extra parameter in addition to the default enabled ones
remove_default_automl_parameters = [] # Remove any hyperparameters that are enabled by default for AutoML

too, despite having some failed child jobs (according to the logs); I still got a full AutoML run and the best model was saved in the model folder.

TL;DR: It may not be perfect, but it seems to work despite the error messages. (Full logs are shared with @Morganh.)
