Once the best model is selected by AutoML, what is the best way to evaluate it (and convert it to something that can be used with TensorRT on Jetson)?


• Hardware (T4/V100/Xavier/Nano/etc): DGX A100
• Network Type: SSD
• TAO Version: 4.0.2

Context:

I have made several AutoML runs and got output with the following directory structure:

├── automl_metadata.json
├── brain.json
├── controller.json
│ ...............
├── best_model
│   ├── log.txt
│   ├── recommendation_n.kitti
│   ├── status.json
│   ├── .....
│   └── weights
│       └── weights.tlt
│   ...............
├── experiment_0
│   ├── log.txt
│   ├── status.json
│   ├── .....
│   ...............
├── recommendation_0.kitti
├── recommendation_1.kitti
├── recommendation_2.kitti
..............
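
As a rough sketch of how the best-model artefacts in the tree above could be located programmatically: the helper name `find_best_model_files` is hypothetical, and the glob patterns simply mirror the file names shown in the listing (`recommendation_n.kitti` and `weights/weights.tlt` under `best_model`). Other TAO versions may lay things out differently.

```python
from pathlib import Path


def find_best_model_files(automl_job_dir):
    """Return (spec files, weight files) found under <automl_job_dir>/best_model.

    Assumes the layout shown in the directory listing above.
    """
    best = Path(automl_job_dir) / "best_model"
    # recommendation_n.kitti: the spec recommended for the best experiment
    specs = sorted(best.glob("recommendation_*.kitti"))
    # weights/*.tlt: the weights saved for the best experiment
    weights = sorted((best / "weights").glob("*.tlt"))
    return specs, weights
```

Calling `find_best_model_files("<path to the AutoML job directory>")` returns two lists of paths. Either list is empty if nothing matches.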

Once we have the best model recommendation spec, we can create a non-AutoML train job, apply the “best” specs, run a training job, and then evaluate and run inference.

We can do the above using the following cell from the example notebooks:

parent = train_job_id
actions = ["evaluate"]
data = json.dumps({"job": parent, "actions": actions})
endpoint = f"{base_url}/model/{model_ID}/job"
response = requests.post(endpoint, headers=headers, data=data, verify=rootca)
print(response)
print(json.dumps(response.json(), sort_keys=True, indent=4))

This creates an eval job. A pod is created in the k8s cluster (its UUID is the same as the eval job ID), which then completes after the job has run. This is later confirmed in telemetry.

In the job directory (inside the k8s PV) we have the following structure:

73a30571-1430-43fd-bcdc-eb4b7de8c4fe/
└── status.json

The status.json file contents are:

{"date": "7/14/2023", "time": "10:46:3", "status": "STARTED", "verbosity": "INFO", "message": "Starting SSD evaluation."}
{"date": "7/14/2023", "time": "10:49:39", "status": "SUCCESS", "verbosity": "INFO", "message": "Evaluation finished successfully.", "kpi": {"mAP": 0.2454126787090054}}
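
Since status.json here holds one JSON object per line rather than a single JSON document, a plain `json.load` on the whole file would fail. Here is a minimal sketch for reading it and pulling out the last status and KPI. The helper names are hypothetical, and it assumes the last line always reflects the current state, as in the two lines above.

```python
import json


def read_status_lines(text):
    """Parse status.json content that has one JSON object per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def final_status(entries):
    """Return (status, kpi) from the last entry; kpi is None if absent."""
    last = entries[-1]
    return last["status"], last.get("kpi")


# The two lines from the evaluation job above, used as sample input.
sample = "\n".join([
    '{"date": "7/14/2023", "time": "10:46:3", "status": "STARTED", '
    '"verbosity": "INFO", "message": "Starting SSD evaluation."}',
    '{"date": "7/14/2023", "time": "10:49:39", "status": "SUCCESS", '
    '"verbosity": "INFO", "message": "Evaluation finished successfully.", '
    '"kpi": {"mAP": 0.2454126787090054}}',
])
status, kpi = final_status(read_status_lines(sample))
print(status, kpi)
```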

This pattern works for inference as well: the job is run and the artefacts are created in the job directory.

ac60ed9b-deaf-410e-9b68-c98109c16b4c/
├── images_annotated
│   ├── frame_000063.png
│   ├── frame_000072.png
│   ├── . . . . . . . . .
│   ├── . . . . . . . . .
│   └── frame_014625.png
├── labels
│   ├── frame_000063.txt
│   ├── frame_000072.txt
│   ├── . . . . . . . . .
│   ├── . . . . . . . . .
│   └── frame_014625.txt
└── status.json


The status.json is updated accordingly:

{"date": "7/14/2023", "time": "11:29:48", "status": "STARTED", "verbosity": "INFO", "message": "Starting SSD Inference."}
{"date": "7/14/2023", "time": "11:33:34", "status": "SUCCESS", "verbosity": "INFO", "message": "Inference finished successfully."}
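
A quick sanity check on an inference job directory like the one above could look like the sketch below. It assumes, based only on the listing, that every annotated image is expected to have a label file with the same stem (frame_000063.png and frame_000063.txt). The helper name is hypothetical.

```python
from pathlib import Path


def unmatched_outputs(job_dir):
    """Return (images without a label file, labels without an image), by stem."""
    job = Path(job_dir)
    images = {p.stem for p in (job / "images_annotated").glob("*.png")}
    labels = {p.stem for p in (job / "labels").glob("*.txt")}
    # Set differences give the stems present on one side only.
    return sorted(images - labels), sorted(labels - images)
```

Two empty lists mean the images and labels pair up one to one.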

However, things are not as straightforward when running an AutoML job. According to the blog post, all we need to do is copy the job ID of the AutoML job and use it in the evaluation cells. (In defence of the article, it uses the CLI method, not the API method.)

the method is

But when I apply this method:

# run evaluation job
parent = automl_job_id
actions = ["evaluate"]
data = json.dumps({"job": parent, "actions": actions})
endpoint = f"{base_url}/model/{model_ID}/job"
response = requests.post(endpoint, headers=headers, data=data, verify=rootca)
print(response)
print(json.dumps(response.json(), sort_keys=True, indent=4))

I do not get beyond the STARTED status.

I also notice that a worker pod with the job ID is created, lives for around 15 seconds, and completes (then gets garbage collected).

The job-id/status.json is stuck on:

{"date": "7/14/2023", "time": "12:58:34", "status": "STARTED", "verbosity": "INFO", "message": "Starting SSD evaluation."}

logs/job-id.txt shows an exit on error.

2023-07-14 12:58:28.674118: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
_init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
2023-07-14 12:58:34,717 [INFO] iva.ssd.utils.spec_loader: Merging specification from /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/models/126c4600-a47d-4255-94d7-1b83946494a5/specs/5278eb86-68f7-4e12-96a1-08317fe5d903.yaml
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:95: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

2023-07-14 12:58:34,719 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:95: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:98: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

2023-07-14 12:58:34,719 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:98: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:102: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2023-07-14 12:58:34,721 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:102: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2023-07-14 12:58:34,721 [INFO] root: Starting SSD evaluation.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

2023-07-14 12:58:34,723 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

2023-07-14 12:58:34,730 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

2023-07-14 12:58:34,744 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

2023-07-14 12:58:35,168 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

2023-07-14 12:58:35,310 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:181: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

2023-07-14 12:58:35,310 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:181: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:186: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2023-07-14 12:58:35,310 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:186: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

2023-07-14 12:58:35,994 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

2023-07-14 12:58:35,995 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

2023-07-14 12:58:36,647 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

2023-07-14 12:58:37,003 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3295: The name tf.log is deprecated. Please use tf.math.log instead.

2023-07-14 12:58:37,005 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3295: The name tf.log is deprecated. Please use tf.math.log instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

2023-07-14 12:58:37,574 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

2023-07-14 12:58:37,688 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1607, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape must be rank 4 but is rank 3 for 'decoded_predictions/batched_nms/CombinedNonMaxSuppression' (op: 'CombinedNonMaxSuppression') with input shapes: [?,1,4], [?,163], [], [], [], [].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/iva/ssd/scripts/evaluate.py>", line 3, in <module>
  File "<frozen iva.ssd.scripts.evaluate>", line 260, in <module>
  File "<frozen iva.common.utils>", line 707, in return_func
  File "<frozen iva.common.utils>", line 695, in return_func
  File "<frozen iva.ssd.scripts.evaluate>", line 256, in main
  File "<frozen iva.ssd.scripts.evaluate>", line 141, in evaluate
  File "<frozen iva.ssd.builders.eval_builder>", line 35, in build
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/base_layer.py", line 457, in __call__
    output = self.call(inputs, **kwargs)
  File "<frozen iva.ssd.box_coder.output_decoder_layer>", line 115, in call
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/image_ops_impl.py", line 4029, in combined_non_max_suppression
    score_threshold, pad_per_class, clip_boxes)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_image_ops.py", line 570, in combined_non_max_suppression
    clip_boxes=clip_boxes, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1770, in __init__
    control_input_ops)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1610, in _create_c_op
    raise ValueError(str(e))
ValueError: Shape must be rank 4 but is rank 3 for 'decoded_predictions/batched_nms/CombinedNonMaxSuppression' (op: 'CombinedNonMaxSuppression') with input shapes: [?,1,4], [?,163], [], [], [], [].
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: __init__() missing 4 required positional arguments: 'code', 'msg', 'hdrs', and 'fp'
Execution status: FAIL

EOF
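
The deprecation warnings bury the actual failure, so a small filter helps when going through logs like this one. The sketch below is not part of TAO. It only assumes what is visible in the log above: a line starting with "Execution status:" and Python exception lines of the form "SomeError: message".

```python
def summarize_log(text):
    """Return (execution status, last exception line) from a job log."""
    status, error = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Execution status:"):
            status = line.split(":", 1)[1].strip()
            continue
        # Exception lines look like "ValueError: ..." or "pkg.mod.SomeError: ...":
        # the part before the first colon has no spaces and ends with "Error".
        head = line.split(":", 1)[0]
        if ":" in line and head.endswith("Error") and " " not in head:
            error = line
    return status, error
```

On the log above this should pick out the final ValueError line and the FAIL status, which is the part worth quoting in a support request.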

The normal jobs, the AutoML jobs, and the subsequent evaluation and inference attempts were all made with the same dataset, model, and hardware.

My questions are:

  1. Is the output of the successful (non-AutoML) evaluation correct? I only got the mAP value, not the APs for the individual classes.

  2. Is there a way to run eval/inference on an AutoML job without making a non-AutoML run with the best specs from the AutoML run?

  3. What is your recommendation (e.g. always do a normal (non-AutoML) training run with the specs and use that for eval/train/prune/convert/inference etc.)?

Cheers,
Ganindu.

For 1), the contents of status.json depend on the specific network you are training. In your case, the SSD network only writes mAP to status.json. For the AP values, please check the full training log.

For 2), the upcoming TAO 5.0 will support it.

For 3), for 4.0 you can use the end2end notebook. It does a normal (non-AutoML) training run and then runs eval/train/etc.

Thanks!! Straightforward answer!! Case closed!!
