Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc): DGX A100
• Network Type: SSD
• TAO Version: 4.0.2
Context:
I have made several AutoML runs and got output files with the following structure:
├── automl_metadata.json
├── brain.json
├── controller.json
│ ...............
├── best_model
│ ├── log.txt
│ ├── recommendation_n.kitti
│ ├── status.json
│ ├── .....
│ └── weights
│ └── weights.tlt
│ ...............
├── experiment_0
│ ├── log.txt
│ ├── status.json
│ ├── .....
│ ...............
├── recommendation_0.kitti
├── recommendation_1.kitti
├── recommendation_2.kitti
..............
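For reference, the pieces I care about from this tree are the winning spec and weights under best_model/. A minimal sketch of how I pick them up (the local mount path is a placeholder, and the single-match assumption on the glob is mine):

from pathlib import Path

# Hypothetical local mount of the AutoML results directory shown above
automl_results = Path("/path/to/automl_results")
best_dir = automl_results / "best_model"

# The winning recommendation spec (recommendation_n.kitti) and its weights
best_spec = next(best_dir.glob("recommendation_*.kitti"))
best_weights = best_dir / "weights" / "weights.tlt"

print(f"best spec:    {best_spec}")
print(f"best weights: {best_weights}")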
Once we have the best model recommendation spec, we can create a non-AutoML train job, apply the "best" specs, run a training job, and then evaluate and run inference. We can do the above using the following cell from the example notebooks:
# run an evaluation job on the completed (non-AutoML) train job
parent = train_job_id
actions = ["evaluate"]
data = json.dumps({"job": parent, "actions": actions})
endpoint = f"{base_url}/model/{model_ID}/job"
response = requests.post(endpoint, headers=headers, data=data, verify=rootca)
print(response)
print(json.dumps(response.json(), sort_keys=True, indent=4))
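After submitting, the job can also be monitored over the API. A minimal polling sketch, assuming the same GET /model/{model_ID}/job/{job_id} route the example notebooks use for monitoring, that the POST above returns the new job id, and that the returned record carries a "status" field:

import time

eval_job_id = response.json()  # assumed: the POST above returns the new job id

# Poll the job until it leaves the running states (route, "status" field and
# its values are assumed from the monitoring cells in the example notebooks)
endpoint = f"{base_url}/model/{model_ID}/job/{eval_job_id}"
while True:
    resp = requests.get(endpoint, headers=headers, verify=rootca)
    status = resp.json().get("status")
    print(status)
    if status in ("Done", "Error"):
        break
    time.sleep(15)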
This creates an eval job. A pod is created in the k8s cluster (with the same UUID as the eval job id), which completes after the job is run; this is later confirmed in telemetry. In the job directory (inside the k8s PV) we have the following structure:
73a30571-1430-43fd-bcdc-eb4b7de8c4fe/
└── status.json
The status.json file contents are:
{"date": "7/14/2023", "time": "10:46:3", "status": "STARTED", "verbosity": "INFO", "message": "Starting SSD evaluation."}
{"date": "7/14/2023", "time": "10:49:39", "status": "SUCCESS", "verbosity": "INFO", "message": "Evaluation finished successfully.", "kpi": {"mAP": 0.2454126787090054}}
This pattern works for inference as well (the job is run and the artefacts are created in the job directory):
ac60ed9b-deaf-410e-9b68-c98109c16b4c/
├── images_annotated
│ ├── frame_000063.png
│ ├── frame_000072.png
│ ├── . . . . . . . . .
│ ├── . . . . . . . . .
│ └── frame_014625.png
├── labels
│ ├── frame_000063.txt
│ ├── frame_000072.txt
│ ├── . . . . . . . . .
│ ├── . . . . . . . . .
│ └── frame_014625.txt
└── status.json
The status.json is updated accordingly:
{"date": "7/14/2023", "time": "11:29:48", "status": "STARTED", "verbosity": "INFO", "message": "Starting SSD Inference."}
{"date": "7/14/2023", "time": "11:33:34", "status": "SUCCESS", "verbosity": "INFO", "message": "Inference finished successfully."}
However, things are not as straightforward when running an AutoML job. According to the blog post, all we need to do is copy the "job id" of the AutoML job and use it in the evaluation cells (although, in the article's defence, it uses the CLI method, not the API method). However, when I apply this method:
# run evaluation job
parent = automl_job_id
actions = ["evaluate"]
data = json.dumps({"job": parent, "actions": actions})
endpoint = f"{base_url}/model/{model_ID}/job"
response = requests.post(endpoint, headers=headers, data=data, verify=rootca)
print(response)
print(json.dumps(response.json(), sort_keys=True, indent=4))
I do not get beyond the Starting stage. I also notice that a worker pod with the job id is created, lives for around 15 seconds, completes, and then gets garbage collected. The job-id/status.json is stuck on:
{"date": "7/14/2023", "time": "12:58:34", "status": "STARTED", "verbosity": "INFO", "message": "Starting SSD evaluation."}
logs/job-id.txt shows an exit on error:
2023-07-14 12:58:28.674118: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
_init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
2023-07-14 12:58:34,717 [INFO] iva.ssd.utils.spec_loader: Merging specification from /shared/users/f2d3c55a-f3dd-5dff-badc-851e27460122/models/126c4600-a47d-4255-94d7-1b83946494a5/specs/5278eb86-68f7-4e12-96a1-08317fe5d903.yaml
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:95: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.
2023-07-14 12:58:34,719 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:95: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:98: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.
2023-07-14 12:58:34,719 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:98: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:102: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
2023-07-14 12:58:34,721 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:102: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
2023-07-14 12:58:34,721 [INFO] root: Starting SSD evaluation.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
2023-07-14 12:58:34,723 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
2023-07-14 12:58:34,730 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.
2023-07-14 12:58:34,744 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.
2023-07-14 12:58:35,168 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.
2023-07-14 12:58:35,310 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:181: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
2023-07-14 12:58:35,310 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:181: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:186: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
2023-07-14 12:58:35,310 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:186: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
2023-07-14 12:58:35,994 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.
2023-07-14 12:58:35,995 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.
2023-07-14 12:58:36,647 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
2023-07-14 12:58:37,003 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3295: The name tf.log is deprecated. Please use tf.math.log instead.
2023-07-14 12:58:37,005 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3295: The name tf.log is deprecated. Please use tf.math.log instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.
2023-07-14 12:58:37,574 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.
2023-07-14 12:58:37,688 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1607, in _create_c_op
c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape must be rank 4 but is rank 3 for 'decoded_predictions/batched_nms/CombinedNonMaxSuppression' (op: 'CombinedNonMaxSuppression') with input shapes: [?,1,4], [?,163], [], [], [], [].
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "</usr/local/lib/python3.6/dist-packages/iva/ssd/scripts/evaluate.py>", line 3, in <module>
File "<frozen iva.ssd.scripts.evaluate>", line 260, in <module>
File "<frozen iva.common.utils>", line 707, in return_func
File "<frozen iva.common.utils>", line 695, in return_func
File "<frozen iva.ssd.scripts.evaluate>", line 256, in main
File "<frozen iva.ssd.scripts.evaluate>", line 141, in evaluate
File "<frozen iva.ssd.builders.eval_builder>", line 35, in build
File "/usr/local/lib/python3.6/dist-packages/keras/engine/base_layer.py", line 457, in __call__
output = self.call(inputs, **kwargs)
File "<frozen iva.ssd.box_coder.output_decoder_layer>", line 115, in call
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/image_ops_impl.py", line 4029, in combined_non_max_suppression
score_threshold, pad_per_class, clip_boxes)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_image_ops.py", line 570, in combined_non_max_suppression
clip_boxes=clip_boxes, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1770, in __init__
control_input_ops)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1610, in _create_c_op
raise ValueError(str(e))
ValueError: Shape must be rank 4 but is rank 3 for 'decoded_predictions/batched_nms/CombinedNonMaxSuppression' (op: 'CombinedNonMaxSuppression') with input shapes: [?,1,4], [?,163], [], [], [], [].
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: __init__() missing 4 required positional arguments: 'code', 'msg', 'hdrs', and 'fp'
Execution status: FAIL
EOF
The normal jobs and the AutoML jobs, and the subsequent evaluation and inference attempts, were run on the same dataset, model, and hardware.
My questions are:
- Is the output of the successful (non-AutoML) evaluation correct? I only got the mAP value, not the per-class APs.
- Is there a way to run eval/inference on an AutoML job without making a non-AutoML run with the best specs from the AutoML run?
- What is your recommendation (e.g. always do a normal (non-AutoML) training run with the best specs and use that for eval/train/prune/convert/inference etc.)?
Cheers,
Ganindu.