• Hardware (T4/V100/Xavier/Nano/etc) - T4
• Network Type - EfficientDet-TF2
• TAO Version - toolkit_version: 4.0.1
  dockers:
    nvidia/tao/tao-toolkit:
      4.0.0-tf2.9.1:
        docker_registry: nvcr.io
        tasks:
          1. classification_tf2
          2. efficientdet_tf2
• Training spec file (If have, please share here)
data:
  loader:
    prefetch_size: 4
    shuffle_file: True
  num_classes: 97
  image_size: '416x416'
  max_instances_per_image: 10
  train_tfrecords:
    - '/workspace/tao-experiments/data/train/tf_records/train-*'
  val_tfrecords:
    - '/workspace/tao-experiments/data/val/tf_records/val-*'
  val_json_file: '/workspace/tao-experiments/data/val/annotations.json'
train:
  checkpoint: "/workspace/tao-experiments/data/efficientdet_tf2/retail_detector_100.tlt"
  num_examples_per_epoch: 1000
model:
  name: 'efficientdet-d5'
key: 'nvidia_tlt'
results_dir: '/workspace/tao-experiments/efficientdet_tf2/experiment_dir_unpruned'
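For completeness, the container paths in this spec come from my launcher mounts; the relevant part of ~/.tao_mounts.json looks roughly like this (the host-side source paths are placeholders for the actual directories on my machine):
{
    "Mounts": [
        {
            "source": "/home/<user>/tao-experiments",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/<user>/tao-experiments/specs",
            "destination": "/workspace/tao-experiments/specs"
        }
    ]
}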
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)
Run tao efficientdet_tf2 train -e $SPECS_DIR/spec_train.yaml --gpus 1
Hi!
I’m trying to fine-tune this model.
I was able to convert my COCO dataset to TFRecords using tao, but when I try to run training, I get the following output:
2023-04-11 23:43:22,078 [INFO] root: Registry: ['nvcr.io']
2023-04-11 23:43:22,137 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf2.9.1
2023-04-11 23:43:27.981470: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[1681256610.283774] [ip-172-31-7-160:16 :f] vfs_fuse.c:424 UCX WARN failed to connect to vfs socket '': Invalid argument
2023-04-11 23:43:30,643 [WARNING] matplotlib: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-hxa8n1c4 because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2023-04-11 23:43:30,906 [INFO] matplotlib.font_manager: generated new fontManager
[1681256615.247612] [ip-172-31-7-160:324 :f] vfs_fuse.c:424 UCX WARN failed to connect to vfs socket '': Invalid argument
<frozen common.hydra.hydra_runner>:87: UserWarning:
'spec_train.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
Setting up communication with ClearML server.
ClearML task init failed with error ClearML configuration could not be found (missing `~/clearml.conf` or Environment CLEARML_API_HOST)
To get started with ClearML: setup your own `clearml-server`, or create a free account at https://app.clear.ml
Training will still continue.
Log file already exists at /workspace/tao-experiments/efficientdet_tf2/experiment_dir_unpruned/status.json
Starting efficientdet training.
WARNING:tensorflow:AutoGraph could not transform <function CocoDataset.__call__.<locals>._prefetch_dataset at 0x7f0f2171ef70> and will run it as-is.
Cause: Unable to locate the source code of <function CocoDataset.__call__.<locals>._prefetch_dataset at 0x7f0f2171ef70>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function CocoDataset.__call__.<locals>.<lambda> at 0x7f0f2282d1f0> and will run it as-is.
Cause: could not parse the source code of <function CocoDataset.__call__.<locals>.<lambda> at 0x7f0f2282d1f0>: no matching AST found among candidates:
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
target_size = (416, 416), output_size = (416, 416)
WARNING:tensorflow:AutoGraph could not transform <function CocoDataset.__call__.<locals>.<lambda> at 0x7f0f203d5040> and will run it as-is.
Cause: could not parse the source code of <function CocoDataset.__call__.<locals>.<lambda> at 0x7f0f203d5040>: no matching AST found among candidates:
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function CocoDataset.__call__.<locals>._prefetch_dataset at 0x7f0f203d5310> and will run it as-is.
Cause: Unable to locate the source code of <function CocoDataset.__call__.<locals>._prefetch_dataset at 0x7f0f203d5310>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function CocoDataset.__call__.<locals>.<lambda> at 0x7f0f203d5430> and will run it as-is.
Cause: could not parse the source code of <function CocoDataset.__call__.<locals>.<lambda> at 0x7f0f203d5430>: no matching AST found among candidates:
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function CocoDataset.__call__.<locals>.<lambda> at 0x7f0f203d5550> and will run it as-is.
Cause: could not parse the source code of <function CocoDataset.__call__.<locals>.<lambda> at 0x7f0f203d5550>: no matching AST found among candidates:
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
Building unpruned graph...
WARNING:tensorflow:AutoGraph could not transform <bound method ImageResizeLayer.call of <nvidia_tao_tf2.cv.efficientdet.layers.image_resize_layer.ImageResizeLayer object at 0x7f0e90644e80>> and will run it as-is.
Cause: Unable to locate the source code of <bound method ImageResizeLayer.call of <nvidia_tao_tf2.cv.efficientdet.layers.image_resize_layer.ImageResizeLayer object at 0x7f0e90644e80>>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <bound method WeightedFusion.call of <nvidia_tao_tf2.cv.efficientdet.layers.weighted_fusion_layer.WeightedFusion object at 0x7f0f2003c880>> and will run it as-is.
Cause: Unable to locate the source code of <bound method WeightedFusion.call of <nvidia_tao_tf2.cv.efficientdet.layers.weighted_fusion_layer.WeightedFusion object at 0x7f0f2003c880>>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
/usr/local/lib/python3.8/dist-packages/keras/backend.py:450: UserWarning: `tf.keras.backend.set_learning_phase` is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the `training` argument of the `__call__` method of your layer or model.
warnings.warn('`tf.keras.backend.set_learning_phase` is deprecated and '
"The indicated 'retail_detector_100.tlt' artifact does not exist in the '/workspace/tao-experiments/data/efficientdet_tf2/retail_detector_100.tlt' registry"
Error executing job with overrides: []
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 368, in <lambda>
lambda: hydra.run(
File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 88, in _patched_hydra_run
return PatchHydra._original_hydra_run(self, config_name, task_function, overrides, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 110, in run
_ = ret.return_value
File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 170, in _patched_task_function
return task_function(a_config, *a_args, **a_kwargs)
File "<frozen cv.efficientdet.scripts.train>", line 229, in main
File "<frozen common.decorators>", line 76, in _func
File "<frozen common.decorators>", line 49, in _func
File "<frozen cv.efficientdet.scripts.train>", line 108, in run_experiment
File "<frozen cv.efficientdet.utils.helper>", line 61, in decode_eff
File "<frozen eff.core.archive>", line 544, in restore_artifact
KeyError: "The indicated 'retail_detector_100.tlt' artifact does not exist in the '/workspace/tao-experiments/data/efficientdet_tf2/retail_detector_100.tlt' registry"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_tf2/cv/efficientdet/scripts/train.py>", line 3, in <module>
File "<frozen cv.efficientdet.scripts.train>", line 233, in <module>
File "<frozen common.hydra.hydra_runner>", line 87, in wrapper
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 367, in _run_hydra
run_and_report(
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 251, in run_and_report
assert mdl is not None
AssertionError
Sending telemetry data.
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-04-11 23:43:53,561 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
It looks like it can’t load the pre-trained .tlt model, but I don’t know why: I checked that the file is present inside the Docker container, and it is (a sketch of that kind of check is below).
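For reference, the check can be done along these lines (a sketch, not my exact command; I’m assuming the launcher’s run subcommand is available here, and the path is the train.checkpoint value from the spec above):
# List the checkpoint at the container path used in train.checkpoint;
# if the mounts are set up correctly this shows the file rather than "No such file or directory".
tao efficientdet_tf2 run ls -l /workspace/tao-experiments/data/efficientdet_tf2/retail_detector_100.tlt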
Do you have any suggestions?