Retraining a pruned EfficientDet-D0 model leads to an error when restoring variables

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
V100
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
EfficientDet-D0
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
TAO 4.0
• Training spec file(If have, please share here)
Attached spec_train.yaml and spec_retrain.yaml
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
spec_train.yaml (2.4 KB)
spec_retrain.yaml (3.1 KB)
efficientdet_d0_error_log.txt (133.5 KB)

The EfficientDet-D0 model was trained on the COCO dataset using spec_train.yaml, and the mAP scores were reasonable. The model was then pruned (also using spec_train.yaml), and pruning completed successfully. However, when the pruned model was loaded for retraining with spec_retrain.yaml, it failed with the error below. Any help is appreciated.

Error snippet:

==================================================================================================
Total params: 3,850,940
Trainable params: 285,264
Non-trainable params: 3,565,676
__________________________________________________________________________________________________
LR schedule method: cosine
Use SGD optimizer
Resume training...
Received incompatible tensor with shape (32,) when attempting to restore variable with shape (40,) and name layer_with_weights-108/bias/.ATTRIBUTES/VARIABLE_VALUE.
Error executing job with overrides: []
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 88, in _patched_hydra_run
    return PatchHydra._original_hydra_run(self, config_name, task_function, overrides, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 170, in _patched_task_function
    return task_function(a_config, *a_args, **a_kwargs)
  File "<frozen cv.efficientdet.scripts.train>", line 229, in main
  File "<frozen common.decorators>", line 76, in _func
  File "<frozen common.decorators>", line 49, in _func
  File "<frozen cv.efficientdet.scripts.train>", line 184, in run_experiment
  File "<frozen cv.efficientdet.utils.keras_utils>", line 110, in restore_ckpt
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saving/saveable_object_util.py", line 134, in restore
    raise ValueError(
ValueError: Received incompatible tensor with shape (32,) when attempting to restore variable with shape (40,) and name layer_with_weights-108/bias/.ATTRIBUTES/VARIABLE_VALUE.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_tf2/cv/efficientdet/scripts/train.py>", line 3, in <module>
  File "<frozen cv.efficientdet.scripts.train>", line 233, in <module>
  File "<frozen common.hydra.hydra_runner>", line 87, in wrapper
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError
Sending telemetry data.
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2022-12-23 17:39:08,444 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Could you set a new results folder and retry? For example:
-d <new output_dir>

I changed the results_dir in the spec_retrain.yaml and it worked. Thank you!

Could you explain a bit more how you arrived at this solution from the logs?

The log shows "Resume training...", which means the retraining job found an existing checkpoint in the results folder and tried to restore it. Because the training and retraining jobs shared the same results folder, the checkpoint it picked up does not match the layer shapes of the model being retrained, hence the restore error. Pointing retraining at a fresh results folder avoids this.
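
As a minimal illustration of what goes wrong (this is a standalone TensorFlow sketch, not TAO code; the single bias variable, its shapes, and the temporary directories are made up for the example):

import os
import tempfile

import tensorflow as tf

# Checkpoint left over from the earlier run: the bias of one layer has 32 entries.
old_dir = tempfile.mkdtemp()
old = tf.train.Checkpoint(bias=tf.Variable(tf.zeros([32])))
ckpt_path = old.save(os.path.join(old_dir, "ckpt"))

# The model rebuilt for retraining has 40 entries in the corresponding bias.
new = tf.train.Checkpoint(bias=tf.Variable(tf.zeros([40])))

try:
    # Shapes disagree, so TensorFlow raises the same kind of ValueError as in the log:
    # "Received incompatible tensor with shape (32,) when attempting to restore
    #  variable with shape (40,) ..."
    new.restore(ckpt_path)
except ValueError as err:
    print(err)

# A fresh, empty results folder has no checkpoint at all, so there is nothing
# to resume from and retraining starts cleanly from the pruned model.
fresh_dir = tempfile.mkdtemp()
print(tf.train.latest_checkpoint(fresh_dir))  # -> None, so no "Resume training..."

If needed, the shapes stored in a checkpoint can be listed with tf.train.load_checkpoint(<checkpoint path>).get_variable_to_shape_map() to see exactly which variables disagree with the rebuilt model.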

