Retraining a pruned EfficientDet-D0 model leads to an error when restoring variables

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
V100
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
EfficientDet-D0
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
TAO 4.0
• Training spec file(If have, please share here)
Attached spec_train.yaml and spec_retrain.yaml
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
spec_train.yaml (2.4 KB)
spec_retrain.yaml (3.1 KB)
efficientdet_d0_error_log.txt (133.5 KB)

The EfficientDet-D0 model was trained on the COCO dataset using spec_train.yaml, and the mAP scores were reasonable. The model was then pruned (also using spec_train.yaml), and pruning completed successfully. However, when the pruned model was loaded for retraining with spec_retrain.yaml, it failed with the error below. Any help is appreciated.

Error snippet:

==================================================================================================
Total params: 3,850,940
Trainable params: 285,264
Non-trainable params: 3,565,676
__________________________________________________________________________________________________
LR schedule method: cosine
Use SGD optimizer
Resume training...
Received incompatible tensor with shape (32,) when attempting to restore variable with shape (40,) and name layer_with_weights-108/bias/.ATTRIBUTES/VARIABLE_VALUE.
Error executing job with overrides: []
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 88, in _patched_hydra_run
    return PatchHydra._original_hydra_run(self, config_name, task_function, overrides, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 170, in _patched_task_function
    return task_function(a_config, *a_args, **a_kwargs)
  File "<frozen cv.efficientdet.scripts.train>", line 229, in main
  File "<frozen common.decorators>", line 76, in _func
  File "<frozen common.decorators>", line 49, in _func
  File "<frozen cv.efficientdet.scripts.train>", line 184, in run_experiment
  File "<frozen cv.efficientdet.utils.keras_utils>", line 110, in restore_ckpt
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/saving/saveable_object_util.py", line 134, in restore
    raise ValueError(
ValueError: Received incompatible tensor with shape (32,) when attempting to restore variable with shape (40,) and name layer_with_weights-108/bias/.ATTRIBUTES/VARIABLE_VALUE.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_tf2/cv/efficientdet/scripts/train.py>", line 3, in <module>
  File "<frozen cv.efficientdet.scripts.train>", line 233, in <module>
  File "<frozen common.hydra.hydra_runner>", line 87, in wrapper
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError
Sending telemetry data.
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2022-12-23 17:39:08,444 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Could you set a new results folder and retry? For example:
-d <new output_dir>

I changed the results_dir in the spec_retrain.yaml and it worked. Thank you!

Could you explain a bit more how you arrived at this solution from the logs?

The log shows "Resume training...", which means the retraining job found an existing checkpoint in the results folder and tried to restore it. Because the training and retraining jobs shared the same results folder, the checkpoint it picked up does not match the layer shapes of the model being retrained, hence the restore error. Pointing retraining at a fresh results folder avoids this.
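
As a minimal illustration of what goes wrong (this is a standalone TensorFlow sketch, not TAO code; the single bias variable, its shapes, and the temporary directories are made up for the example):

import os
import tempfile

import tensorflow as tf

# Checkpoint left over from the earlier run: the bias of one layer has 32 entries.
old_dir = tempfile.mkdtemp()
old = tf.train.Checkpoint(bias=tf.Variable(tf.zeros([32])))
ckpt_path = old.save(os.path.join(old_dir, "ckpt"))

# The model rebuilt for retraining has 40 entries in the corresponding bias.
new = tf.train.Checkpoint(bias=tf.Variable(tf.zeros([40])))

try:
    # Shapes disagree, so TensorFlow raises the same kind of ValueError as in the log:
    # "Received incompatible tensor with shape (32,) when attempting to restore
    #  variable with shape (40,) ..."
    new.restore(ckpt_path)
except ValueError as err:
    print(err)

# A fresh, empty results folder has no checkpoint at all, so there is nothing
# to resume from and retraining starts cleanly from the pruned model.
fresh_dir = tempfile.mkdtemp()
print(tf.train.latest_checkpoint(fresh_dir))  # -> None, so no "Resume training..."

If needed, the shapes stored in a checkpoint can be listed with tf.train.load_checkpoint(<checkpoint path>).get_variable_to_shape_map() to see exactly which variables disagree with the rebuilt model.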

