Retraining Error after pruning the Mask RCNN model with TAO Toolkit

Please provide the following information when requesting support.

• Hardware RTX 3090
• Network Type Mask RCNN
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
Configuration of the TAO Toolkit Instance
dockers: [‘nvidia/tao/tao-toolkit-tf’, ‘nvidia/tao/tao-toolkit-pyt’, ‘nvidia/tao/tao-toolkit-lm’]
format_version: 2.0
toolkit_version: 3.21.11
published_date: 11/08/2021
• Training spec file (if you have one, please share it here)
The training spec file:
maskrcnn_train_resnet50.txt (2.0 KB)
The retrain spec file:
maskrcnn_retrain_resnet50.txt (2.0 KB)

• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)
I am trying Mask R-CNN instance segmentation with the TAO Toolkit.
I followed maskrcnn.ipynb to train on a 1-class dataset; the TFRecord dataset was generated from COCO format by these scripts:
preprocess_dataset.sh (4.5 KB)
create_inviol_tf_record.py (12.9 KB)

The first training and evaluation were basically OK.
When I finished pruning and started retraining the model, I ran:

!tao mask_rcnn train -e $SPECS_DIR/maskrcnn_retrain_resnet50.txt \
                     -d $USER_EXPERIMENT_DIR/experiment_dir_retrain \
                     -k $KEY \
                     --gpus 1

It failed with:

ValueError: Cannot reshape a tensor with 25690112 elements to shape [128,256,14,14] (6422528 elements) for 'mask_head_reshape_1/mask_head_reshape_1' (op: 'Reshape') with input shapes: [4,128,256,14,14], [4] and with input tensors computed as partial shapes: input[1] = [128,256,14,14].

The full log files:
log.txt (630 Bytes)
log_retrain_pruned_model.txt (15.7 KB)

The training spec and dataset are the same as in the previous training.

I am confused by this error; any help would be much appreciated.

Can you change the retraining spec and retry?
train_batch_size: 1
eval_batch_size: 1
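
For reference, here is a quick sanity check on the numbers in the error message (a minimal Python sketch; reading the leading 4 as the batch dimension is an assumption, not something stated in the log):

# Check the element counts quoted in the ValueError above (illustrative only).
from math import prod

input_shape = [4, 128, 256, 14, 14]    # input shape reported by the error
target_shape = [128, 256, 14, 14]      # shape mask_head_reshape_1 tries to produce

print(prod(input_shape))                         # 25690112
print(prod(target_shape))                        # 6422528
print(prod(input_shape) // prod(target_shape))   # 4, the extra leading dimension

The factor of 4 matches the extra leading dimension of the input tensor, which is why a batch-size mismatch between the retraining spec and the reshape target is the first thing to check.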

Also, your first log does not look right:

> Create EncryptCheckpointSaverHook.
> =================================
>      Start training cycle 01
> =================================
> 
> ***********************
> Loading model graph...
> ***********************
> [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_2/
> [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_3/
> [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_4/
> [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_5/
> [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_6/
> Job finished with an uncaught exception: `FAILURE`

Did you train it successfully?

Thanks, changing the batch size to 1 solves the issue.
So should the retraining settings be exactly the same as in the first training?

Btw, both log files are from the retraining after pruning. The first training was successful.

See MaskRCNN - NVIDIA Docs

Once the model has been pruned, there might be a decrease in accuracy. This happens because some previously useful weights may have been removed. To regain accuracy, NVIDIA recommends that you retrain this pruned model over the same dataset. To do this, run the tao mask_rcnn train command with an updated spec file that points to the newly pruned model by setting pruned_model_path.

Users are advised to turn off the regularizer during retraining. You may do this by setting the regularizer weights to 0 for both l1_weight_decay and l2_weight_decay.

The other parameters may be retained in the spec file from the previous training. train_batch_size and eval_batch_size must be kept unchanged.
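
As an illustration, the relevant part of the retraining spec might look like the fragment below (a minimal sketch; the model path is illustrative, while the field names are the ones named in the documentation quoted above):

# Retraining-specific fields in maskrcnn_retrain_resnet50.txt (illustrative values).
pruned_model_path: "/workspace/tao-experiments/maskrcnn/experiment_dir_pruned/model.tlt"
l1_weight_decay: 0.0
l2_weight_decay: 0.0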
