Error retraining the pruned Mask RCNN model with TAO Toolkit

• Hardware: RTX 3090
• Network Type: Mask RCNN
• TAO version
format_version: 3.0
toolkit_version: 5.0.0
published_date: 07/14/2023

• Training spec file
Training:
maskrcnn_resnet.txt (2.1 KB)
Pruning:
maskrcnn_resnet_prune.txt (1.9 KB)

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

Training works fine, but when I try to retrain my pruned model, I get this error:
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/executer/distributed_executer.py", line 336, in
trainable_ckpts = [int(item.split('.')[1].split('-')[1])
IndexError: list index out of range

Command for prune:

tao model mask_rcnn prune -m ${OUTPUT_DIR}/model.epoch-40.tlt \
                     -k $KEY \
                     -o ${OUTPUT_DIR}/prune \
                     -eq union \
                     -pg 8 \
                     -nf 16 \
                     -pth 0.1 \
                     --log_file ${OUTPUT_DIR}/prune/prune_log_file.txt

Command for retraining:

tao model mask_rcnn train -e ${OUTPUT_DIR}/prune/maskrcnn_resnet_prune.txt \
    -d ${OUTPUT_DIR}/prune \
    -k $KEY \
    --gpus 1 \
    --log_file ${OUTPUT_DIR}/prune/log_file.txt

log file for prune:
prune_log_file.txt (6.2 KB)
log file for retrain:
log_file.txt (2.9 KB)

There has been no update from you for a while, so we are assuming this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

The error comes from nvidia_tao_tf1/cv/mask_rcnn/executer/distributed_executer.py in the NVIDIA/tao_tensorflow1_backend repository on GitHub (main branch).
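To illustrate when the expression in the traceback raises IndexError, here is a minimal Python sketch. The filenames are hypothetical examples and parse_step is a helper name made up for this illustration; the log does not show which files the executer actually iterates over, so this is not the toolkit's own code path.

```python
def parse_step(filename):
    # Same expression as in the traceback: take the second dot-separated
    # field, then the part after its first '-'.
    return int(filename.split('.')[1].split('-')[1])

# Hypothetical filenames, for illustration only.
examples = ["model.epoch-40.tlt", "model.step-1000.tlt", "model.tlt"]

results = {}
for name in examples:
    try:
        results[name] = parse_step(name)
    except IndexError:
        # The second field has no '-', so split('-')[1] does not exist.
        results[name] = "IndexError"

for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

Any file whose second dot-separated field lacks a "-<number>" part fails this way. One possible cause here, which is an assumption and not confirmed in this thread, is that the pruned model and the retraining results share the ${OUTPUT_DIR}/prune directory.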

Can you reproduce this with the official notebook, notebooks/tao_launcher_starter_kit/mask_rcnn/maskrcnn.ipynb in the NVIDIA/tao_tutorials repository on GitHub (main branch)?

Since TAO is open source, you can also debug inside the Docker container:
$ tao model mask_rcnn run /bin/bash
Then find the source code under /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn

Inside the container, you can run commands without the leading "tao model". For example:
# mask_rcnn prune xxx
# mask_rcnn train xxx
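While inside the container, a short script along these lines could help check which files in a directory would break the parsing expression from the traceback. This is a sketch under assumptions: find_unparsable is a hypothetical helper, and the ".tlt" suffix filter is a guess at which files the executer considers, not something taken from the source.

```python
import os

def find_unparsable(directory, suffix=".tlt"):
    """Return names in `directory` ending with `suffix` for which the
    traceback's parsing expression raises IndexError or ValueError."""
    bad = []
    for item in sorted(os.listdir(directory)):
        if not item.endswith(suffix):
            continue  # assumption: only files with this suffix matter
        try:
            int(item.split('.')[1].split('-')[1])
        except (IndexError, ValueError):
            bad.append(item)
    return bad
```

Calling find_unparsable on the directory passed with -d (the expanded value of ${OUTPUT_DIR}/prune in the retraining command) would list candidate files to move elsewhere before retrying.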
