Mask-RCNN int8 Version Results in Poor Performance

As suggested, I was able to install the ngc CLI. I then downloaded the resnet50 weights and trained with them. However, when I try to train with the resnet34 and resnet18 backbones, I get the following error:

INFO:tensorflow:Done calling model_fn.
[INFO] Total size of new array must be unchanged for block_1a_conv_1/kernel lh_shape: [(3, 3, 64, 64)], rh_shape: [(1, 1, 64, 64)]
Parsing Inputs...
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 254, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 250, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 237, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 88, in run_executer
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 418, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 713, in __init__
    h.begin()
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/hooks/pretrained_restore_hook.py", line 208, in begin
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/hooks/pretrained_restore_hook.py", line 113, in assign_from_checkpoint
ValueError: Total size of new array must be unchanged for block_1a_conv_1/kernel lh_shape: [(3, 3, 64, 64)], rh_shape: [(1, 1, 64, 64)]

Entire output:
log.txt (20.3 KB)

Make sure you download the corresponding resnet34 weights if you train a resnet34 backbone. You should not use the resnet50 weights.
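
For reference, the matching backbone weights can be pulled with the NGC CLI along these lines (a sketch; the model path nvidia/tao/pretrained_instance_segmentation and its version tags are assumptions to verify against the NGC catalog):

!ngc registry model list nvidia/tao/pretrained_instance_segmentation:*
!ngc registry model download-version nvidia/tao/pretrained_instance_segmentation:resnet34 \
    --dest $LOCAL_EXPERIMENT_DIR/pretrained_resnet34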

I was able to successfully generate an unpruned int8 model that worked on par with fp32 and fp16 by using a GPU with more memory and adding "-s" in the export command. I also doubled the number of training images by flipping each original image horizontally, and used that combined set as the calibration image directory.
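
For reference, the flipped calibration copies could be produced along these lines (a sketch assuming ImageMagick's convert is available and a hypothetical host-side $LOCAL_DATA_DIR that maps to $DATA_DOWNLOAD_DIR; -flop is a horizontal flip):

!mkdir -p $LOCAL_DATA_DIR/v1_clean/train-cal/images
!cp $LOCAL_DATA_DIR/v1_clean/train/images/* $LOCAL_DATA_DIR/v1_clean/train-cal/images/
!cd $LOCAL_DATA_DIR/v1_clean && for f in train/images/*; do convert "$f" -flop "train-cal/images/hflip_$(basename "$f")"; done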

However, when I retrained the pruned model and then tried to generate an int8 model, the resulting int8 model does not detect anything and performs poorly on both the train and test sets. In contrast, the retrained model itself performed on par with the unpruned model.

Command used to generate the pruned int8 model:

%env NUM_STEP=16000
!mkdir -p $LOCAL_EXPERIMENT_DIR/experiment5/export_int
# Remove existing etlt file
# !rm -f $LOCAL_EXPERIMENT_DIR/experiment_dir_retrain/model.step-$NUM_STEP.etlt
!tao mask_rcnn export -m $USER_EXPERIMENT_DIR/experiment5/retrain/fmodel.step-$NUM_STEP.tlt \
                      -k $KEY \
                      -e $SPECS_DIR/wisrd-v0-mask-rcnn_train_resnet50-v5-prune.txt \
                      --batch_size 1 \
                      --data_type int8 \
                      --gpu_index 0 \
                      --engine_file $USER_EXPERIMENT_DIR/experiment5/export_int/trt.int8.engine \
                      --cal_image_dir $DATA_DOWNLOAD_DIR/v1_clean/train-cal/images \
                      --batches 628 \
                      --cal_cache_file $USER_EXPERIMENT_DIR/experiment5/export_int/maskrcnn.cal \
                      --cal_data_file $USER_EXPERIMENT_DIR/experiment5/export_int/maskrcnn.tensorfile \
                      --strict_type_constraints
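
As a quick sanity check on the command above, --batches × --batch_size should match the number of images in --cal_image_dir; here 628 × 1 corresponds to the 314 originals plus 314 flipped copies. A sketch, again assuming a hypothetical host-side $LOCAL_DATA_DIR:

!ls $LOCAL_DATA_DIR/v1_clean/train-cal/images | wc -l
# expected: 628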

export output:

export_output.txt (81.6 KB)
pruned spec file:
pruned_spec_file.txt (2.2 KB)

Could you please help me in resolving this error?

How about the fp16 engine or fp32 engine?

Both the fp32 and fp16 engines work on par with the original model.

To narrow this down, can you change the retrained model above to the unpruned model and export an int8 model, to check what the result is? Please note that the training images should be set in “--cal_image_dir”.

The int8 model from the unpruned version works on par with the original model. Do you mean I should try generating the int8 model from the unpruned model again?

I created a separate directory called train-cal, where I store my original training images as well as horizontally flipped copies of the training set. I set --cal_image_dir to the train-cal directory path.

So, to be clear,

unpruned model + original training images + export to int8: works
pruned model + original training images and horizontally flipped images + export to int8: does not work

Is that right?

How about

pruned model + original training images + export to int8?

unpruned model + original training images and horizontally flipped images + export to int8: works
pruned model + original training images and horizontally flipped images + export to int8: does not work

pruned model + original training images + export to int8: does not work
pruned model + original training images + horizontally flipped + vertically flipped + 90-degree rotated images + export to int8: does not work

Let us focus on this experiment.

  • Did you keep the pruning log?
  • After pruning, did you run retraining?
  • After retraining, how is the "mask_rcnn evaluate xxx" result with the .tlt model?
  • And when you run “mask_rcnn export xxx”, please keep in mind that “--cal_image_dir” should contain the images you used to run the retraining.
  • Did you keep the pruning log?
    – I regenerated the log file by pruning the model again.
    pruned_log.txt (3.8 KB)

  • After pruning, did you run retraining?
    – Yes, log file:
    pruned_retraining_log.txt (4.7 MB)

  • After retraining, how is the "mask_rcnn evaluate xxx" result with the .tlt model?
    – It was similar to the unpruned model.
    evaluate_pruned_retrained.txt (70.3 KB)

  • And when you run “mask_rcnn export xxx”, please keep in mind that “--cal_image_dir” should contain the images you used to run the retraining.
    – I tried to generate the int8 model again with just the training images that I used for training the pruned/unpruned models. It still does not work.

Command used:

%env NUM_STEP=16000
!mkdir -p $LOCAL_EXPERIMENT_DIR/experiment5/export_int
# Remove existing etlt file
# !rm -f $LOCAL_EXPERIMENT_DIR/experiment_dir_retrain/model.step-$NUM_STEP.etlt
!tao mask_rcnn export -m $USER_EXPERIMENT_DIR/experiment5/retrain/fmodel.step-$NUM_STEP.tlt \
                      -k $KEY \
                      -e $SPECS_DIR/wisrd-v0-mask-rcnn_train_resnet50-v5-prune.txt \
                      --batch_size 1 \
                      --data_type int8 \
                      --gpu_index 3 \
                      --engine_file $USER_EXPERIMENT_DIR/experiment5/export_int/trt.int8.engine \
                      --cal_image_dir $DATA_DOWNLOAD_DIR/v1_clean/train/images \
                      --batches 314 \
                      --cal_cache_file $USER_EXPERIMENT_DIR/experiment5/export_int/maskrcnn.cal \
                      --cal_data_file $USER_EXPERIMENT_DIR/experiment5/export_int/maskrcnn.tensorfile \
                      --strict_type_constraints

Output:
export_output.txt (81.6 KB)
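
One thing that may be worth checking here: both export runs point --cal_cache_file at the same maskrcnn.cal, and if an existing calibration cache is found it may be reused instead of being regenerated from the new image set (whether your TAO version reuses it is an assumption to verify). The cache is a plain-text file of per-tensor scales, so it can be inspected, and deleted before re-exporting:

!head -n 5 $LOCAL_EXPERIMENT_DIR/experiment5/export_int/maskrcnn.cal
# Remove the stale cache and tensorfile, then rerun the export
!rm -f $LOCAL_EXPERIMENT_DIR/experiment5/export_int/maskrcnn.cal \
       $LOCAL_EXPERIMENT_DIR/experiment5/export_int/maskrcnn.tensorfile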

Inference:

!tao mask_rcnn inference -i $DATA_DOWNLOAD_DIR/v1_clean/test/images \
                         -o $USER_EXPERIMENT_DIR/experiment5/test_images_pruned_int8-train \
                         -e $SPECS_DIR/wisrd-v0-mask-rcnn_train_resnet50-v5-prune.txt \
                         -m $USER_EXPERIMENT_DIR/experiment5/export_int/trt.int8.engine \
                         -l $USER_EXPERIMENT_DIR/experiment5/e2_wisrd_annotated_labels \
                         -c $SPECS_DIR/wisrd_labels.txt \
                         -t 0.2 \
                         -k $KEY \
                         --include_mask

output:

2022-06-18 01:59:20,200 [INFO] root: Registry: ['nvcr.io']
2022-06-18 01:59:20,466 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
2022-06-18 01:59:20,491 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/vigneshs/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
2022-06-18 05:59:35,360 [INFO] root: Starting MaskRCNN inference.
2022-06-18 05:59:35,360 [INFO] iva.mask_rcnn.utils.spec_loader: Loading specification from /workspace/tao-experiments/mask_rcnn/specs/wisrd-v0-mask-rcnn_train_resnet50-v5-prune.txt
100%|███████████████████████████████████████████| 77/77 [00:13<00:00,  5.54it/s]
2022-06-18 05:59:51,076 [INFO] root: Inference finished successfully.
2022-06-18 01:59:54,309 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I just noticed that the inference was really fast, as if it were just reading the images rather than processing them through the trained model.

How about running inference against the training images?

Same as with the test set, it did not detect anything in any of the images in the training set. Again, the inference was really fast, as if it were just reading the images rather than processing them through the model.

Could you try a lower threshold?

I reduced it to 0.1 and there was no effect; there is still no detection of any sort in any of the images. Is it possible for the inference command to skip processing the images through the model?

No, it is not.

Could you deploy the pruned .etlt model in deepstream or triton-app and run it in int8 mode?

Refer to
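
For reference, a minimal nvinfer configuration along these lines runs an .etlt model in int8 under deepstream (a sketch modeled on the deepstream_tao_apps MaskRCNN samples; the blob names, input dims, class count, and paths are assumptions to adapt to your model):

[property]
# pruned, retrained etlt and the calibration cache produced by tao export
tlt-encoded-model=./model.step-16000.etlt
tlt-model-key=<your $KEY>
int8-calib-file=./maskrcnn.cal
# network-mode: 0=fp32, 1=int8, 2=fp16
network-mode=1
# instance segmentation network with mask output
network-type=3
output-instance-mask=1
num-detected-classes=2
infer-dims=3;832;1344
uff-input-blob-name=Input
output-blob-names=generate_detections;mask_fcn_logits/BiasAdd
parse-bbox-instance-mask-func-name=NvDsInferParseCustomMrcnnTLTV2
custom-lib-path=/opt/nvidia/deepstream/deepstream/lib/libnvds_infercustomparser_tao.so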

BTW, would you please share the 314 images with me so I can reproduce the issue?

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one.
Thanks
