Mask-RCNN int8 Version Results in Poor Performance

As suggested, I was able to install the ngc CLI. I then downloaded the resnet50 weights and trained with them. However, when I try to train with the resnet34 and resnet18 backbones, I get the following error:

INFO:tensorflow:Done calling model_fn.
[INFO] Total size of new array must be unchanged for block_1a_conv_1/kernel lh_shape: [(3, 3, 64, 64)], rh_shape: [(1, 1, 64, 64)]
Parsing Inputs...
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 254, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 250, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 237, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 88, in run_executer
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 418, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 713, in __init__
    h.begin()
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/hooks/pretrained_restore_hook.py", line 208, in begin
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/hooks/pretrained_restore_hook.py", line 113, in assign_from_checkpoint
ValueError: Total size of new array must be unchanged for block_1a_conv_1/kernel lh_shape: [(3, 3, 64, 64)], rh_shape: [(1, 1, 64, 64)]

Entire output:
log.txt (20.3 KB)

Make sure you download the corresponding resnet34 weights if you train a resnet34 backbone. You should not use the resnet50 weights.
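
For reference, the matching backbone weights can be pulled with the NGC CLI along these lines (a sketch; the model path nvidia/tao/pretrained_instance_segmentation and its version tags are assumptions to verify against the NGC catalog):

!ngc registry model list nvidia/tao/pretrained_instance_segmentation:*
!ngc registry model download-version nvidia/tao/pretrained_instance_segmentation:resnet34 \
    --dest $LOCAL_EXPERIMENT_DIR/pretrained_resnet34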

I was able to successfully generate an unpruned int8 model that worked on par with fp32 and fp16 by using a GPU with more memory and adding "-s" in the export command. I also doubled the number of training images by flipping each original image horizontally, and used that combined set as the calibration image directory.
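
For reference, the flipped calibration copies could be produced along these lines (a sketch assuming ImageMagick's convert is available and a hypothetical host-side $LOCAL_DATA_DIR that maps to $DATA_DOWNLOAD_DIR; -flop is a horizontal flip):

!mkdir -p $LOCAL_DATA_DIR/v1_clean/train-cal/images
!cp $LOCAL_DATA_DIR/v1_clean/train/images/* $LOCAL_DATA_DIR/v1_clean/train-cal/images/
!cd $LOCAL_DATA_DIR/v1_clean && for f in train/images/*; do convert "$f" -flop "train-cal/images/hflip_$(basename "$f")"; done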

However, when I retrained the pruned model and then tried to generate an int8 model, the resulting int8 model does not detect anything and performs poorly on both the train and test sets. In contrast, the retrained model itself performed on par with the unpruned model.

Command used to generate the pruned int8 model:

%env NUM_STEP=16000
!mkdir -p $LOCAL_EXPERIMENT_DIR/experiment5/export_int
# Remove existing etlt file
# !rm -f $LOCAL_EXPERIMENT_DIR/experiment_dir_retrain/model.step-$NUM_STEP.etlt
!tao mask_rcnn export -m $USER_EXPERIMENT_DIR/experiment5/retrain/fmodel.step-$NUM_STEP.tlt \
                      -k $KEY \
                      -e $SPECS_DIR/wisrd-v0-mask-rcnn_train_resnet50-v5-prune.txt \
                      --batch_size 1 \
                      --data_type int8 \
                      --gpu_index 0 \
                      --engine_file $USER_EXPERIMENT_DIR/experiment5/export_int/trt.int8.engine \
                      --cal_image_dir $DATA_DOWNLOAD_DIR/v1_clean/train-cal/images \
                      --batches 628 \
                      --cal_cache_file $USER_EXPERIMENT_DIR/experiment5/export_int/maskrcnn.cal \
                      --cal_data_file $USER_EXPERIMENT_DIR/experiment5/export_int/maskrcnn.tensorfile \
                      --strict_type_constraints
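
As a quick sanity check on the command above, --batches × --batch_size should match the number of images in --cal_image_dir; here 628 × 1 corresponds to the 314 originals plus 314 flipped copies. A sketch, again assuming a hypothetical host-side $LOCAL_DATA_DIR:

!ls $LOCAL_DATA_DIR/v1_clean/train-cal/images | wc -l
# expected: 628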

export output:

export_output.txt (81.6 KB)
pruned spec file:
pruned_spec_file.txt (2.2 KB)

Could you please help me in resolving this error?

How about the fp16 engine or fp32 engine?

Both the fp32 and fp16 engines work on par with the original model.

To narrow this down, can you change the retrained model above to the unpruned model and export an int8 model, to check what the result is? Please note that the training images should be set in “--cal_image_dir”.

The int8 model from the unpruned version works on par with the original model. Do you mean I should try generating the int8 model from the unpruned model again?

I created a separate directory called train-cal, where I store my original training images as well as horizontally flipped copies of the training set. I set --cal_image_dir to the train-cal directory path.

So, to be clear,

unpruned model + original training images + export to int8: works
pruned model + original training images and horizontally flipped images + export to int8: does not work

Is that right?

How about

pruned model + original training images + export to int8?

unpruned model + original training images and horizontally flipped images + export to int8: works
pruned model + original training images and horizontally flipped images + export to int8: does not work

pruned model + original training images + export to int8: does not work
pruned model + original training images + horizontally flipped + vertically flipped + 90-degree rotated images + export to int8: does not work

Let us focus on this experiment.

  • Did you keep the pruning log?
  • After pruning, did you run retraining?
  • After retraining, how is the "mask_rcnn evaluate xxx" result with the .tlt model?
  • And when you run “mask_rcnn export xxx”, please keep in mind that “--cal_image_dir” should contain the images you used to run the retraining.
  • Did you keep the pruning log?
    – I regenerated the log file by pruning the model again.
    pruned_log.txt (3.8 KB)

  • After pruning, did you run retraining?
    – Yes, log file:
    pruned_retraining_log.txt (4.7 MB)

  • After retraining, how is the "mask_rcnn evaluate xxx" result with the .tlt model?
    – It was similar to the unpruned model.
    evaluate_pruned_retrained.txt (70.3 KB)

  • And when you run “mask_rcnn export xxx”, please keep in mind that “--cal_image_dir” should contain the images you used to run the retraining.
    – I tried to generate the int8 model again with just the training images that I used for training the pruned/unpruned models. It still does not work.

Command used:

%env NUM_STEP=16000
!mkdir -p $LOCAL_EXPERIMENT_DIR/experiment5/export_int
# Remove existing etlt file
# !rm -f $LOCAL_EXPERIMENT_DIR/experiment_dir_retrain/model.step-$NUM_STEP.etlt
!tao mask_rcnn export -m $USER_EXPERIMENT_DIR/experiment5/retrain/fmodel.step-$NUM_STEP.tlt \
                      -k $KEY \
                      -e $SPECS_DIR/wisrd-v0-mask-rcnn_train_resnet50-v5-prune.txt \
                      --batch_size 1 \
                      --data_type int8 \
                      --gpu_index 3 \
                      --engine_file $USER_EXPERIMENT_DIR/experiment5/export_int/trt.int8.engine \
                      --cal_image_dir $DATA_DOWNLOAD_DIR/v1_clean/train/images \
                      --batches 314 \
                      --cal_cache_file $USER_EXPERIMENT_DIR/experiment5/export_int/maskrcnn.cal \
                      --cal_data_file $USER_EXPERIMENT_DIR/experiment5/export_int/maskrcnn.tensorfile \
                      --strict_type_constraints

Output:
export_output.txt (81.6 KB)
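
One thing that may be worth checking here: both export runs point --cal_cache_file at the same maskrcnn.cal, and if an existing calibration cache is found it may be reused instead of being regenerated from the new image set (whether your TAO version reuses it is an assumption to verify). The cache is a plain-text file of per-tensor scales, so it can be inspected, and deleted before re-exporting:

!head -n 5 $LOCAL_EXPERIMENT_DIR/experiment5/export_int/maskrcnn.cal
# Remove the stale cache and tensorfile, then rerun the export
!rm -f $LOCAL_EXPERIMENT_DIR/experiment5/export_int/maskrcnn.cal \
       $LOCAL_EXPERIMENT_DIR/experiment5/export_int/maskrcnn.tensorfile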

Inference:

!tao mask_rcnn inference -i $DATA_DOWNLOAD_DIR/v1_clean/test/images \
                         -o $USER_EXPERIMENT_DIR/experiment5/test_images_pruned_int8-train \
                         -e $SPECS_DIR/wisrd-v0-mask-rcnn_train_resnet50-v5-prune.txt \
                         -m $USER_EXPERIMENT_DIR/experiment5/export_int/trt.int8.engine \
                         -l $USER_EXPERIMENT_DIR/experiment5/e2_wisrd_annotated_labels \
                         -c $SPECS_DIR/wisrd_labels.txt \
                         -t 0.2 \
                         -k $KEY \
                         --include_mask

output:

2022-06-18 01:59:20,200 [INFO] root: Registry: ['nvcr.io']
2022-06-18 01:59:20,466 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
2022-06-18 01:59:20,491 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/vigneshs/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
2022-06-18 05:59:35,360 [INFO] root: Starting MaskRCNN inference.
2022-06-18 05:59:35,360 [INFO] iva.mask_rcnn.utils.spec_loader: Loading specification from /workspace/tao-experiments/mask_rcnn/specs/wisrd-v0-mask-rcnn_train_resnet50-v5-prune.txt
100%|███████████████████████████████████████████| 77/77 [00:13<00:00,  5.54it/s]
2022-06-18 05:59:51,076 [INFO] root: Inference finished successfully.
2022-06-18 01:59:54,309 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I just noticed that the inference was really fast, as if it were just reading the images rather than processing them through the trained model.

How about running inference against the training images?

Same as with the test set, it did not detect anything in any of the images in the training set. Again, the inference was really fast, as if it were just reading the images rather than processing them through the model.

Could you try a lower threshold?

I reduced it to 0.1 and there was no effect; there is still no detection of any sort in any of the images. Is it possible for the inference command to skip processing the images through the model?

No, it is not.

Could you deploy the pruned .etlt model in deepstream or triton-app and run it in int8 mode?

Refer to
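
For reference, a minimal nvinfer configuration along these lines runs an .etlt model in int8 under deepstream (a sketch modeled on the deepstream_tao_apps MaskRCNN samples; the blob names, input dims, class count, and paths are assumptions to adapt to your model):

[property]
# pruned, retrained etlt and the calibration cache produced by tao export
tlt-encoded-model=./model.step-16000.etlt
tlt-model-key=<your $KEY>
int8-calib-file=./maskrcnn.cal
# network-mode: 0=fp32, 1=int8, 2=fp16
network-mode=1
# instance segmentation network with mask output
network-type=3
output-instance-mask=1
num-detected-classes=2
infer-dims=3;832;1344
uff-input-blob-name=Input
output-blob-names=generate_detections;mask_fcn_logits/BiasAdd
parse-bbox-instance-mask-func-name=NvDsInferParseCustomMrcnnTLTV2
custom-lib-path=/opt/nvidia/deepstream/deepstream/lib/libnvds_infercustomparser_tao.so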

BTW, would you please share the 314 images with me so I can reproduce the issue?

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one.
Thanks
