Mask-RCNN int8 Version Results in Poor Performance

I trained a Mask R-CNN model on 314 images and tested it on 77 images using the TAO framework. It performs well both quantitatively and qualitatively. However, when I generate the int8 model, it performs significantly worse than the fp32 and fp16 versions.

I generated two calibration files by passing two custom image directories to the tao export command. The first directory combined the train and test set images. The second directory contained all the images from the first, flipped horizontally, to double the amount of calibration data.
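For reference, here is a sketch of how such flipped copies can be generated, assuming ImageMagick is installed; `SRC` and `DST` are placeholder paths, not the actual directories used in the thread:

```shell
# Add a horizontally flipped copy of every calibration image (ImageMagick -flop).
SRC=./cal/images            # placeholder: original calibration images
DST=./cal/images_flipped    # placeholder: flipped copies land here
mkdir -p "$DST"
if command -v convert >/dev/null; then
    for f in "$SRC"/*.jpg; do
        # guard against an empty glob before converting
        [ -f "$f" ] && convert -flop "$f" "$DST/flip_$(basename "$f")"
    done
fi
ls "$DST" | wc -l   # number of flipped images produced
```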

Both files were then used to generate int8 engine files with tao converter, and both engines performed worse than the fp32 and fp16 versions. Below are the commands used:

Calibration file generation:

%env NUM_STEP=13000
!mkdir -p $LOCAL_EXPERIMENT_DIR/experiment/export_int

!tao mask_rcnn export -m $USER_EXPERIMENT_DIR/experiment/model.step-$NUM_STEP.tlt \
                      -k $KEY \
                      -o $USER_EXPERIMENT_DIR/experiment/model.step-$NUM_STEP.etlt \
                      -e $SPECS_DIR/mask-rcnn_train_resnet50.txt \
                      --batch_size 1 \
                      --gpu_index 0 \
                      --data_type int8 \
                      --cal_image_dir $USER_EXPERIMENT_DIR/data/v1_clean/train-cal/images \
                      --batches 381 \
                      --cal_cache_file $USER_EXPERIMENT_DIR/experiment/export_int/maskrcnnv.cal \
                      --cal_data_file $USER_EXPERIMENT_DIR/experiment/export_int/maskrcnn.tensorfile
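One thing worth double-checking: `--batches` times `--batch_size` should match the number of images available in `cal_image_dir`, or calibration may use fewer samples than intended. A quick sanity check, using a stand-in directory name (point `CAL_DIR` at the real `cal_image_dir` instead):

```shell
# Derive the --batches value from the calibration image count.
CAL_DIR=./train-cal-images   # stand-in for the real cal_image_dir
mkdir -p "$CAL_DIR"          # so the demo runs even without the real data
BATCH_SIZE=1
NUM_IMAGES=$(ls "$CAL_DIR" | wc -l)
echo "images: $NUM_IMAGES  ->  suggested --batches $((NUM_IMAGES / BATCH_SIZE))"
```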

Engine file generation:

!tao converter -k $KEY  \
                   -d 3,832,1344 \
                   -o generate_detections,mask_fcn_logits/BiasAdd \
                   -c /workspace/tao-experiments/mask_rcnn/experiment/export_int/maskrcnnv.cal \
                   -e $USER_EXPERIMENT_DIR/experiment/export_int/trt.int8.engine \
                   -b 1 \
                   -m 1 \
                   -t int8 \
                   -i nchw \
                   -s \
                   $USER_EXPERIMENT_DIR/experiment/model.step-$NUM_STEP.etlt

Inference:

!tao mask_rcnn inference -i $DATA_DOWNLOAD_DIR/v1_clean/test/images \
                         -o $USER_EXPERIMENT_DIR/experiment/test_predicted_images_int8 \
                         -e $SPECS_DIR/mask-rcnn_train_resnet50.txt \
                         -m $USER_EXPERIMENT_DIR/experiment/export_int/trt.int8.engine \
                         -l $USER_EXPERIMENT_DIR/experiment/annotated_labels \
                         -c $SPECS_DIR/abels.txt \
                         -t 0.2 \
                         -k $KEY \
                         --include_mask

Other Information:

• Hardware - NVIDIA GeForce RTX 2080 Ti
• Network Type - Mask_rcnn
• toolkit_version - 3.22.02
• Training spec file:

seed: 123
use_amp: False
warmup_steps: 1000
checkpoint: "/workspace/tao-experiments/mask_rcnn/pretrained_resnet50/pretrained_instance_segmentation_vresnet50/resnet50.hdf5"
learning_rate_steps: "[10000, 15000, 20000]"
learning_rate_decay_levels: "[0.1, 0.01, 0.001]"
total_steps: 25000
train_batch_size: 2
eval_batch_size: 4
num_steps_per_eval: 1000
momentum: 0.9
l2_weight_decay: 0.0001
warmup_learning_rate: 0.0001
init_learning_rate: 0.01


data_config{
    image_size: "(832, 1344)"  # original images: height 1080, width 1920
    augment_input_data: True
    eval_samples: 20
    training_file_pattern: "/workspace/tao-experiments/TF_data/train*.tfrecord"
    validation_file_pattern: "/workspace/tao-experiments/TF_data/val*.tfrecord"
    val_json_file: "/workspace/tao-experiments/data/v1_clean/test/annotations/instances_default.json"

    # dataset specific parameters
    num_classes: 5
    skip_crowd_during_training: True
}

maskrcnn_config {
    nlayers: 50
    arch: "resnet"
    freeze_bn: True
    freeze_blocks: "[0,1]"
    gt_mask_size: 112
        
    # Region Proposal Network
    rpn_positive_overlap: 0.7
    rpn_negative_overlap: 0.3
    rpn_batch_size_per_im: 256
    rpn_fg_fraction: 0.5
    rpn_min_size: 0.

    # Proposal layer.
    batch_size_per_im: 512
    fg_fraction: 0.25
    fg_thresh: 0.5
    bg_thresh_hi: 0.5
    bg_thresh_lo: 0.

    # Faster-RCNN heads.
    fast_rcnn_mlp_head_dim: 1024
    bbox_reg_weights: "(10., 10., 5., 5.)"

    # Mask-RCNN heads.
    include_mask: True
    mrcnn_resolution: 28

    # training
    train_rpn_pre_nms_topn: 2000
    train_rpn_post_nms_topn: 1000
    train_rpn_nms_threshold: 0.7

    # evaluation
    test_detections_per_image: 100
    test_nms: 0.5
    test_rpn_pre_nms_topn: 1000
    test_rpn_post_nms_topn: 1000
    test_rpn_nms_thresh: 0.7

    # model architecture
    min_level: 2
    max_level: 6
    num_scales: 1
    aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
    anchor_scale: 8

    # localization loss
    rpn_box_loss_weight: 1.0
    fast_rcnn_box_loss_weight: 1.0
    mrcnn_weight_loss_mask: 1.0
}


How do I get better results with int8 model that could be comparable to fp32 and fp16?

For cal_image_dir, please use training images as much as possible.

As suggested, I used only training images, and the result was better than before, but still far behind fp32 and fp16.
When executing the tao converter command, I get two recurring warnings:

  • [WARNING] Missing scale and zero-point for tensor mask_fcn_logits/bias, expect fall back to non-int8 implementation for any layer consuming or producing given tensor

  • [WARNING] No implementation of layer pyramid_crop_and_resize_mask obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.

Detailed output of the tao converter command:

tao convert output.txt (69.9 KB)

Could the performance drop be due to these warnings? Is there any other way to improve int8 accuracy?

Please run the experiments below.

  1. Still use tao-converter to generate the int8 TensorRT engine, but add “-s” to the command line, and check whether it helps.
  2. Still use tao-converter, but change to a lower “-t” threshold.
  3. Instead of tao-converter, run mask_rcnn export directly and add "--engine_file xxx" to generate the fp16 or int8 TensorRT engine. Refer to MaskRCNN — TAO Toolkit 3.22.05 documentation
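A sketch of what the third experiment could look like, written to a helper script for review before running. The flag set mirrors the export command earlier in the thread plus `--engine_file`; treat the exact flags as assumptions to verify against the TAO docs:

```shell
# Write the int8 export command (with --engine_file) to a helper script for review.
cat > run_int8_export.sh <<'EOF'
tao mask_rcnn export -m $USER_EXPERIMENT_DIR/experiment/model.step-$NUM_STEP.tlt \
                     -k $KEY \
                     -o $USER_EXPERIMENT_DIR/experiment/model.step-$NUM_STEP.etlt \
                     -e $SPECS_DIR/mask-rcnn_train_resnet50.txt \
                     --data_type int8 \
                     --cal_image_dir $USER_EXPERIMENT_DIR/data/v1_clean/train-cal/images \
                     --cal_cache_file $USER_EXPERIMENT_DIR/experiment/export_int/maskrcnn.cal \
                     --engine_file $USER_EXPERIMENT_DIR/experiment/export_int/trt.int8.engine
EOF
cat run_int8_export.sh
```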

I am not able to run any of the experiments. Now any tao command I run simply stops without an error message.

Executing tao export output:

env: NUM_STEP=11000
2022-06-14 14:42:20,899 [INFO] root: Registry: ['nvcr.io']
2022-06-14 14:42:21,131 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2022-06-14 14:42:21,190 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/vignesh/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2022-06-14 14:42:55,570 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Executing tao mask_rcnn train output:


2022-06-14 14:43:37,170 [INFO] root: Registry: ['nvcr.io']
2022-06-14 14:43:37,396 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2022-06-14 14:43:37,457 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/vignesh/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2022-06-14 14:44:11,406 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Please run inside the tao docker container and debug what is happening.
Steps:
$ tao mask_rcnn run /bin/bash
# mask_rcnn export xxx

Your latest issue may be related to Error running tao container image - #3 by Morganh . We’re checking.

Thanks for letting me know. Even !tao mask_rcnn run bash $SPECS_DIR/download_and_preprocess_coco.sh $DATA_DOWNLOAD_DIR behaves the same way: it downloaded the nvcr.io/nvidia/tao/tao-toolkit-tf image but did not generate the TFRecords.

As a workaround, please try the following in a terminal.
$ docker run --runtime=nvidia -it --rm --entrypoint “” -v yourlocalfolder:dockerfolder nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 /bin/bash

How do I run the mask-rcnn notebook?

After executing the command, I am getting the following error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: ““””: executable file not found in $PATH: unknown.

Command:
docker run --runtime=nvidia -it --rm --entrypoint “” -v /home/vignesh/cv_samples_v1.3.0/mask_rcnn:/home/vignesh/dockertest nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 /bin/bash

/home/vignesh/dockertest - is a folder I created.
/home/vignesh/cv_samples_v1.3.0/mask_rcnn - is the folder that I am currently working on

To launch a Jupyter notebook, see the example below:

$ docker run --runtime=nvidia -it --rm --entrypoint “” -v ~/demo_3.0:/workspace/demo_3.0 -p 8888:8888 nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 /bin/bash

then,

root@8d6c08489e41:/workspace# jupyter notebook --ip 0.0.0.0 --allow-root

As suggested, I ran the command below

docker run --runtime=nvidia -it --rm --entrypoint “” -v ~/cv_samples_v1.3.0:/workspace/cv_samples_v1.3.0 -p 5050:5050 nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 /bin/bash/

and I got this error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "“”": executable file not found in $PATH: unknown.

Entire output:


Unable to find image 'nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3' locally
v3.22.05-tf1.15.5-py3: Pulling from nvidia/tao/tao-toolkit-tf
...
Digest: sha256:4ce5f8ff41dd1334e940de3c56f4d60658058015c179177a52c83f9e19bf9912
Status: Downloaded newer image for nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "“”": executable file not found in $PATH: unknown.

Hi,
Try running the command below in a terminal. The workaround is to add
--entrypoint ""
with straight ASCII quotes; curly quotes copied from the forum are passed through as a literal argument, which is what produces the "executable file not found" error. For example,
$ docker run --runtime=nvidia -it --rm --entrypoint "" nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 /bin/bash
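The root cause is visible in the error text itself: to a POSIX shell, curly quotes are ordinary characters, not quoting, so `“”` becomes a literal two-character argument instead of an empty string. A quick demonstration:

```shell
# Straight quotes produce an empty argument; curly quotes are ordinary characters.
straight=""
curly=“”
echo "straight length: ${#straight}"   # 0 — a genuinely empty string
echo "curly length:    ${#curly}"      # non-zero: the “” characters themselves
```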

The 2nd workaround is for users who want to use the tao launcher instead of “docker run”.
Steps:

  1. Add "entrypoint": "" to ~/.tao_mounts.json:
    "DockerOptions":{
          "entrypoint": "",
          "shm_size": "16G",
  2. Modify lib/python3.6/site-packages/tao/components/docker_handler/docker_handler.py. This file should be available when you install nvidia-tao. Change

VALID_DOCKER_ARGS = ["user", "ports", "shm_size", "ulimits", "privileged", "network"]

to

VALID_DOCKER_ARGS = ["user", "ports", "shm_size", "ulimits", "privileged", "network", "entrypoint"]
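The edit in step 2 can be applied with `sed`. The sketch below demonstrates it on a stand-in file; point `F` at the real `docker_handler.py` under your site-packages instead:

```shell
# Demonstrate the VALID_DOCKER_ARGS edit on a stand-in file.
F=docker_handler_demo.py   # replace with .../site-packages/.../docker_handler/docker_handler.py
echo 'VALID_DOCKER_ARGS = ["user", "ports", "shm_size", "ulimits", "privileged", "network"]' > "$F"
# Append "entrypoint" to the allowed docker args list.
sed -i 's/"network"\]/"network", "entrypoint"]/' "$F"
cat "$F"
```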

I was able to run the suggested docker command and the notebook too. However, when I ran the command below:

# Create local dir
!mkdir -p $LOCAL_DATA_DIR
!mkdir -p $LOCAL_EXPERIMENT_DIR
# Download and preprocess data
!tao mask_rcnn run bash $SPECS_DIR/wisrd_data_convert.sh $DATA_DOWNLOAD_DIR

I encountered this error:

2022-06-15 06:59:13,705 [INFO] root: Registry: ['nvcr.io']
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1291, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1337, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1286, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1046, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 984, in send
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 450, in send
    timeout=timeout
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 532, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1291, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1337, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1286, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1046, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 984, in send
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 205, in _retrieve_server_version
    return self.version(api_version=False)["ApiVersion"]
  File "/usr/local/lib/python3.6/dist-packages/docker/api/daemon.py", line 181, in version
    return self._result(self._get(url), json=True)
  File "/usr/local/lib/python3.6/dist-packages/docker/utils/decorators.py", line 46, in inner
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 228, in _get
    return self.get(url, **self._set_request_timeout(kwargs))
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 542, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/tao", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/tlt/entrypoint/entrypoint.py", line 115, in main
    args[1:]
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 297, in launch_command
    docker_handler = self.handler_map[
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 152, in handler_map
    docker_mount_file=os.getenv("LAUNCHER_MOUNTS", DOCKER_MOUNT_FILE)
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/docker_handler/docker_handler.py", line 62, in __init__
    self._docker_client = docker.from_env()
  File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 85, in from_env
    timeout=timeout, version=version, **kwargs_from_env(**kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 40, in __init__
    self.api = APIClient(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 188, in __init__
    self._version = self._retrieve_server_version()
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 213, in _retrieve_server_version
    'Error while fetching server API version: {0}'.format(e)
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

Could you please help me resolve this error?

I assume you are using the 1st workaround. In that case, please skip the tao launcher prefix and just use
! mask_rcnn run bash $SPECS_DIR/wisrd_data_convert.sh $DATA_DOWNLOAD_DIR

I suggest you use the 2nd workaround, as mentioned above.

For the 2nd workaround, I added the entrypoint to the ~/.tao_mounts.json file as below:

{
    "Mounts": [
        {
            "source": "wisrd_v0",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/vignesh/cv_samples_v1.3.0/mask_rcnn/specs",
            "destination": "/workspace/tao-experiments/mask_rcnn/specs"
        }
    ],
    "DockerOptions": {
        "entrypoint": "",
        "shm_size": "16G"
    }
}

For the 2nd step: I have installed nvidia-tao, but I am not able to find the tao folder inside site-packages. I did find the nvidia_tao-0.1.23.dist-info folder, but it did not contain any other folder.

FYI - I am using a conda virtual environment, so my path is /home/vignesh/anaconda3/envs/wisrd/lib/python3.6/site-packages

Please search docker_handler.py.

I was able to run the notebook successfully. However, I am not able to run the !ngc commands, for example:

!ngc registry model list nvidia/tao/pretrained_instance_segmentation:*

I installed the NGC CLI following the NVIDIA NGC documentation, but when I run ngc config set, I get the error below:
-bash: /home/vigneshs/downloads/ngc-cli/ngc: cannot execute binary file: Exec format error

Could you help me resolve this error?

Did you download the correct version of the NGC CLI? It should be the AMD64 Linux build if your platform is AMD64-based.
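"Exec format error" almost always means the binary was built for a different architecture than the host. A quick check (the ngc path is the one from the error message; adjust as needed):

```shell
# Compare the host architecture with the downloaded ngc binary.
ARCH=$(uname -m)
echo "Host architecture: $ARCH"        # x86_64 => use the AMD64 Linux ngc-cli build
NGC_BIN="$HOME/downloads/ngc-cli/ngc"  # path from the error message
if [ -f "$NGC_BIN" ]; then
    file "$NGC_BIN"                    # should name the same architecture as the host
else
    echo "ngc binary not found at $NGC_BIN"
fi
```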