Permission Denied Error When Training Mask R-CNN

Please provide the following information when requesting support.

• Network Type: Mask R-CNN (mask_rcnn)
• TLT Version: 3.0 (docker_tag: v3.0-py3)
• Training spec file (shared below):

seed: 123
use_amp: False
warmup_steps: 1000
checkpoint: "/workspace/tlt-experiments/mask_rcnn/pretrained_resnet50/tlt_instance_segmentation_vresnet50/resnet50.hdf5"
learning_rate_steps: "[10000, 15000, 20000]"
learning_rate_decay_levels: "[0.1, 0.02, 0.01]"
total_steps: 25000
train_batch_size: 2
eval_batch_size: 4
num_steps_per_eval: 5000
momentum: 0.9
l2_weight_decay: 0.0001
warmup_learning_rate: 0.0001
init_learning_rate: 0.01

data_config{
    image_size: "(832, 1344)"
    augment_input_data: True
    eval_samples: 500
    training_file_pattern: "/workspace/tlt-experiments/data/train*.tfrecord"
    validation_file_pattern: "/workspace/tlt-experiments/data/val*.tfrecord"
    val_json_file: "/workspace/tlt-experiments/data/annotations/instances_val2017.json"

    # dataset specific parameters
    num_classes: 91
    skip_crowd_during_training: True
}

maskrcnn_config {
    nlayers: 50
    arch: "resnet"
    freeze_bn: True
    freeze_blocks: "[0,1]"
    gt_mask_size: 112
        
    # Region Proposal Network
    rpn_positive_overlap: 0.7
    rpn_negative_overlap: 0.3
    rpn_batch_size_per_im: 256
    rpn_fg_fraction: 0.5
    rpn_min_size: 0.

    # Proposal layer.
    batch_size_per_im: 512
    fg_fraction: 0.25
    fg_thresh: 0.5
    bg_thresh_hi: 0.5
    bg_thresh_lo: 0.

    # Faster-RCNN heads.
    fast_rcnn_mlp_head_dim: 1024
    bbox_reg_weights: "(10., 10., 5., 5.)"

    # Mask-RCNN heads.
    include_mask: True
    mrcnn_resolution: 28

    # training
    train_rpn_pre_nms_topn: 2000
    train_rpn_post_nms_topn: 1000
    train_rpn_nms_threshold: 0.7

    # evaluation
    test_detections_per_image: 100
    test_nms: 0.5
    test_rpn_pre_nms_topn: 1000
    test_rpn_post_nms_topn: 1000
    test_rpn_nms_thresh: 0.7

    # model architecture
    min_level: 2
    max_level: 6
    num_scales: 1
    aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
    anchor_scale: 8

    # localization loss
    rpn_box_loss_weight: 1.0
    fast_rcnn_box_loss_weight: 1.0
    mrcnn_weight_loss_mask: 1.0
}

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I am trying to train Mask R-CNN from the provided Jupyter notebook with the following command:

print("For multi-GPU, change --gpus based on your machine.")
!tlt mask_rcnn train -e $SPECS_DIR/maskrcnn_train_resnet50.txt \
                 -d $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                 -k $KEY \
                 --gpus 1

However, I soon encounter an error:

For multi-GPU, change --gpus based on your machine.
2021-07-22 09:00:24,083 [INFO] root: Registry: ['nvcr.io']
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-wiqz1ro5 because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/mask_rcnn", line 8, in <module>
    sys.exit(main())
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/entrypoint/mask_rcnn.py", line 14, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 256, in launch_job
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/evaluate.py", line 20, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 36, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/hooks/logging_hook.py", line 33, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/logging_backend.py", line 26, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/dllogger_class.py", line 50, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/metaclasses.py", line 29, in __call__
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/dllogger_class.py", line 40, in __init__
  File "/usr/local/lib/python3.6/dist-packages/dllogger/logger.py", line 125, in __init__
    self.file = open(filename, "w")
PermissionError: [Errno 13] Permission denied: 'mrcnn_log.json'
2021-07-22 09:00:29,268 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

All the directories I am working with have drwxrwxr-x permissions, and I cannot find mrcnn_log.json anywhere in them. I would be grateful if anyone could help me resolve this.

To narrow this down, can you run the same training command on your host PC instead of inside the Jupyter notebook?
Please note that you will need to replace the environment variables, such as $SPECS_DIR and $USER_EXPERIMENT_DIR, with the paths given in your ~/.tlt_mounts.json.
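
For example (illustrative paths; the actual mapping depends on your own mounts file):

$ cat ~/.tlt_mounts.json

If /home/vast/tlt on the host is mapped to /workspace/tlt-experiments inside the container, then $SPECS_DIR must be replaced with the container-side path, e.g. /workspace/tlt-experiments/tlt_cv_samples_v1.0.2/mask_rcnn/specs.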

Hi,

I am getting the same error:

In my terminal I am running the following command. Note that I have replaced my key with “KEY” in the snippet but used my actual key in the real command:

(nvidia) vast@remote-subhankar:~/tlt$ tlt mask_rcnn train -e /home/vast/tlt/tlt_cv_samples_v1.0.2/mask_rcnn/specs/maskrcnn_train_resnet50.txt -d /home/vast/tlt/mask_rcnn/experiment_dir_unpruned/ -k KEY --gpus 1 

I get the following error:

2021-07-22 10:24:09,164 [INFO] root: Registry: ['nvcr.io']
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-nqeafpp9 because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/mask_rcnn", line 8, in <module>
    sys.exit(main())
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runf$les/ai_infra/iva/mask_rcnn/entrypoint/mask_rcnn.py", line 14, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runf$les/ai_infra/iva/common/entrypoint/entrypoint.py", line 256, in launch_job
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runf$les/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runf$les/ai_infra/iva/mask_rcnn/scripts/evaluate.py", line 20, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runf$les/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 36, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/hooks/logging_hook.py", line 33, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/logging_backend.py", line 26, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/dllogger_class.py", line 50, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/metaclasses.py", line 29, in __call__
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/dllogger_class.py", line 40, in __init__
  File "/usr/local/lib/python3.6/dist-packages/dllogger/logger.py", line 125, in __init__
    self.file = open(filename, "w")
PermissionError: [Errno 13] Permission denied: 'mrcnn_log.json'
2021-07-22 10:24:14,360 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

This is my training spec file:

seed: 123
use_amp: False
warmup_steps: 1000
checkpoint: "/home/vast/tlt/mask_rcnn/pretrained_resnet50/tlt_instance_segmentation_vresnet50/resnet50.hdf5"
learning_rate_steps: "[10000, 15000, 20000]"
learning_rate_decay_levels: "[0.1, 0.02, 0.01]"
total_steps: 25000
train_batch_size: 2
eval_batch_size: 4
num_steps_per_eval: 5000
momentum: 0.9
l2_weight_decay: 0.0001
warmup_learning_rate: 0.0001
init_learning_rate: 0.01

data_config{
    image_size: "(832, 1344)"
    augment_input_data: True
    eval_samples: 500
    training_file_pattern: "/home/vast/tlt/data/train*.tfrecord"
    validation_file_pattern: "/home/vast/tlt/data/val*.tfrecord"
    val_json_file: "/home/vast/tlt/data/annotations/instances_val2017.json"

    # dataset specific parameters
    num_classes: 91
    skip_crowd_during_training: True
}

maskrcnn_config {
    nlayers: 50
    arch: "resnet"
    freeze_bn: True
    freeze_blocks: "[0,1]"
    gt_mask_size: 112
        
    # Region Proposal Network
    rpn_positive_overlap: 0.7
    rpn_negative_overlap: 0.3
    rpn_batch_size_per_im: 256
    rpn_fg_fraction: 0.5
    rpn_min_size: 0.

    # Proposal layer.
    batch_size_per_im: 512
    fg_fraction: 0.25
    fg_thresh: 0.5
    bg_thresh_hi: 0.5
    bg_thresh_lo: 0.

    # Faster-RCNN heads.
    fast_rcnn_mlp_head_dim: 1024
    bbox_reg_weights: "(10., 10., 5., 5.)"

    # Mask-RCNN heads.
    include_mask: True
    mrcnn_resolution: 28

    # training
    train_rpn_pre_nms_topn: 2000
    train_rpn_post_nms_topn: 1000
    train_rpn_nms_threshold: 0.7

    # evaluation
    test_detections_per_image: 100
    test_nms: 0.5
    test_rpn_pre_nms_topn: 1000
    test_rpn_post_nms_topn: 1000
    test_rpn_nms_thresh: 0.7

    # model architecture
    min_level: 2
    max_level: 6
    num_scales: 1
    aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
    anchor_scale: 8

    # localization loss
    rpn_box_loss_weight: 1.0
    fast_rcnn_box_loss_weight: 1.0
    mrcnn_weight_loss_mask: 1.0
}

Here are my folder permissions:

drwxrwxr-x  3 vast docker 20480 Jul 22 08:07 data/
drwxrwxr-x  5 vast docker  4096 Jul 22 08:10 mask_rcnn/
drwxrwxr-x 21 vast docker  4096 Jun  1 12:55 tlt_cv_samples_v1.0.2/

Can you share your ~/.tlt_mounts.json?

Here's the ~/.tlt_mounts.json:

{
    "Mounts": [
        {
            "source": "/home/vast/tlt",
            "destination": "/workspace/tlt-experiments"
        },
        {
            "source": "/home/vast/tlt/tlt_cv_samples_v1.0.2/mask_rcnn/specs",
            "destination": "/workspace/tlt-experiments/mask_rcnn/specs"
        }
    ],
    "DockerOptions": {
        "user": "1000:1000"
    }
}

According to your ~/.tlt_mounts.json file, please modify your command from

(nvidia) vast@remote-subhankar:~/tlt$ tlt mask_rcnn train -e /home/vast/tlt/tlt_cv_samples_v1.0.2/mask_rcnn/specs/maskrcnn_train_resnet50.txt -d /home/vast/tlt/mask_rcnn/experiment_dir_unpruned/ -k KEY --gpus 1

to

(nvidia) vast@remote-subhankar:~/tlt$ tlt mask_rcnn train -e /workspace/tlt-experiments/tlt_cv_samples_v1.0.2/mask_rcnn/specs/maskrcnn_train_resnet50.txt -d /workspace/tlt-experiments/mask_rcnn/experiment_dir_unpruned/ -k KEY --gpus 1

Also modify the paths in your training spec accordingly.
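
To double-check that the container can actually see the spec at the mapped path, you can list it through the launcher's run passthrough (a sketch; this is the same mechanism used for debugging further down in this thread):

$ tlt mask_rcnn run ls /workspace/tlt-experiments/tlt_cv_samples_v1.0.2/mask_rcnn/specs/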

I made the changes but am still getting the same error:

(nvidia) vast@remote-subhankar:~/tlt$ tlt mask_rcnn train -e /workspace/tlt-experiments/tlt_cv_samples_v1.0.2/mask_rcnn/specs/maskrcnn_train_resnet50.txt -d /workspace/tlt-experiments/mask_rcnn/experiment_dir_unpruned/ -k KEY --gpus 1
2021-07-22 13:48:22,916 [INFO] root: Registry: ['nvcr.io']
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-ew3pjoew because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/mask_rcnn", line 8, in <module>
    sys.exit(main())
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/entrypoint/mask_rcnn.py", line 14, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 256, in launch_job
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/evaluate.py", line 20, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 36, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/hooks/logging_hook.py", line 33, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/logging_backend.py", line 26, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/dllogger_class.py", line 50, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/metaclasses.py", line 29, in __call__
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/dllogger_class.py", line 40, in __init__
  File "/usr/local/lib/python3.6/dist-packages/dllogger/logger.py", line 125, in __init__
    self.file = open(filename, "w")
PermissionError: [Errno 13] Permission denied: 'mrcnn_log.json'
2021-07-22 13:48:27,969 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Training Spec File

seed: 123
use_amp: False
warmup_steps: 1000
checkpoint: "/workspace/tlt-experiments/mask_rcnn/pretrained_resnet50/tlt_instance_segmentation_vresnet50/resnet50.hdf5"
learning_rate_steps: "[10000, 15000, 20000]"
learning_rate_decay_levels: "[0.1, 0.02, 0.01]"
total_steps: 25000
train_batch_size: 2
eval_batch_size: 4
num_steps_per_eval: 5000
momentum: 0.9
l2_weight_decay: 0.0001
warmup_learning_rate: 0.0001
init_learning_rate: 0.01

data_config{
    image_size: "(832, 1344)"
    augment_input_data: True
    eval_samples: 500
    training_file_pattern: "/workspace/tlt-experiments/data/train*.tfrecord"
    validation_file_pattern: "/workspace/tlt-experiments/data/val*.tfrecord"
    val_json_file: "/workspace/tlt-experiments/data/annotations/instances_val2017.json"

    # dataset specific parameters
    num_classes: 91
    skip_crowd_during_training: True
}

maskrcnn_config {
    nlayers: 50
    arch: "resnet"
    freeze_bn: True
    freeze_blocks: "[0,1]"
    gt_mask_size: 112
        
    # Region Proposal Network
    rpn_positive_overlap: 0.7
    rpn_negative_overlap: 0.3
    rpn_batch_size_per_im: 256
    rpn_fg_fraction: 0.5
    rpn_min_size: 0.

    # Proposal layer.
    batch_size_per_im: 512
    fg_fraction: 0.25
    fg_thresh: 0.5
    bg_thresh_hi: 0.5
    bg_thresh_lo: 0.

    # Faster-RCNN heads.
    fast_rcnn_mlp_head_dim: 1024
    bbox_reg_weights: "(10., 10., 5., 5.)"

    # Mask-RCNN heads.
    include_mask: True
    mrcnn_resolution: 28

    # training
    train_rpn_pre_nms_topn: 2000
    train_rpn_post_nms_topn: 1000
    train_rpn_nms_threshold: 0.7

    # evaluation
    test_detections_per_image: 100
    test_nms: 0.5
    test_rpn_pre_nms_topn: 1000
    test_rpn_post_nms_topn: 1000
    test_rpn_nms_thresh: 0.7

    # model architecture
    min_level: 2
    max_level: 6
    num_scales: 1
    aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
    anchor_scale: 8

    # localization loss
    rpn_box_loss_weight: 1.0
    fast_rcnn_box_loss_weight: 1.0
    mrcnn_weight_loss_mask: 1.0
}

Thanks for the info. I will check further.

Actually, I cannot reproduce the error; training runs successfully on my side.

Here is a debugging method for you.
Please log in to the docker container:

$ tlt mask_rcnn run /bin/bash

Then run the training.

# mask_rcnn train -e /workspace/tlt-experiments/tlt_cv_samples_v1.0.2/mask_rcnn/specs/maskrcnn_train_resnet50.txt -d /workspace/tlt-experiments/mask_rcnn/experiment_dir_unpruned/ -k KEY --gpus 1

If it succeeds, a JSON file (mrcnn_log.json) will be saved.

If it still fails, your error originates in /usr/local/lib/python3.6/dist-packages/dllogger/logger.py.

So, please open this file with vim and look at line 125: that is where the JSON file (mrcnn_log.json) fails to be saved.
Please check and debug in that directory.
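
For example, from inside the container (a minimal sketch; the touch repeats the same open-for-write on a relative path that dllogger/logger.py performs, since mrcnn_log.json is created in the process's current working directory):

# id
# pwd
# touch mrcnn_log.json

If the touch fails with the same Errno 13, the working directory is not writable by the user the container is running as.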

Hi,

I solved this issue by removing the following from ~/.tlt_mounts.json:

    "DockerOptions": {
        "user": "1000:1000"

It threw the following warning, but there was no error and training started:

Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
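
A side note on this fix (a sketch with an illustrative path): because the container now runs as root, any files it writes into the mounted folders will be owned by root on the host. They can be reclaimed afterwards with something like:

$ sudo chown -R $(id -u):$(id -g) /home/vast/tlt/mask_rcnn/experiment_dir_unpruned/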

Thanks a lot for your help.
