I'm trying to train Mask R-CNN on a custom dataset that I created and converted to COCO format. The dataset has only one class. Training stops suddenly after saving model.step-0.tlt, without any explicit error or warning from the trainer. Note that training worked when I used the default COCO dataset with 91 classes; the problem occurred only with my custom dataset. I also tried the custom dataset with other models, including the Matterport Mask R-CNN implementation (matterport/Mask_RCNN on GitHub), and it works there.
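Since the only difference between the working and failing runs is the annotation file, here is a minimal sketch of how a COCO-style JSON could be sanity-checked before training. The function name `summarize_coco` and the particular checks are my own assumptions for illustration; they are not part of TAO, and passing them does not guarantee the file is what the trainer expects.

```python
import json
import sys
from collections import Counter


def summarize_coco(coco):
    """Return basic consistency facts about a COCO-style annotation dict."""
    category_ids = sorted(c["id"] for c in coco.get("categories", []))
    image_ids = {img["id"] for img in coco.get("images", [])}
    annotations = coco.get("annotations", [])
    known = set(category_ids)
    return {
        # IDs declared in the "categories" section
        "category_ids": category_ids,
        "num_images": len(image_ids),
        "num_annotations": len(annotations),
        # How many annotations use each category_id
        "per_category": dict(Counter(a["category_id"] for a in annotations)),
        # Annotations pointing at a category that was never declared
        "unknown_category": [a["id"] for a in annotations
                             if a["category_id"] not in known],
        # Annotations pointing at an image that was never declared
        "unknown_image": [a["id"] for a in annotations
                          if a["image_id"] not in image_ids],
        # Annotations with an empty or absent segmentation field
        "missing_segmentation": [a["id"] for a in annotations
                                 if not a.get("segmentation")],
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_coco.py path/to/annotations.json
    with open(sys.argv[1]) as f:
        print(summarize_coco(json.load(f)))
```

For a one-class dataset I would expect a single entry in `category_ids` and empty lists for the three "unknown"/"missing" fields.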
Command:
!tao mask_rcnn train -e $SPECS_DIR/maskrcnn_train_resnet50.txt \
-d $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
-k $KEY \
--gpus 2
Log:
For multi-GPU, change --gpus based on your machine.
2022-01-27 09:02:37,789 [INFO] root: Registry: ['nvcr.io']
2022-01-27 09:02:37,860 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2022-01-27 09:02:37,894 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/abdulmajeed/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet50.txt
Using TensorFlow backend.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet50.txt
[MaskRCNN] INFO : Horovod successfully initialized …
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpj4v4y22o', '_tf_random_seed': 123, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 8
gpu_options {
allow_growth: true
force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: TWO
}
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f246dd7c668>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpu5l3beqy', '_tf_random_seed': 124, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 8
gpu_options {
allow_growth: true
force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: TWO
}
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f81b1b3a390>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[MaskRCNN] INFO : Loading pretrained model…
INFO:tensorflow:Done calling model_fn.
[MaskRCNN] WARNING : Checkpoint is missing variable [l2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [l3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [l4/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l4/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [l5/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l5/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d4/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d4/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d5/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d5/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-class/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-class/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-box/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-box/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc6/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc6/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc7/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc7/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [class-predict/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [class-predict/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [box-predict/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [box-predict/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l0/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l0/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l1/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l1/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [conv5-mask/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [conv5-mask/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/bias]
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
[MaskRCNN] INFO : ============================ GIT REPOSITORY ============================
[MaskRCNN] INFO : BRANCH NAME:
[MaskRCNN] INFO : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[MaskRCNN] INFO : ============================ MODEL STATISTICS ===========================
[MaskRCNN] INFO : # Model Weights: 28,558,811
[MaskRCNN] INFO : # Trainable Weights: 43,975,515
[MaskRCNN] INFO : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[MaskRCNN] INFO : ============================ TRAINABLE VARIABLES ========================
[MaskRCNN] INFO : [#0001] conv1/kernel:0 => (7, 7, 3, 64)
[MaskRCNN] INFO : [#0002] bn_conv1/gamma:0 => (64,)
[MaskRCNN] INFO : [#0003] bn_conv1/beta:0 => (64,)
[MaskRCNN] INFO : [#0004] block_1a_conv_1/kernel:0 => (1, 1, 64, 64)
[MaskRCNN] INFO : [#0005] block_1a_bn_1/gamma:0 => (64,)
[MaskRCNN] INFO : [#0006] block_1a_bn_1/beta:0 => (64,)
[MaskRCNN] INFO : [#0007] block_1a_conv_2/kernel:0 => (3, 3, 64, 64)
[MaskRCNN] INFO : [#0008] block_1a_bn_2/gamma:0 => (64,)
[MaskRCNN] INFO : [#0009] block_1a_bn_2/beta:0 => (64,)
[MaskRCNN] INFO : [#0010] block_1a_conv_3/kernel:0 => (1, 1, 64, 256)
[MaskRCNN] INFO : [#0011] block_1a_bn_3/gamma:0 => (256,)
[MaskRCNN] INFO : [#0012] block_1a_bn_3/beta:0 => (256,)
[MaskRCNN] INFO : [#0013] block_1a_conv_shortcut/kernel:0 => (1, 1, 64, 256)
[MaskRCNN] INFO : [#0014] block_1a_bn_shortcut/gamma:0 => (256,)
[MaskRCNN] INFO : [#0015] block_1a_bn_shortcut/beta:0 => (256,)
[MaskRCNN] INFO : [#0016] block_1b_conv_1/kernel:0 => (1, 1, 256, 64)
[MaskRCNN] INFO : [#0017] block_1b_bn_1/gamma:0 => (64,)
[MaskRCNN] INFO : [#0018] block_1b_bn_1/beta:0 => (64,)
[MaskRCNN] INFO : [#0019] block_1b_conv_2/kernel:0 => (3, 3, 64, 64)
[MaskRCNN] INFO : [#0020] block_1b_bn_2/gamma:0 => (64,)
[MaskRCNN] INFO : [#0021] block_1b_bn_2/beta:0 => (64,)
[MaskRCNN] INFO : [#0022] block_1b_conv_3/kernel:0 => (1, 1, 64, 256)
[MaskRCNN] INFO : [#0023] block_1b_bn_3/gamma:0 => (256,)
[MaskRCNN] INFO : [#0024] block_1b_bn_3/beta:0 => (256,)
[MaskRCNN] INFO : [#0025] block_1c_conv_1/kernel:0 => (1, 1, 256, 64)
[MaskRCNN] INFO : [#0026] block_1c_bn_1/gamma:0 => (64,)
[MaskRCNN] INFO : [#0027] block_1c_bn_1/beta:0 => (64,)
[MaskRCNN] INFO : [#0028] block_1c_conv_2/kernel:0 => (3, 3, 64, 64)
[MaskRCNN] INFO : [#0029] block_1c_bn_2/gamma:0 => (64,)
[MaskRCNN] INFO : [#0030] block_1c_bn_2/beta:0 => (64,)
[MaskRCNN] INFO : [#0031] block_1c_conv_3/kernel:0 => (1, 1, 64, 256)
[MaskRCNN] INFO : [#0032] block_1c_bn_3/gamma:0 => (256,)
[MaskRCNN] INFO : [#0033] block_1c_bn_3/beta:0 => (256,)
[MaskRCNN] INFO : [#0034] block_2a_conv_1/kernel:0 => (1, 1, 256, 128)
[MaskRCNN] INFO : [#0035] block_2a_bn_1/gamma:0 => (128,)
[MaskRCNN] INFO : [#0036] block_2a_bn_1/beta:0 => (128,)
[MaskRCNN] INFO : [#0037] block_2a_conv_2/kernel:0 => (3, 3, 128, 128)
[MaskRCNN] INFO : [#0038] block_2a_bn_2/gamma:0 => (128,)
[MaskRCNN] INFO : [#0039] block_2a_bn_2/beta:0 => (128,)
[MaskRCNN] INFO : [#0040] block_2a_conv_3/kernel:0 => (1, 1, 128, 512)
[MaskRCNN] INFO : [#0041] block_2a_bn_3/gamma:0 => (512,)
[MaskRCNN] INFO : [#0042] block_2a_bn_3/beta:0 => (512,)
[MaskRCNN] INFO : [#0043] block_2a_conv_shortcut/kernel:0 => (1, 1, 256, 512)
[MaskRCNN] INFO : [#0044] block_2a_bn_shortcut/gamma:0 => (512,)
[MaskRCNN] INFO : [#0045] block_2a_bn_shortcut/beta:0 => (512,)
[MaskRCNN] INFO : [#0046] block_2b_conv_1/kernel:0 => (1, 1, 512, 128)
[MaskRCNN] INFO : [#0047] block_2b_bn_1/gamma:0 => (128,)
[MaskRCNN] INFO : [#0048] block_2b_bn_1/beta:0 => (128,)
[MaskRCNN] INFO : [#0049] block_2b_conv_2/kernel:0 => (3, 3, 128, 128)
[MaskRCNN] INFO : [#0050] block_2b_bn_2/gamma:0 => (128,)
[MaskRCNN] INFO : [#0051] block_2b_bn_2/beta:0 => (128,)
[MaskRCNN] INFO : [#0052] block_2b_conv_3/kernel:0 => (1, 1, 128, 512)
[MaskRCNN] INFO : [#0053] block_2b_bn_3/gamma:0 => (512,)
[MaskRCNN] INFO : [#0054] block_2b_bn_3/beta:0 => (512,)
[MaskRCNN] INFO : [#0055] block_2c_conv_1/kernel:0 => (1, 1, 512, 128)
[MaskRCNN] INFO : [#0056] block_2c_bn_1/gamma:0 => (128,)
[MaskRCNN] INFO : [#0057] block_2c_bn_1/beta:0 => (128,)
[MaskRCNN] INFO : [#0058] block_2c_conv_2/kernel:0 => (3, 3, 128, 128)
[MaskRCNN] INFO : [#0059] block_2c_bn_2/gamma:0 => (128,)
[MaskRCNN] INFO : [#0060] block_2c_bn_2/beta:0 => (128,)
[MaskRCNN] INFO : [#0061] block_2c_conv_3/kernel:0 => (1, 1, 128, 512)
[MaskRCNN] INFO : [#0062] block_2c_bn_3/gamma:0 => (512,)
[MaskRCNN] INFO : [#0063] block_2c_bn_3/beta:0 => (512,)
[MaskRCNN] INFO : [#0064] block_2d_conv_1/kernel:0 => (1, 1, 512, 128)
[MaskRCNN] INFO : [#0065] block_2d_bn_1/gamma:0 => (128,)
[MaskRCNN] INFO : [#0066] block_2d_bn_1/beta:0 => (128,)
[MaskRCNN] INFO : [#0067] block_2d_conv_2/kernel:0 => (3, 3, 128, 128)
[MaskRCNN] INFO : [#0068] block_2d_bn_2/gamma:0 => (128,)
[MaskRCNN] INFO : [#0069] block_2d_bn_2/beta:0 => (128,)
[MaskRCNN] INFO : [#0070] block_2d_conv_3/kernel:0 => (1, 1, 128, 512)
[MaskRCNN] INFO : [#0071] block_2d_bn_3/gamma:0 => (512,)
[MaskRCNN] INFO : [#0072] block_2d_bn_3/beta:0 => (512,)
[MaskRCNN] INFO : [#0073] block_3a_conv_1/kernel:0 => (1, 1, 512, 256)
[MaskRCNN] INFO : [#0074] block_3a_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0075] block_3a_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0076] block_3a_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0077] block_3a_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0078] block_3a_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0079] block_3a_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0080] block_3a_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0081] block_3a_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0082] block_3a_conv_shortcut/kernel:0 => (1, 1, 512, 1024)
[MaskRCNN] INFO : [#0083] block_3a_bn_shortcut/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0084] block_3a_bn_shortcut/beta:0 => (1024,)
[MaskRCNN] INFO : [#0085] block_3b_conv_1/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0086] block_3b_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0087] block_3b_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0088] block_3b_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0089] block_3b_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0090] block_3b_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0091] block_3b_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0092] block_3b_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0093] block_3b_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0094] block_3c_conv_1/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0095] block_3c_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0096] block_3c_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0097] block_3c_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0098] block_3c_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0099] block_3c_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0100] block_3c_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0101] block_3c_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0102] block_3c_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0103] block_3d_conv_1/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0104] block_3d_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0105] block_3d_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0106] block_3d_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0107] block_3d_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0108] block_3d_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0109] block_3d_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0110] block_3d_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0111] block_3d_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0112] block_3e_conv_1/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0113] block_3e_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0114] block_3e_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0115] block_3e_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0116] block_3e_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0117] block_3e_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0118] block_3e_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0119] block_3e_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0120] block_3e_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0121] block_3f_conv_1/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0122] block_3f_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0123] block_3f_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0124] block_3f_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0125] block_3f_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0126] block_3f_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0127] block_3f_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0128] block_3f_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0129] block_3f_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0130] block_4a_conv_1/kernel:0 => (1, 1, 1024, 512)
[MaskRCNN] INFO : [#0131] block_4a_bn_1/gamma:0 => (512,)
[MaskRCNN] INFO : [#0132] block_4a_bn_1/beta:0 => (512,)
[MaskRCNN] INFO : [#0133] block_4a_conv_2/kernel:0 => (3, 3, 512, 512)
[MaskRCNN] INFO : [#0134] block_4a_bn_2/gamma:0 => (512,)
[MaskRCNN] INFO : [#0135] block_4a_bn_2/beta:0 => (512,)
[MaskRCNN] INFO : [#0136] block_4a_conv_3/kernel:0 => (1, 1, 512, 2048)
[MaskRCNN] INFO : [#0137] block_4a_bn_3/gamma:0 => (2048,)
[MaskRCNN] INFO : [#0138] block_4a_bn_3/beta:0 => (2048,)
[MaskRCNN] INFO : [#0139] block_4a_conv_shortcut/kernel:0 => (1, 1, 1024, 2048)
[MaskRCNN] INFO : [#0140] block_4a_bn_shortcut/gamma:0 => (2048,)
[MaskRCNN] INFO : [#0141] block_4a_bn_shortcut/beta:0 => (2048,)
[MaskRCNN] INFO : [#0142] block_4b_conv_1/kernel:0 => (1, 1, 2048, 512)
[MaskRCNN] INFO : [#0143] block_4b_bn_1/gamma:0 => (512,)
[MaskRCNN] INFO : [#0144] block_4b_bn_1/beta:0 => (512,)
[MaskRCNN] INFO : [#0145] block_4b_conv_2/kernel:0 => (3, 3, 512, 512)
[MaskRCNN] INFO : [#0146] block_4b_bn_2/gamma:0 => (512,)
[MaskRCNN] INFO : [#0147] block_4b_bn_2/beta:0 => (512,)
[MaskRCNN] INFO : [#0148] block_4b_conv_3/kernel:0 => (1, 1, 512, 2048)
[MaskRCNN] INFO : [#0149] block_4b_bn_3/gamma:0 => (2048,)
[MaskRCNN] INFO : [#0150] block_4b_bn_3/beta:0 => (2048,)
[MaskRCNN] INFO : [#0151] block_4c_conv_1/kernel:0 => (1, 1, 2048, 512)
[MaskRCNN] INFO : [#0152] block_4c_bn_1/gamma:0 => (512,)
[MaskRCNN] INFO : [#0153] block_4c_bn_1/beta:0 => (512,)
[MaskRCNN] INFO : [#0154] block_4c_conv_2/kernel:0 => (3, 3, 512, 512)
[MaskRCNN] INFO : [#0155] block_4c_bn_2/gamma:0 => (512,)
[MaskRCNN] INFO : [#0156] block_4c_bn_2/beta:0 => (512,)
[MaskRCNN] INFO : [#0157] block_4c_conv_3/kernel:0 => (1, 1, 512, 2048)
[MaskRCNN] INFO : [#0158] block_4c_bn_3/gamma:0 => (2048,)
[MaskRCNN] INFO : [#0159] block_4c_bn_3/beta:0 => (2048,)
[MaskRCNN] INFO : [#0160] l2/kernel:0 => (1, 1, 256, 256)
[MaskRCNN] INFO : [#0161] l2/bias:0 => (256,)
[MaskRCNN] INFO : [#0162] l3/kernel:0 => (1, 1, 512, 256)
[MaskRCNN] INFO : [#0163] l3/bias:0 => (256,)
[MaskRCNN] INFO : [#0164] l4/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0165] l4/bias:0 => (256,)
[MaskRCNN] INFO : [#0166] l5/kernel:0 => (1, 1, 2048, 256)
[MaskRCNN] INFO : [#0167] l5/bias:0 => (256,)
[MaskRCNN] INFO : [#0168] post_hoc_d2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0169] post_hoc_d2/bias:0 => (256,)
[MaskRCNN] INFO : [#0170] post_hoc_d3/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0171] post_hoc_d3/bias:0 => (256,)
[MaskRCNN] INFO : [#0172] post_hoc_d4/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0173] post_hoc_d4/bias:0 => (256,)
[MaskRCNN] INFO : [#0174] post_hoc_d5/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0175] post_hoc_d5/bias:0 => (256,)
[MaskRCNN] INFO : [#0176] rpn/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0177] rpn/bias:0 => (256,)
[MaskRCNN] INFO : [#0178] rpn-class/kernel:0 => (1, 1, 256, 3)
[MaskRCNN] INFO : [#0179] rpn-class/bias:0 => (3,)
[MaskRCNN] INFO : [#0180] rpn-box/kernel:0 => (1, 1, 256, 12)
[MaskRCNN] INFO : [#0181] rpn-box/bias:0 => (12,)
[MaskRCNN] INFO : [#0182] fc6/kernel:0 => (12544, 1024)
[MaskRCNN] INFO : [#0183] fc6/bias:0 => (1024,)
[MaskRCNN] INFO : [#0184] fc7/kernel:0 => (1024, 1024)
[MaskRCNN] INFO : [#0185] fc7/bias:0 => (1024,)
[MaskRCNN] INFO : [#0186] class-predict/kernel:0 => (1024, 2)
[MaskRCNN] INFO : [#0187] class-predict/bias:0 => (2,)
[MaskRCNN] INFO : [#0188] box-predict/kernel:0 => (1024, 8)
[MaskRCNN] INFO : [#0189] box-predict/bias:0 => (8,)
[MaskRCNN] INFO : [#0190] mask-conv-l0/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0191] mask-conv-l0/bias:0 => (256,)
[MaskRCNN] INFO : [#0192] mask-conv-l1/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0193] mask-conv-l1/bias:0 => (256,)
[MaskRCNN] INFO : [#0194] mask-conv-l2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0195] mask-conv-l2/bias:0 => (256,)
[MaskRCNN] INFO : [#0196] mask-conv-l3/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0197] mask-conv-l3/bias:0 => (256,)
[MaskRCNN] INFO : [#0198] conv5-mask/kernel:0 => (2, 2, 256, 256)
[MaskRCNN] INFO : [#0199] conv5-mask/bias:0 => (256,)
[MaskRCNN] INFO : [#0200] mask_fcn_logits/kernel:0 => (1, 1, 256, 2)
[MaskRCNN] INFO : [#0201] mask_fcn_logits/bias:0 => (2,)
[MaskRCNN] INFO : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[MaskRCNN] INFO : # ============================================= #
[MaskRCNN] INFO : Start Training
[MaskRCNN] INFO : # %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% #
[GPU 00] Restoring pretrained weights (265 Tensors)
[MaskRCNN] INFO : Pretrained weights loaded with success…
[MaskRCNN] INFO : Saving checkpoints for 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt.
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun.real noticed that process rank 1 with PID 0 on node 41afb24bd50e exited on signal 9 (Killed).
2022-01-27 09:04:07,841 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
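The only failure indication in the log is the mpirun line reporting that rank 1 exited on signal 9. As a small sketch for reading such lines, the snippet below extracts the rank and signal number and maps the number to its symbolic name with Python's standard `signal` module. The regular expression and the helper name `parse_mpirun_exit` are assumptions written against the wording of this one log line, not a documented mpirun format. On Linux, signal 9 is SIGKILL, which a process cannot catch; it is often sent by the kernel's out-of-memory killer, but the log above does not confirm that this is what happened here.

```python
import re
import signal

# Pattern written against the mpirun message seen in the log above (assumed format).
_EXIT_RE = re.compile(
    r"process rank (\d+) with PID (\d+) on node (\S+) exited on signal (\d+)"
)


def parse_mpirun_exit(line):
    """Return rank, PID, node and signal details from an mpirun exit line, or None."""
    match = _EXIT_RE.search(line)
    if match is None:
        return None
    rank, pid, node, signum = match.groups()
    return {
        "rank": int(rank),
        "pid": int(pid),
        "node": node,
        "signal": int(signum),
        # Map the numeric signal to its symbolic name on this platform
        "signal_name": signal.Signals(int(signum)).name,
    }


if __name__ == "__main__":
    line = ("mpirun.real noticed that process rank 1 with PID 0 "
            "on node 41afb24bd50e exited on signal 9 (Killed).")
    print(parse_mpirun_exit(line))
```

If the kill does come from memory pressure, host memory usage during the first steps (for example with `free -h` or `dmesg` on the host) would be the place to look, though that is a guess on my part.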