print("For multi-GPU, change --gpus based on your machine.")

!tao mask_rcnn train -e $SPECS_DIR/maskrcnn_train_resnet50.txt \
                     -d $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                     -k $KEY \
                     --gpus 4

For multi-GPU, change --gpus based on your machine.
2022-06-09 08:35:48,975 [INFO] root: Registry: ['nvcr.io']
2022-06-09 08:35:49,232 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2022-06-09 08:35:49,403 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/sysadmin/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet50.txt
Using TensorFlow backend.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet50.txt
Using TensorFlow backend.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet50.txt
Using TensorFlow backend.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet50.txt
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpv8s72p4s', '_tf_random_seed': 126, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 26
gpu_options {
  allow_growth: true
  force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: TWO
  }
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f85577bd358>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[MaskRCNN] INFO    : Horovod successfully initialized ...
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpbj7q43mf', '_tf_random_seed': 124, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 26
gpu_options {
  allow_growth: true
  force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: TWO
  }
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fbe66cb0320>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpc626qosb', '_tf_random_seed': 123, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 26
gpu_options {
  allow_growth: true
  force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: TWO
  }
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f149fd6a550>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmp1ckrizwj', '_tf_random_seed': 125, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 26
gpu_options {
  allow_growth: true
  force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: TWO
  }
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe2c5a702b0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[MaskRCNN] INFO    : Loading pretrained model...
WARNING:tensorflow:Entity <function InputReader.__call__.<locals>._prefetch_dataset at 0x7f855412c268> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function InputReader.__call__.<locals>._prefetch_dataset at 0x7f855412c268>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function InputReader.__call__.<locals>._prefetch_dataset at 0x7fe2c419a268> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function InputReader.__call__.<locals>._prefetch_dataset at 0x7fe2c419a268>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function InputReader.__call__.<locals>._prefetch_dataset at 0x7fbe64046268> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function InputReader.__call__.<locals>._prefetch_dataset at 0x7fbe64046268>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code

WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:349: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:349: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:349: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:Entity <function dataset_parser at 0x7f8564301840> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function dataset_parser at 0x7f8564301840>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function dataset_parser at 0x7fe2d25b4840> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function dataset_parser at 0x7fe2d25b4840>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:Entity <function dataset_parser at 0x7fbe737f4840> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function dataset_parser at 0x7fbe737f4840>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Entity <bound method AnchorLayer.call of <iva.mask_rcnn.layers.anchor_layer.AnchorLayer object at 0x7f8554130da0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method AnchorLayer.call of <iva.mask_rcnn.layers.anchor_layer.AnchorLayer object at 0x7f8554130da0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Entity <bound method AnchorLayer.call of <iva.mask_rcnn.layers.anchor_layer.AnchorLayer object at 0x7fe2c416f080>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method AnchorLayer.call of <iva.mask_rcnn.layers.anchor_layer.AnchorLayer object at 0x7fe2c416f080>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Entity <bound method AnchorLayer.call of <iva.mask_rcnn.layers.anchor_layer.AnchorLayer object at 0x7fbe5c15d0b8>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method AnchorLayer.call of <iva.mask_rcnn.layers.anchor_layer.AnchorLayer object at 0x7fbe5c15d0b8>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method MultilevelProposal.call of <iva.mask_rcnn.layers.multilevel_proposal_layer.MultilevelProposal object at 0x7fe281c11f28>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MultilevelProposal.call of <iva.mask_rcnn.layers.multilevel_proposal_layer.MultilevelProposal object at 0x7fe281c11f28>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method MultilevelProposal.call of <iva.mask_rcnn.layers.multilevel_proposal_layer.MultilevelProposal object at 0x7fbe2ce85f28>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MultilevelProposal.call of <iva.mask_rcnn.layers.multilevel_proposal_layer.MultilevelProposal object at 0x7fbe2ce85f28>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code

WARNING:tensorflow:Entity <bound method MultilevelProposal.call of <iva.mask_rcnn.layers.multilevel_proposal_layer.MultilevelProposal object at 0x7f851d99bef0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MultilevelProposal.call of <iva.mask_rcnn.layers.multilevel_proposal_layer.MultilevelProposal object at 0x7f851d99bef0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ProposalAssignment.call of <iva.mask_rcnn.layers.proposal_assignment_layer.ProposalAssignment object at 0x7fe281c15518>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ProposalAssignment.call of <iva.mask_rcnn.layers.proposal_assignment_layer.ProposalAssignment object at 0x7fe281c15518>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ProposalAssignment.call of <iva.mask_rcnn.layers.proposal_assignment_layer.ProposalAssignment object at 0x7f851d9204e0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ProposalAssignment.call of <iva.mask_rcnn.layers.proposal_assignment_layer.ProposalAssignment object at 0x7f851d9204e0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ProposalAssignment.call of <iva.mask_rcnn.layers.proposal_assignment_layer.ProposalAssignment object at 0x7fbe2ce89518>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ProposalAssignment.call of <iva.mask_rcnn.layers.proposal_assignment_layer.ProposalAssignment object at 0x7fbe2ce89518>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7fe281a36630>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7fe281a36630>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7fbe2cec6e80>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7fbe2cec6e80>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7f851d7c0630>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7f851d7c0630>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fe2817d4b38>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fe2817d4b38>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fe2817d4be0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fe2817d4be0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fe281788e48>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fe281788e48>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method BoxTargetEncoder.call of <iva.mask_rcnn.layers.box_target_encoder.BoxTargetEncoder object at 0x7fe2817d4c88>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method BoxTargetEncoder.call of <iva.mask_rcnn.layers.box_target_encoder.BoxTargetEncoder object at 0x7fe2817d4c88>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code

WARNING:tensorflow:Entity <bound method ForegroundSelectorForMask.call of <iva.mask_rcnn.layers.foreground_selector_for_mask.ForegroundSelectorForMask object at 0x7fe281795400>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ForegroundSelectorForMask.call of <iva.mask_rcnn.layers.foreground_selector_for_mask.ForegroundSelectorForMask object at 0x7fe281795400>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7f851d55db38>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7f851d55db38>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7fe2817d4c50>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7fe2817d4c50>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fbe2ca47b38>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fbe2ca47b38>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7f851d55dbe0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7f851d55dbe0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7f851d512e48>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7f851d512e48>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method BoxTargetEncoder.call of <iva.mask_rcnn.layers.box_target_encoder.BoxTargetEncoder object at 0x7f851d51fda0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method BoxTargetEncoder.call of <iva.mask_rcnn.layers.box_target_encoder.BoxTargetEncoder object at 0x7f851d51fda0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fbe2ca47be0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fbe2ca47be0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fbe2c9fce48>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fbe2c9fce48>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ForegroundSelectorForMask.call of <iva.mask_rcnn.layers.foreground_selector_for_mask.ForegroundSelectorForMask object at 0x7f851d55dcc0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ForegroundSelectorForMask.call of <iva.mask_rcnn.layers.foreground_selector_for_mask.ForegroundSelectorForMask object at 0x7f851d55dcc0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method BoxTargetEncoder.call of <iva.mask_rcnn.layers.box_target_encoder.BoxTargetEncoder object at 0x7fbe2ca47c88>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method BoxTargetEncoder.call of <iva.mask_rcnn.layers.box_target_encoder.BoxTargetEncoder object at 0x7fbe2ca47c88>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code

WARNING:tensorflow:Entity <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7f851d51fd68>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7f851d51fd68>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ForegroundSelectorForMask.call of <iva.mask_rcnn.layers.foreground_selector_for_mask.ForegroundSelectorForMask object at 0x7fbe2ca09400>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ForegroundSelectorForMask.call of <iva.mask_rcnn.layers.foreground_selector_for_mask.ForegroundSelectorForMask object at 0x7fbe2ca09400>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7fbe2ca47c50>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7fbe2ca47c50>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fe2815c2748>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fe2815c2748>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7f851d34c748>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7f851d34c748>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fbe2c836748>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7fbe2c836748>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method MaskPostprocess.call of <iva.mask_rcnn.layers.mask_postprocess_layer.MaskPostprocess object at 0x7fbe2c7b3a90>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MaskPostprocess.call of <iva.mask_rcnn.layers.mask_postprocess_layer.MaskPostprocess object at 0x7fbe2c7b3a90>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method MaskTargetsLayer.call of <iva.mask_rcnn.layers.mask_targets_layer.MaskTargetsLayer object at 0x7fbe2c836908>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MaskTargetsLayer.call of <iva.mask_rcnn.layers.mask_targets_layer.MaskTargetsLayer object at 0x7fbe2c836908>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method MaskPostprocess.call of <iva.mask_rcnn.layers.mask_postprocess_layer.MaskPostprocess object at 0x7fe28153ea90>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MaskPostprocess.call of <iva.mask_rcnn.layers.mask_postprocess_layer.MaskPostprocess object at 0x7fe28153ea90>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method MaskTargetsLayer.call of <iva.mask_rcnn.layers.mask_targets_layer.MaskTargetsLayer object at 0x7fe2815c2908>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MaskTargetsLayer.call of <iva.mask_rcnn.layers.mask_targets_layer.MaskTargetsLayer object at 0x7fe2815c2908>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code

WARNING:tensorflow:Entity <bound method MaskPostprocess.call of <iva.mask_rcnn.layers.mask_postprocess_layer.MaskPostprocess object at 0x7f851d34c828>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MaskPostprocess.call of <iva.mask_rcnn.layers.mask_postprocess_layer.MaskPostprocess object at 0x7f851d34c828>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method MaskTargetsLayer.call of <iva.mask_rcnn.layers.mask_targets_layer.MaskTargetsLayer object at 0x7f851d34c3c8>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MaskTargetsLayer.call of <iva.mask_rcnn.layers.mask_targets_layer.MaskTargetsLayer object at 0x7f851d34c3c8>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py:220: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py:223: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py:224: The name tf.keras.backend.get_session is deprecated. Please use tf.compat.v1.keras.backend.get_session instead.

INFO:tensorflow:Done calling model_fn.
[MaskRCNN] INFO    : Create EncryptCheckpointSaverHook.

[MaskRCNN] INFO    : =================================
[MaskRCNN] INFO    :      Start training cycle 01
[MaskRCNN] INFO    : =================================
    
[MaskRCNN] INFO    : Using Dataset Sharding with Horovod
WARNING:tensorflow:Entity <function InputReader.__call__.<locals>._prefetch_dataset at 0x7f14659569d8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function InputReader.__call__.<locals>._prefetch_dataset at 0x7f14659569d8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:349: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

INFO:tensorflow:Done calling model_fn.
WARNING:tensorflow:Entity <function dataset_parser at 0x7f14ac8ab950> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function dataset_parser at 0x7f14ac8ab950>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Calling model_fn.
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : Building model graph...
[MaskRCNN] INFO    : ***********************
WARNING:tensorflow:Entity <bound method AnchorLayer.call of <iva.mask_rcnn.layers.anchor_layer.AnchorLayer object at 0x7f14650e1390>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method AnchorLayer.call of <iva.mask_rcnn.layers.anchor_layer.AnchorLayer object at 0x7f14650e1390>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
WARNING:tensorflow:Entity <bound method MultilevelProposal.call of <iva.mask_rcnn.layers.multilevel_proposal_layer.MultilevelProposal object at 0x7f1464567f98>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MultilevelProposal.call of <iva.mask_rcnn.layers.multilevel_proposal_layer.MultilevelProposal object at 0x7f1464567f98>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code

[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_2/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_3/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_4/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_5/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_6/
WARNING:tensorflow:Entity <bound method ProposalAssignment.call of <iva.mask_rcnn.layers.proposal_assignment_layer.ProposalAssignment object at 0x7f146456d5c0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ProposalAssignment.call of <iva.mask_rcnn.layers.proposal_assignment_layer.ProposalAssignment object at 0x7f146456d5c0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO:tensorflow:Running local_init_op.
WARNING:tensorflow:Entity <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7f14644136a0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7f14644136a0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO:tensorflow:Done running local_init_op.
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7f146412cba8>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7f146412cba8>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7f146412cc50>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7f146412cc50>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7f1464161eb8>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7f1464161eb8>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method BoxTargetEncoder.call of <iva.mask_rcnn.layers.box_target_encoder.BoxTargetEncoder object at 0x7f146412ccf8>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method BoxTargetEncoder.call of <iva.mask_rcnn.layers.box_target_encoder.BoxTargetEncoder object at 0x7f146412ccf8>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ForegroundSelectorForMask.call of <iva.mask_rcnn.layers.foreground_selector_for_mask.ForegroundSelectorForMask object at 0x7f145ffc4470>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ForegroundSelectorForMask.call of <iva.mask_rcnn.layers.foreground_selector_for_mask.ForegroundSelectorForMask object at 0x7f145ffc4470>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7f146412ccc0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MultilevelCropResize.call of <iva.mask_rcnn.layers.multilevel_crop_resize_layer.MultilevelCropResize object at 0x7f146412ccc0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7f145fe777b8>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method ReshapeLayer.call of <iva.mask_rcnn.layers.reshape_layer.ReshapeLayer object at 0x7f145fe777b8>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
WARNING:tensorflow:Entity <bound method MaskPostprocess.call of <iva.mask_rcnn.layers.mask_postprocess_layer.MaskPostprocess object at 0x7f145fe77898>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MaskPostprocess.call of <iva.mask_rcnn.layers.mask_postprocess_layer.MaskPostprocess object at 0x7f145fe77898>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
Parsing Inputs...
[MaskRCNN] INFO    : [Training Compute Statistics] 517.5 GFLOPS/image
4 ops no flops stats due to incomplete shapes.
WARNING:tensorflow:Entity <bound method MaskTargetsLayer.call of <iva.mask_rcnn.layers.mask_targets_layer.MaskTargetsLayer object at 0x7f145f27cfd0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method MaskTargetsLayer.call of <iva.mask_rcnn.layers.mask_targets_layer.MaskTargetsLayer object at 0x7f145f27cfd0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:Done calling model_fn.
[MaskRCNN] WARNING : Checkpoint is missing variable [l2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [l3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [l4/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l4/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [l5/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l5/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d4/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d4/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d5/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d5/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-class/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-class/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-box/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-box/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc6/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc6/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc7/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc7/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [class-predict/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [class-predict/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [box-predict/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [box-predict/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l0/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l0/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l1/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l1/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [conv5-mask/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [conv5-mask/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/bias]
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
[MaskRCNN] INFO    : ============================ GIT REPOSITORY ============================
[MaskRCNN] INFO    : BRANCH NAME: 
[MaskRCNN] INFO    : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    
[MaskRCNN] INFO    : ============================ MODEL STATISTICS ===========================
[MaskRCNN] INFO    : # Model Weights: 28,650,305
[MaskRCNN] INFO    : # Trainable Weights: 44,067,009
[MaskRCNN] INFO    : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    
[MaskRCNN] INFO    : ============================ TRAINABLE VARIABLES ========================
[MaskRCNN] INFO    : [#0001] conv1/kernel:0                                               => (7, 7, 3, 64)
[MaskRCNN] INFO    : [#0002] bn_conv1/gamma:0                                             => (64,)
[MaskRCNN] INFO    : [#0003] bn_conv1/beta:0                                              => (64,)
[MaskRCNN] INFO    : [#0004] block_1a_conv_1/kernel:0                                     => (1, 1, 64, 64)
[MaskRCNN] INFO    : [#0005] block_1a_bn_1/gamma:0                                        => (64,)
[MaskRCNN] INFO    : [#0006] block_1a_bn_1/beta:0                                         => (64,)
[MaskRCNN] INFO    : [#0007] block_1a_conv_2/kernel:0                                     => (3, 3, 64, 64)
[MaskRCNN] INFO    : [#0008] block_1a_bn_2/gamma:0                                        => (64,)
[MaskRCNN] INFO    : [#0009] block_1a_bn_2/beta:0                                         => (64,)
[MaskRCNN] INFO    : [#0010] block_1a_conv_3/kernel:0                                     => (1, 1, 64, 256)
[MaskRCNN] INFO    : [#0011] block_1a_bn_3/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0012] block_1a_bn_3/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0013] block_1a_conv_shortcut/kernel:0                              => (1, 1, 64, 256)
[MaskRCNN] INFO    : [#0014] block_1a_bn_shortcut/gamma:0                                 => (256,)
[MaskRCNN] INFO    : [#0015] block_1a_bn_shortcut/beta:0                                  => (256,)
[MaskRCNN] INFO    : [#0016] block_1b_conv_1/kernel:0                                     => (1, 1, 256, 64)
[MaskRCNN] INFO    : [#0017] block_1b_bn_1/gamma:0                                        => (64,)
[MaskRCNN] INFO    : [#0018] block_1b_bn_1/beta:0                                         => (64,)
[MaskRCNN] INFO    : [#0019] block_1b_conv_2/kernel:0                                     => (3, 3, 64, 64)
[MaskRCNN] INFO    : [#0020] block_1b_bn_2/gamma:0                                        => (64,)
[MaskRCNN] INFO    : [#0021] block_1b_bn_2/beta:0                                         => (64,)
[MaskRCNN] INFO    : [#0022] block_1b_conv_3/kernel:0                                     => (1, 1, 64, 256)
[MaskRCNN] INFO    : [#0023] block_1b_bn_3/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0024] block_1b_bn_3/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0025] block_1c_conv_1/kernel:0                                     => (1, 1, 256, 64)
[MaskRCNN] INFO    : [#0026] block_1c_bn_1/gamma:0                                        => (64,)
[MaskRCNN] INFO    : [#0027] block_1c_bn_1/beta:0                                         => (64,)
[MaskRCNN] INFO    : [#0028] block_1c_conv_2/kernel:0                                     => (3, 3, 64, 64)
[MaskRCNN] INFO    : [#0029] block_1c_bn_2/gamma:0                                        => (64,)
[MaskRCNN] INFO    : [#0030] block_1c_bn_2/beta:0                                         => (64,)
[MaskRCNN] INFO    : [#0031] block_1c_conv_3/kernel:0                                     => (1, 1, 64, 256)
[MaskRCNN] INFO    : [#0032] block_1c_bn_3/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0033] block_1c_bn_3/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0034] block_2a_conv_1/kernel:0                                     => (1, 1, 256, 128)
[MaskRCNN] INFO    : [#0035] block_2a_bn_1/gamma:0                                        => (128,)
[MaskRCNN] INFO    : [#0036] block_2a_bn_1/beta:0                                         => (128,)
[MaskRCNN] INFO    : [#0037] block_2a_conv_2/kernel:0                                     => (3, 3, 128, 128)
[MaskRCNN] INFO    : [#0038] block_2a_bn_2/gamma:0                                        => (128,)
[MaskRCNN] INFO    : [#0039] block_2a_bn_2/beta:0                                         => (128,)
[MaskRCNN] INFO    : [#0040] block_2a_conv_3/kernel:0                                     => (1, 1, 128, 512)
[MaskRCNN] INFO    : [#0041] block_2a_bn_3/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0042] block_2a_bn_3/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0043] block_2a_conv_shortcut/kernel:0                              => (1, 1, 256, 512)
[MaskRCNN] INFO    : [#0044] block_2a_bn_shortcut/gamma:0                                 => (512,)
[MaskRCNN] INFO    : [#0045] block_2a_bn_shortcut/beta:0                                  => (512,)
[MaskRCNN] INFO    : [#0046] block_2b_conv_1/kernel:0                                     => (1, 1, 512, 128)
[MaskRCNN] INFO    : [#0047] block_2b_bn_1/gamma:0                                        => (128,)
[MaskRCNN] INFO    : [#0048] block_2b_bn_1/beta:0                                         => (128,)
[MaskRCNN] INFO    : [#0049] block_2b_conv_2/kernel:0                                     => (3, 3, 128, 128)
[MaskRCNN] INFO    : [#0050] block_2b_bn_2/gamma:0                                        => (128,)
[MaskRCNN] INFO    : [#0051] block_2b_bn_2/beta:0                                         => (128,)
[MaskRCNN] INFO    : [#0052] block_2b_conv_3/kernel:0                                     => (1, 1, 128, 512)
[MaskRCNN] INFO    : [#0053] block_2b_bn_3/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0054] block_2b_bn_3/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0055] block_2c_conv_1/kernel:0                                     => (1, 1, 512, 128)
[MaskRCNN] INFO    : [#0056] block_2c_bn_1/gamma:0                                        => (128,)
[MaskRCNN] INFO    : [#0057] block_2c_bn_1/beta:0                                         => (128,)
[MaskRCNN] INFO    : [#0058] block_2c_conv_2/kernel:0                                     => (3, 3, 128, 128)
[MaskRCNN] INFO    : [#0059] block_2c_bn_2/gamma:0                                        => (128,)
[MaskRCNN] INFO    : [#0060] block_2c_bn_2/beta:0                                         => (128,)
[MaskRCNN] INFO    : [#0061] block_2c_conv_3/kernel:0                                     => (1, 1, 128, 512)
[MaskRCNN] INFO    : [#0062] block_2c_bn_3/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0063] block_2c_bn_3/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0064] block_2d_conv_1/kernel:0                                     => (1, 1, 512, 128)
[MaskRCNN] INFO    : [#0065] block_2d_bn_1/gamma:0                                        => (128,)
[MaskRCNN] INFO    : [#0066] block_2d_bn_1/beta:0                                         => (128,)
[MaskRCNN] INFO    : [#0067] block_2d_conv_2/kernel:0                                     => (3, 3, 128, 128)
[MaskRCNN] INFO    : [#0068] block_2d_bn_2/gamma:0                                        => (128,)
[MaskRCNN] INFO    : [#0069] block_2d_bn_2/beta:0                                         => (128,)
[MaskRCNN] INFO    : [#0070] block_2d_conv_3/kernel:0                                     => (1, 1, 128, 512)
[MaskRCNN] INFO    : [#0071] block_2d_bn_3/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0072] block_2d_bn_3/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0073] block_3a_conv_1/kernel:0                                     => (1, 1, 512, 256)
[MaskRCNN] INFO    : [#0074] block_3a_bn_1/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0075] block_3a_bn_1/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0076] block_3a_conv_2/kernel:0                                     => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0077] block_3a_bn_2/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0078] block_3a_bn_2/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0079] block_3a_conv_3/kernel:0                                     => (1, 1, 256, 1024)
[MaskRCNN] INFO    : [#0080] block_3a_bn_3/gamma:0                                        => (1024,)
[MaskRCNN] INFO    : [#0081] block_3a_bn_3/beta:0                                         => (1024,)
[MaskRCNN] INFO    : [#0082] block_3a_conv_shortcut/kernel:0                              => (1, 1, 512, 1024)
[MaskRCNN] INFO    : [#0083] block_3a_bn_shortcut/gamma:0                                 => (1024,)
[MaskRCNN] INFO    : [#0084] block_3a_bn_shortcut/beta:0                                  => (1024,)
[MaskRCNN] INFO    : [#0085] block_3b_conv_1/kernel:0                                     => (1, 1, 1024, 256)
[MaskRCNN] INFO    : [#0086] block_3b_bn_1/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0087] block_3b_bn_1/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0088] block_3b_conv_2/kernel:0                                     => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0089] block_3b_bn_2/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0090] block_3b_bn_2/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0091] block_3b_conv_3/kernel:0                                     => (1, 1, 256, 1024)
[MaskRCNN] INFO    : [#0092] block_3b_bn_3/gamma:0                                        => (1024,)
[MaskRCNN] INFO    : [#0093] block_3b_bn_3/beta:0                                         => (1024,)
[MaskRCNN] INFO    : [#0094] block_3c_conv_1/kernel:0                                     => (1, 1, 1024, 256)
[MaskRCNN] INFO    : [#0095] block_3c_bn_1/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0096] block_3c_bn_1/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0097] block_3c_conv_2/kernel:0                                     => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0098] block_3c_bn_2/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0099] block_3c_bn_2/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0100] block_3c_conv_3/kernel:0                                     => (1, 1, 256, 1024)
[MaskRCNN] INFO    : [#0101] block_3c_bn_3/gamma:0                                        => (1024,)
[MaskRCNN] INFO    : [#0102] block_3c_bn_3/beta:0                                         => (1024,)
[MaskRCNN] INFO    : [#0103] block_3d_conv_1/kernel:0                                     => (1, 1, 1024, 256)
[MaskRCNN] INFO    : [#0104] block_3d_bn_1/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0105] block_3d_bn_1/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0106] block_3d_conv_2/kernel:0                                     => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0107] block_3d_bn_2/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0108] block_3d_bn_2/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0109] block_3d_conv_3/kernel:0                                     => (1, 1, 256, 1024)
[MaskRCNN] INFO    : [#0110] block_3d_bn_3/gamma:0                                        => (1024,)
[MaskRCNN] INFO    : [#0111] block_3d_bn_3/beta:0                                         => (1024,)
[MaskRCNN] INFO    : [#0112] block_3e_conv_1/kernel:0                                     => (1, 1, 1024, 256)
[MaskRCNN] INFO    : [#0113] block_3e_bn_1/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0114] block_3e_bn_1/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0115] block_3e_conv_2/kernel:0                                     => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0116] block_3e_bn_2/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0117] block_3e_bn_2/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0118] block_3e_conv_3/kernel:0                                     => (1, 1, 256, 1024)
[MaskRCNN] INFO    : [#0119] block_3e_bn_3/gamma:0                                        => (1024,)
[MaskRCNN] INFO    : [#0120] block_3e_bn_3/beta:0                                         => (1024,)
[MaskRCNN] INFO    : [#0121] block_3f_conv_1/kernel:0                                     => (1, 1, 1024, 256)
[MaskRCNN] INFO    : [#0122] block_3f_bn_1/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0123] block_3f_bn_1/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0124] block_3f_conv_2/kernel:0                                     => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0125] block_3f_bn_2/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0126] block_3f_bn_2/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0127] block_3f_conv_3/kernel:0                                     => (1, 1, 256, 1024)
[MaskRCNN] INFO    : [#0128] block_3f_bn_3/gamma:0                                        => (1024,)
[MaskRCNN] INFO    : [#0129] block_3f_bn_3/beta:0                                         => (1024,)
[MaskRCNN] INFO    : [#0130] block_4a_conv_1/kernel:0                                     => (1, 1, 1024, 512)
[MaskRCNN] INFO    : [#0131] block_4a_bn_1/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0132] block_4a_bn_1/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0133] block_4a_conv_2/kernel:0                                     => (3, 3, 512, 512)
[MaskRCNN] INFO    : [#0134] block_4a_bn_2/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0135] block_4a_bn_2/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0136] block_4a_conv_3/kernel:0                                     => (1, 1, 512, 2048)
[MaskRCNN] INFO    : [#0137] block_4a_bn_3/gamma:0                                        => (2048,)
[MaskRCNN] INFO    : [#0138] block_4a_bn_3/beta:0                                         => (2048,)
[MaskRCNN] INFO    : [#0139] block_4a_conv_shortcut/kernel:0                              => (1, 1, 1024, 2048)
[MaskRCNN] INFO    : [#0140] block_4a_bn_shortcut/gamma:0                                 => (2048,)
[MaskRCNN] INFO    : [#0141] block_4a_bn_shortcut/beta:0                                  => (2048,)
[MaskRCNN] INFO    : [#0142] block_4b_conv_1/kernel:0                                     => (1, 1, 2048, 512)
[MaskRCNN] INFO    : [#0143] block_4b_bn_1/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0144] block_4b_bn_1/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0145] block_4b_conv_2/kernel:0                                     => (3, 3, 512, 512)
[MaskRCNN] INFO    : [#0146] block_4b_bn_2/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0147] block_4b_bn_2/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0148] block_4b_conv_3/kernel:0                                     => (1, 1, 512, 2048)
[MaskRCNN] INFO    : [#0149] block_4b_bn_3/gamma:0                                        => (2048,)
[MaskRCNN] INFO    : [#0150] block_4b_bn_3/beta:0                                         => (2048,)
[MaskRCNN] INFO    : [#0151] block_4c_conv_1/kernel:0                                     => (1, 1, 2048, 512)
[MaskRCNN] INFO    : [#0152] block_4c_bn_1/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0153] block_4c_bn_1/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0154] block_4c_conv_2/kernel:0                                     => (3, 3, 512, 512)
[MaskRCNN] INFO    : [#0155] block_4c_bn_2/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0156] block_4c_bn_2/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0157] block_4c_conv_3/kernel:0                                     => (1, 1, 512, 2048)
[MaskRCNN] INFO    : [#0158] block_4c_bn_3/gamma:0                                        => (2048,)
[MaskRCNN] INFO    : [#0159] block_4c_bn_3/beta:0                                         => (2048,)
[MaskRCNN] INFO    : [#0160] l2/kernel:0                                                  => (1, 1, 256, 256)
[MaskRCNN] INFO    : [#0161] l2/bias:0                                                    => (256,)
[MaskRCNN] INFO    : [#0162] l3/kernel:0                                                  => (1, 1, 512, 256)
[MaskRCNN] INFO    : [#0163] l3/bias:0                                                    => (256,)
[MaskRCNN] INFO    : [#0164] l4/kernel:0                                                  => (1, 1, 1024, 256)
[MaskRCNN] INFO    : [#0165] l4/bias:0                                                    => (256,)
[MaskRCNN] INFO    : [#0166] l5/kernel:0                                                  => (1, 1, 2048, 256)
[MaskRCNN] INFO    : [#0167] l5/bias:0                                                    => (256,)
[MaskRCNN] INFO    : [#0168] post_hoc_d2/kernel:0                                         => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0169] post_hoc_d2/bias:0                                           => (256,)
[MaskRCNN] INFO    : [#0170] post_hoc_d3/kernel:0                                         => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0171] post_hoc_d3/bias:0                                           => (256,)
[MaskRCNN] INFO    : [#0172] post_hoc_d4/kernel:0                                         => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0173] post_hoc_d4/bias:0                                           => (256,)
[MaskRCNN] INFO    : [#0174] post_hoc_d5/kernel:0                                         => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0175] post_hoc_d5/bias:0                                           => (256,)
[MaskRCNN] INFO    : [#0176] rpn/kernel:0                                                 => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0177] rpn/bias:0                                                   => (256,)
[MaskRCNN] INFO    : [#0178] rpn-class/kernel:0                                           => (1, 1, 256, 3)
[MaskRCNN] INFO    : [#0179] rpn-class/bias:0                                             => (3,)
[MaskRCNN] INFO    : [#0180] rpn-box/kernel:0                                             => (1, 1, 256, 12)
[MaskRCNN] INFO    : [#0181] rpn-box/bias:0                                               => (12,)
[MaskRCNN] INFO    : [#0182] fc6/kernel:0                                                 => (12544, 1024)
[MaskRCNN] INFO    : [#0183] fc6/bias:0                                                   => (1024,)
[MaskRCNN] INFO    : [#0184] fc7/kernel:0                                                 => (1024, 1024)
[MaskRCNN] INFO    : [#0185] fc7/bias:0                                                   => (1024,)
[MaskRCNN] INFO    : [#0186] class-predict/kernel:0                                       => (1024, 19)
[MaskRCNN] INFO    : [#0187] class-predict/bias:0                                         => (19,)
[MaskRCNN] INFO    : [#0188] box-predict/kernel:0                                         => (1024, 76)
[MaskRCNN] INFO    : [#0189] box-predict/bias:0                                           => (76,)
[MaskRCNN] INFO    : [#0190] mask-conv-l0/kernel:0                                        => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0191] mask-conv-l0/bias:0                                          => (256,)
[MaskRCNN] INFO    : [#0192] mask-conv-l1/kernel:0                                        => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0193] mask-conv-l1/bias:0                                          => (256,)
[MaskRCNN] INFO    : [#0194] mask-conv-l2/kernel:0                                        => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0195] mask-conv-l2/bias:0                                          => (256,)
[MaskRCNN] INFO    : [#0196] mask-conv-l3/kernel:0                                        => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0197] mask-conv-l3/bias:0                                          => (256,)
[MaskRCNN] INFO    : [#0198] conv5-mask/kernel:0                                          => (2, 2, 256, 256)
[MaskRCNN] INFO    : [#0199] conv5-mask/bias:0                                            => (256,)
[MaskRCNN] INFO    : [#0200] mask_fcn_logits/kernel:0                                     => (1, 1, 256, 19)
[MaskRCNN] INFO    : [#0201] mask_fcn_logits/bias:0                                       => (19,)
[MaskRCNN] INFO    : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    

[MaskRCNN] INFO    : # ============================================= #
[MaskRCNN] INFO    :                  Start Training                  
[MaskRCNN] INFO    : # %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% #

[GPU 00] Restoring pretrained weights (265 Tensors)
[MaskRCNN] INFO    : Pretrained weights loaded with success...
    
[MaskRCNN] INFO    : Saving checkpoints for 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt.
f099875a6032:174:700 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
f099875a6032:174:700 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
f099875a6032:174:700 [0] NCCL INFO NET/IB : No device found.
f099875a6032:174:700 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.14<0>
f099875a6032:174:700 [0] NCCL INFO Using network Socket
NCCL version 2.9.9+cuda11.3
f099875a6032:176:695 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
f099875a6032:176:695 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
f099875a6032:176:695 [0] NCCL INFO NET/IB : No device found.
f099875a6032:176:695 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.14<0>
f099875a6032:176:695 [0] NCCL INFO Using network Socket
f099875a6032:175:703 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
f099875a6032:175:703 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
f099875a6032:175:703 [0] NCCL INFO NET/IB : No device found.
f099875a6032:175:703 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.14<0>
f099875a6032:175:703 [0] NCCL INFO Using network Socket
f099875a6032:177:694 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
f099875a6032:177:694 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
f099875a6032:177:694 [0] NCCL INFO NET/IB : No device found.
f099875a6032:177:694 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.14<0>
f099875a6032:177:694 [0] NCCL INFO Using network Socket
f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000
f099875a6032:174:700 [0] NCCL INFO Channel 00/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Channel 01/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff
f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff
f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000

f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-2092d17be727dd49-1-0-1 (size 9637888)
f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2

f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-cbca92963930bd4c-1-1-2 (size 9637888)
f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:177:694 [0] NCCL INFO Channel 00 : 3[ca000] -> 0[31000] via direct shared memory
f099875a6032:177:694 [0] NCCL INFO Channel 01 : 3[ca000] -> 0[31000] via direct shared memory
f099875a6032:174:700 [0] NCCL INFO Channel 00 : 0[31000] -> 1[4b000] via direct shared memory
f099875a6032:174:700 [0] NCCL INFO Channel 01 : 0[31000] -> 1[4b000] via direct shared memory
f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying
f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying
f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying
f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying
f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying

f099875a6032:177:694 [0] include/socket.h:406 NCCL WARN Connect to 127.0.0.1<55501> failed : Connection refused
f099875a6032:177:694 [0] NCCL INFO bootstrap.cc:418 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:102 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2


f099875a6032:174:700 [0] include/socket.h:406 NCCL WARN Connect to 127.0.0.1<38407> failed : Connection refused
f099875a6032:174:700 [0] NCCL INFO bootstrap.cc:418 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:103 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff
f099875a6032:174:700 [0] NCCL INFO Channel 00/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Channel 01/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff
f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000
f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000

f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-a3e13d13d6ddcceb-0-0-1 (size 9637888)

f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-1e87390955df3bed-0-2-3 (size 9637888)
f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2

f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2

f099875a6032:174:700 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2

f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-4f18fe2e28e6acee-0-1-2 (size 9637888)
f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:174:700 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-41d5f6d8ab741cec-0-3-0 (size 9637888)
f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000
f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff
f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000
f099875a6032:174:700 [0] NCCL INFO Channel 00/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Channel 01/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff

f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2


f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-1b53bb07d0f0e07e-0-1-2 (size 9637888)
f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2

f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-eac1f5e2fde96f7d-0-2-3 (size 9637888)
f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2

f099875a6032:174:700 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:174:700 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-e10b3b2537e507c-0-3-0 (size 9637888)
f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-701bf9ed7ee8007b-0-0-1 (size 9637888)
f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000
f099875a6032:174:700 [0] NCCL INFO Channel 00/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Channel 01/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff
f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000
f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff


f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-4f70ce1a41ae6589-0-2-3 (size 9637888)
f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2

f099875a6032:174:700 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:174:700 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-72bf8be997434688-0-3-0 (size 9637888)

f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-d4cad224c2acf687-0-0-1 (size 9637888)
f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2

f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-8002933f14b5d68a-0-1-2 (size 9637888)
f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff
f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000
f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000
f099875a6032:174:700 [0] NCCL INFO Channel 00/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Channel 01/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff

f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-570d47364fcfa72-0-2-3 (size 9637888)
f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2

f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-8acad87de5fb8b70-0-0-1 (size 9637888)
f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2

f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-3602999838046b73-0-1-2 (size 9637888)
f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2

f099875a6032:174:700 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:174:700 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-28bf9242ba91db71-0-3-0 (size 9637888)
f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000
f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000
f099875a6032:174:700 [0] NCCL INFO Channel 00/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Channel 01/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff
f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff

f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-7769607d24ae892c-0-1-2 (size 9637888)
f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2

f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-46d79b5851a7182b-0-2-3 (size 9637888)
f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2

f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2

f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-cc319f62d2a5a929-0-0-1 (size 9637888)
f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2

f099875a6032:174:700 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:174:700 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-6a265927a73bf92a-0-3-0 (size 9637888)
f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2

f099875a6032:174:700 [0] NCCL INFO Channel 00/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Channel 01/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff
f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff
f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000
f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000

f099875a6032:174:700 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:174:700 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-cb571eacf1e9f55e-0-3-0 (size 9637888)
f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2

f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-2d6264e81d53a55d-0-0-1 (size 9637888)
f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2

f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-d89a26026f5c8560-0-1-2 (size 9637888)
f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2

f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-a80860dd9c55145f-0-2-3 (size 9637888)
f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff
f099875a6032:174:700 [0] NCCL INFO Channel 00/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Channel 01/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff
f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000
f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000

f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-14610b51518e9b34-0-2-3 (size 9637888)
f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2

f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-99bb0f5bd28d2c32-0-0-1 (size 9637888)
f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2

f099875a6032:174:700 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:174:700 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-37afc920a7237c33-0-3-0 (size 9637888)

f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-44f2d07624960c35-0-1-2 (size 9637888)
f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000
f099875a6032:174:700 [0] NCCL INFO Channel 00/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Channel 01/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff
f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff
f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000

f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-53be18d8f7231dc6-0-2-3 (size 9637888)
f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2

f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-d9181ce37821aec4-0-0-1 (size 9637888)
f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2

f099875a6032:174:700 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:174:700 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-770cd6a84cb7fec5-0-3-0 (size 9637888)
f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2

f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-844fddfdca2a8ec7-0-1-2 (size 9637888)
f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2

f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000
f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000
f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff
f099875a6032:174:700 [0] NCCL INFO Channel 00/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Channel 01/02 :    0   1   2   3
f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff

f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-52eb468ab391d306-0-0-1 (size 9637888)
f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2

f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-cd91428032934208-0-2-3 (size 9637888)
f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2

f099875a6032:174:700 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:174:700 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-f0e0004f88282307-0-3-0 (size 9637888)
f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2

f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2

f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-fe2307a5059ab309-0-1-2 (size 9637888)
f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2
f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2
f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
	 [[{{node DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 222, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 218, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 85, in run_executer
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 399, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
	 [[node DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0':
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 222, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 218, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 85, in run_executer
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 399, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 686, in mask_rcnn_model_fn
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 628, in _model_fn
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 475, in compute_gradients
    avg_grads = self._allreduce_grads(grads, vars)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 398, in allreduce_grads
    for grad in grads]
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 398, in <listcomp>
    for grad in grads]
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 241, in _allreduce_cond
    allreduce_fn, id_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1224, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1061, in BuildCondBranch
    original_result = fn()
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 235, in allreduce_fn
    return allreduce(tensor, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 123, in allreduce
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 121, in _allreduce
    ignore_name_scope=ignore_name_scope)
  File "<string>", line 102, in horovod_allreduce
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
	 [[{{node DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 222, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 218, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 85, in run_executer
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 399, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
	 [[node DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0':
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 222, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 218, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 85, in run_executer
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 399, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 686, in mask_rcnn_model_fn
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 628, in _model_fn
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 475, in compute_gradients
    avg_grads = self._allreduce_grads(grads, vars)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 398, in allreduce_grads
    for grad in grads]
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 398, in <listcomp>
    for grad in grads]
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 241, in _allreduce_cond
    allreduce_fn, id_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1224, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1061, in BuildCondBranch
    original_result = fn()
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 235, in allreduce_fn
    return allreduce(tensor, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 123, in allreduce
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 121, in _allreduce
    ignore_name_scope=ignore_name_scope)
  File "<string>", line 102, in horovod_allreduce
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

[MaskRCNN] INFO    : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
[MaskRCNN] INFO    :           Training Performance Summary           
[MaskRCNN] INFO    : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
DLL 2022-06-09 08:37:38.494784 -   : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ # 
DLL 2022-06-09 08:37:38.495021 -   :           Training Performance Summary            
DLL 2022-06-09 08:37:38.495066 -   : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ # 

DLL 2022-06-09 08:37:38.495115 -  Average_throughput : -1.0 samples/sec 
DLL 2022-06-09 08:37:38.495155 -  Total processed steps : 1 
DLL 2022-06-09 08:37:38.495211 -  Total_processing_time : 0h 00m 00s 
[MaskRCNN] INFO    : Average throughput: -1.0 samples/sec
[MaskRCNN] INFO    : Total processed steps: 1
[MaskRCNN] INFO    : Total processing time: 0h 00m 00s
DLL 2022-06-09 08:37:38.495463 -   : ==================== Metrics ==================== 
[MaskRCNN] INFO    : ==================== Metrics ====================

[MaskRCNN] ERROR   : Job finished with an uncaught exception: `FAILURE`
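
The traceback above is long, and in a multi-GPU run each rank can print its own copy, so the single line that names the failure (`UnknownError: ncclCommInitRank failed: unhandled system error`) is easy to miss. The snippet below is a minimal sketch for pulling such lines out of a saved log. It assumes the cell output has been captured as plain text; the helper name `extract_exceptions` and the `sample_log` string are hypothetical and not part of TAO Toolkit.

```python
import re

# Matches the final line of a Python traceback, for example
# "tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: ..."
EXCEPTION_LINE = re.compile(
    r"^(?P<type>[A-Za-z_][\w.]*(?:Error|Exception)): (?P<message>.+)$"
)


def extract_exceptions(log_text):
    """Return unique (exception_type, message) pairs in order of first appearance."""
    found = []
    for line in log_text.splitlines():
        match = EXCEPTION_LINE.match(line.strip())
        if match:
            pair = (match.group("type"), match.group("message"))
            if pair not in found:  # ranks often repeat the same error
                found.append(pair)
    return found


# Hypothetical excerpt; in practice, read the captured log from a file.
sample_log = """\
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
[MaskRCNN] ERROR   : Job finished with an uncaught exception: `FAILURE`
"""

for exc_type, message in extract_exceptions(sample_log):
    print(exc_type, "->", message)
```

This only surfaces the error line; it does not explain it. "unhandled system error" is a generic NCCL message, and NCCL's `NCCL_DEBUG=INFO` environment variable, if it can be set inside the training container, typically prints more detail about what failed during communicator initialization.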