print("For multi-GPU, change --gpus based on your machine.")
!tao mask_rcnn train -e $SPECS_DIR/maskrcnn_train_resnet50.txt \
                     -d $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                     -k $KEY \
                     --gpus 4

For multi-GPU, change --gpus based on your machine.
2022-06-09 08:35:48,975 [INFO] root: Registry: ['nvcr.io']
2022-06-09 08:35:49,232 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2022-06-09 08:35:49,403 [WARNING] tlt.components.docker_handler.docker_handler: Docker will run the commands as root. If you would like to retain your local host permissions, please add the "user":"UID:GID" in the DockerOptions portion of the "/home/sysadmin/.tao_mounts.json" file. You can obtain your users UID and GID by using the "id -u" and "id -g" commands on the terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet50.txt
Using TensorFlow backend.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet50.txt
Using TensorFlow backend.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet50.txt
Using TensorFlow backend.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet50.txt
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpv8s72p4s', '_tf_random_seed': 126, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1 inter_op_parallelism_threads: 26 gpu_options { allow_growth: true force_gpu_compatible: true } allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: TWO } } , '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': , '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[MaskRCNN] INFO : Horovod successfully initialized ...
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpbj7q43mf', '_tf_random_seed': 124, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1 inter_op_parallelism_threads: 26 gpu_options { allow_growth: true force_gpu_compatible: true } allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: TWO } } , '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': , '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpc626qosb', '_tf_random_seed': 123, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1 inter_op_parallelism_threads: 26 gpu_options { allow_growth: true force_gpu_compatible: true } allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: TWO } } , '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': , '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmp1ckrizwj', '_tf_random_seed': 125, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1 inter_op_parallelism_threads: 26 gpu_options { allow_growth: true force_gpu_compatible: true } allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: TWO } } , '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': , '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[MaskRCNN] INFO : Loading pretrained model...
WARNING:tensorflow:Entity ._prefetch_dataset at 0x7f855412c268> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of ._prefetch_dataset at 0x7f855412c268>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity ._prefetch_dataset at 0x7fe2c419a268> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of ._prefetch_dataset at 0x7fe2c419a268>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity ._prefetch_dataset at 0x7fbe64046268> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of ._prefetch_dataset at 0x7fbe64046268>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:349: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:349: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:349: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
WARNING:tensorflow:Entity could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of . Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of . Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:Entity could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of . Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Entity > could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of >. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Entity > could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of >. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Entity > could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of >. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity > could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of >.
Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity > could not be transformed and will be executed as-is.
Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of >. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code WARNING:tensorflow: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see: * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md * https://github.com/tensorflow/addons * https://github.com/tensorflow/io (for I/O related ops) If you depend on functionality not listed there, please file an issue. WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py:220: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py:223: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead. WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py:224: The name tf.keras.backend.get_session is deprecated. Please use tf.compat.v1.keras.backend.get_session instead. INFO:tensorflow:Done calling model_fn. [MaskRCNN] INFO : Create EncryptCheckpointSaverHook. [MaskRCNN] INFO : ================================= [MaskRCNN] INFO : Start training cycle 01 [MaskRCNN] INFO : ================================= [MaskRCNN] INFO : Using Dataset Sharding with Horovod WARNING:tensorflow:Entity ._prefetch_dataset at 0x7f14659569d8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of ._prefetch_dataset at 0x7f14659569d8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:349: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead. WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. 
Please use tf.compat.v1.set_random_seed instead. INFO:tensorflow:Done calling model_fn. WARNING:tensorflow:Entity could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of . Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical. INFO:tensorflow:Done calling model_fn. INFO:tensorflow:Graph was finalized. INFO:tensorflow:Calling model_fn. [MaskRCNN] INFO : *********************** [MaskRCNN] INFO : Building model graph... [MaskRCNN] INFO : *********************** WARNING:tensorflow:Entity > could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of >. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file.
If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code INFO:tensorflow:Graph was finalized. INFO:tensorflow:Graph was finalized. WARNING:tensorflow:Entity > could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of >. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code [MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_2/ [MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_3/ [MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_4/ [MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_5/ [MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_6/ WARNING:tensorflow:Entity > could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of >. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code INFO:tensorflow:Running local_init_op. 
WARNING:tensorflow:Entity > could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of >. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code INFO:tensorflow:Done running local_init_op. WARNING:tensorflow:Entity > could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of >. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code INFO:tensorflow:Running local_init_op. INFO:tensorflow:Running local_init_op. INFO:tensorflow:Done running local_init_op. INFO:tensorflow:Done running local_init_op. WARNING:tensorflow:Entity > could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of >. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code Parsing Inputs...
[MaskRCNN] INFO : [Training Compute Statistics] 517.5 GFLOPS/image 4 ops no flops stats due to incomplete shapes. WARNING:tensorflow:Entity > could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of >. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code WARNING:tensorflow: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see: * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md * https://github.com/tensorflow/addons * https://github.com/tensorflow/io (for I/O related ops) If you depend on functionality not listed there, please file an issue. INFO:tensorflow:Done calling model_fn. 
[MaskRCNN] WARNING : Checkpoint is missing variable [l2/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [l2/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [l3/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [l3/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [l4/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [l4/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [l5/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [l5/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d2/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d2/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d3/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d3/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d4/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d4/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d5/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d5/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [rpn/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [rpn/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [rpn-class/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [rpn-class/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [rpn-box/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [rpn-box/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [fc6/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [fc6/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [fc7/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [fc7/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [class-predict/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [class-predict/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [box-predict/kernel] [MaskRCNN] WARNING : Checkpoint is 
missing variable [box-predict/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l0/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l0/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l1/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l1/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l2/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l2/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l3/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l3/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [conv5-mask/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [conv5-mask/bias] [MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/kernel] [MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/bias] INFO:tensorflow:Graph was finalized. INFO:tensorflow:Running local_init_op. INFO:tensorflow:Done running local_init_op. 
fatal: not a git repository (or any of the parent directories): .git fatal: not a git repository (or any of the parent directories): .git [MaskRCNN] INFO : ============================ GIT REPOSITORY ============================ [MaskRCNN] INFO : BRANCH NAME: [MaskRCNN] INFO : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% [MaskRCNN] INFO : ============================ MODEL STATISTICS =========================== [MaskRCNN] INFO : # Model Weights: 28,650,305 [MaskRCNN] INFO : # Trainable Weights: 44,067,009 [MaskRCNN] INFO : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% [MaskRCNN] INFO : ============================ TRAINABLE VARIABLES ======================== [MaskRCNN] INFO : [#0001] conv1/kernel:0 => (7, 7, 3, 64) [MaskRCNN] INFO : [#0002] bn_conv1/gamma:0 => (64,) [MaskRCNN] INFO : [#0003] bn_conv1/beta:0 => (64,) [MaskRCNN] INFO : [#0004] block_1a_conv_1/kernel:0 => (1, 1, 64, 64) [MaskRCNN] INFO : [#0005] block_1a_bn_1/gamma:0 => (64,) [MaskRCNN] INFO : [#0006] block_1a_bn_1/beta:0 => (64,) [MaskRCNN] INFO : [#0007] block_1a_conv_2/kernel:0 => (3, 3, 64, 64) [MaskRCNN] INFO : [#0008] block_1a_bn_2/gamma:0 => (64,) [MaskRCNN] INFO : [#0009] block_1a_bn_2/beta:0 => (64,) [MaskRCNN] INFO : [#0010] block_1a_conv_3/kernel:0 => (1, 1, 64, 256) [MaskRCNN] INFO : [#0011] block_1a_bn_3/gamma:0 => (256,) [MaskRCNN] INFO : [#0012] block_1a_bn_3/beta:0 => (256,) [MaskRCNN] INFO : [#0013] block_1a_conv_shortcut/kernel:0 => (1, 1, 64, 256) [MaskRCNN] INFO : [#0014] block_1a_bn_shortcut/gamma:0 => (256,) [MaskRCNN] INFO : [#0015] block_1a_bn_shortcut/beta:0 => (256,) [MaskRCNN] INFO : [#0016] block_1b_conv_1/kernel:0 => (1, 1, 256, 64) [MaskRCNN] INFO : [#0017] block_1b_bn_1/gamma:0 => (64,) [MaskRCNN] INFO : [#0018] block_1b_bn_1/beta:0 => (64,) [MaskRCNN] INFO : [#0019] block_1b_conv_2/kernel:0 => (3, 3, 64, 64) [MaskRCNN] INFO : [#0020] block_1b_bn_2/gamma:0 => (64,) [MaskRCNN] INFO : [#0021] block_1b_bn_2/beta:0 => 
(64,) [MaskRCNN] INFO : [#0022] block_1b_conv_3/kernel:0 => (1, 1, 64, 256) [MaskRCNN] INFO : [#0023] block_1b_bn_3/gamma:0 => (256,) [MaskRCNN] INFO : [#0024] block_1b_bn_3/beta:0 => (256,) [MaskRCNN] INFO : [#0025] block_1c_conv_1/kernel:0 => (1, 1, 256, 64) [MaskRCNN] INFO : [#0026] block_1c_bn_1/gamma:0 => (64,) [MaskRCNN] INFO : [#0027] block_1c_bn_1/beta:0 => (64,) [MaskRCNN] INFO : [#0028] block_1c_conv_2/kernel:0 => (3, 3, 64, 64) [MaskRCNN] INFO : [#0029] block_1c_bn_2/gamma:0 => (64,) [MaskRCNN] INFO : [#0030] block_1c_bn_2/beta:0 => (64,) [MaskRCNN] INFO : [#0031] block_1c_conv_3/kernel:0 => (1, 1, 64, 256) [MaskRCNN] INFO : [#0032] block_1c_bn_3/gamma:0 => (256,) [MaskRCNN] INFO : [#0033] block_1c_bn_3/beta:0 => (256,) [MaskRCNN] INFO : [#0034] block_2a_conv_1/kernel:0 => (1, 1, 256, 128) [MaskRCNN] INFO : [#0035] block_2a_bn_1/gamma:0 => (128,) [MaskRCNN] INFO : [#0036] block_2a_bn_1/beta:0 => (128,) [MaskRCNN] INFO : [#0037] block_2a_conv_2/kernel:0 => (3, 3, 128, 128) [MaskRCNN] INFO : [#0038] block_2a_bn_2/gamma:0 => (128,) [MaskRCNN] INFO : [#0039] block_2a_bn_2/beta:0 => (128,) [MaskRCNN] INFO : [#0040] block_2a_conv_3/kernel:0 => (1, 1, 128, 512) [MaskRCNN] INFO : [#0041] block_2a_bn_3/gamma:0 => (512,) [MaskRCNN] INFO : [#0042] block_2a_bn_3/beta:0 => (512,) [MaskRCNN] INFO : [#0043] block_2a_conv_shortcut/kernel:0 => (1, 1, 256, 512) [MaskRCNN] INFO : [#0044] block_2a_bn_shortcut/gamma:0 => (512,) [MaskRCNN] INFO : [#0045] block_2a_bn_shortcut/beta:0 => (512,) [MaskRCNN] INFO : [#0046] block_2b_conv_1/kernel:0 => (1, 1, 512, 128) [MaskRCNN] INFO : [#0047] block_2b_bn_1/gamma:0 => (128,) [MaskRCNN] INFO : [#0048] block_2b_bn_1/beta:0 => (128,) [MaskRCNN] INFO : [#0049] block_2b_conv_2/kernel:0 => (3, 3, 128, 128) [MaskRCNN] INFO : [#0050] block_2b_bn_2/gamma:0 => (128,) [MaskRCNN] INFO : [#0051] block_2b_bn_2/beta:0 => (128,) [MaskRCNN] INFO : [#0052] block_2b_conv_3/kernel:0 => (1, 1, 128, 512) [MaskRCNN] INFO : [#0053] block_2b_bn_3/gamma:0 => 
(512,) [MaskRCNN] INFO : [#0054] block_2b_bn_3/beta:0 => (512,) [MaskRCNN] INFO : [#0055] block_2c_conv_1/kernel:0 => (1, 1, 512, 128) [MaskRCNN] INFO : [#0056] block_2c_bn_1/gamma:0 => (128,) [MaskRCNN] INFO : [#0057] block_2c_bn_1/beta:0 => (128,) [MaskRCNN] INFO : [#0058] block_2c_conv_2/kernel:0 => (3, 3, 128, 128) [MaskRCNN] INFO : [#0059] block_2c_bn_2/gamma:0 => (128,) [MaskRCNN] INFO : [#0060] block_2c_bn_2/beta:0 => (128,) [MaskRCNN] INFO : [#0061] block_2c_conv_3/kernel:0 => (1, 1, 128, 512) [MaskRCNN] INFO : [#0062] block_2c_bn_3/gamma:0 => (512,) [MaskRCNN] INFO : [#0063] block_2c_bn_3/beta:0 => (512,) [MaskRCNN] INFO : [#0064] block_2d_conv_1/kernel:0 => (1, 1, 512, 128) [MaskRCNN] INFO : [#0065] block_2d_bn_1/gamma:0 => (128,) [MaskRCNN] INFO : [#0066] block_2d_bn_1/beta:0 => (128,) [MaskRCNN] INFO : [#0067] block_2d_conv_2/kernel:0 => (3, 3, 128, 128) [MaskRCNN] INFO : [#0068] block_2d_bn_2/gamma:0 => (128,) [MaskRCNN] INFO : [#0069] block_2d_bn_2/beta:0 => (128,) [MaskRCNN] INFO : [#0070] block_2d_conv_3/kernel:0 => (1, 1, 128, 512) [MaskRCNN] INFO : [#0071] block_2d_bn_3/gamma:0 => (512,) [MaskRCNN] INFO : [#0072] block_2d_bn_3/beta:0 => (512,) [MaskRCNN] INFO : [#0073] block_3a_conv_1/kernel:0 => (1, 1, 512, 256) [MaskRCNN] INFO : [#0074] block_3a_bn_1/gamma:0 => (256,) [MaskRCNN] INFO : [#0075] block_3a_bn_1/beta:0 => (256,) [MaskRCNN] INFO : [#0076] block_3a_conv_2/kernel:0 => (3, 3, 256, 256) [MaskRCNN] INFO : [#0077] block_3a_bn_2/gamma:0 => (256,) [MaskRCNN] INFO : [#0078] block_3a_bn_2/beta:0 => (256,) [MaskRCNN] INFO : [#0079] block_3a_conv_3/kernel:0 => (1, 1, 256, 1024) [MaskRCNN] INFO : [#0080] block_3a_bn_3/gamma:0 => (1024,) [MaskRCNN] INFO : [#0081] block_3a_bn_3/beta:0 => (1024,) [MaskRCNN] INFO : [#0082] block_3a_conv_shortcut/kernel:0 => (1, 1, 512, 1024) [MaskRCNN] INFO : [#0083] block_3a_bn_shortcut/gamma:0 => (1024,) [MaskRCNN] INFO : [#0084] block_3a_bn_shortcut/beta:0 => (1024,) [MaskRCNN] INFO : [#0085] 
block_3b_conv_1/kernel:0 => (1, 1, 1024, 256) [MaskRCNN] INFO : [#0086] block_3b_bn_1/gamma:0 => (256,) [MaskRCNN] INFO : [#0087] block_3b_bn_1/beta:0 => (256,) [MaskRCNN] INFO : [#0088] block_3b_conv_2/kernel:0 => (3, 3, 256, 256) [MaskRCNN] INFO : [#0089] block_3b_bn_2/gamma:0 => (256,) [MaskRCNN] INFO : [#0090] block_3b_bn_2/beta:0 => (256,) [MaskRCNN] INFO : [#0091] block_3b_conv_3/kernel:0 => (1, 1, 256, 1024) [MaskRCNN] INFO : [#0092] block_3b_bn_3/gamma:0 => (1024,) [MaskRCNN] INFO : [#0093] block_3b_bn_3/beta:0 => (1024,) [MaskRCNN] INFO : [#0094] block_3c_conv_1/kernel:0 => (1, 1, 1024, 256) [MaskRCNN] INFO : [#0095] block_3c_bn_1/gamma:0 => (256,) [MaskRCNN] INFO : [#0096] block_3c_bn_1/beta:0 => (256,) [MaskRCNN] INFO : [#0097] block_3c_conv_2/kernel:0 => (3, 3, 256, 256) [MaskRCNN] INFO : [#0098] block_3c_bn_2/gamma:0 => (256,) [MaskRCNN] INFO : [#0099] block_3c_bn_2/beta:0 => (256,) [MaskRCNN] INFO : [#0100] block_3c_conv_3/kernel:0 => (1, 1, 256, 1024) [MaskRCNN] INFO : [#0101] block_3c_bn_3/gamma:0 => (1024,) [MaskRCNN] INFO : [#0102] block_3c_bn_3/beta:0 => (1024,) [MaskRCNN] INFO : [#0103] block_3d_conv_1/kernel:0 => (1, 1, 1024, 256) [MaskRCNN] INFO : [#0104] block_3d_bn_1/gamma:0 => (256,) [MaskRCNN] INFO : [#0105] block_3d_bn_1/beta:0 => (256,) [MaskRCNN] INFO : [#0106] block_3d_conv_2/kernel:0 => (3, 3, 256, 256) [MaskRCNN] INFO : [#0107] block_3d_bn_2/gamma:0 => (256,) [MaskRCNN] INFO : [#0108] block_3d_bn_2/beta:0 => (256,) [MaskRCNN] INFO : [#0109] block_3d_conv_3/kernel:0 => (1, 1, 256, 1024) [MaskRCNN] INFO : [#0110] block_3d_bn_3/gamma:0 => (1024,) [MaskRCNN] INFO : [#0111] block_3d_bn_3/beta:0 => (1024,) [MaskRCNN] INFO : [#0112] block_3e_conv_1/kernel:0 => (1, 1, 1024, 256) [MaskRCNN] INFO : [#0113] block_3e_bn_1/gamma:0 => (256,) [MaskRCNN] INFO : [#0114] block_3e_bn_1/beta:0 => (256,) [MaskRCNN] INFO : [#0115] block_3e_conv_2/kernel:0 => (3, 3, 256, 256) [MaskRCNN] INFO : [#0116] block_3e_bn_2/gamma:0 => (256,) [MaskRCNN] INFO : 
[#0117] block_3e_bn_2/beta:0 => (256,) [MaskRCNN] INFO : [#0118] block_3e_conv_3/kernel:0 => (1, 1, 256, 1024) [MaskRCNN] INFO : [#0119] block_3e_bn_3/gamma:0 => (1024,) [MaskRCNN] INFO : [#0120] block_3e_bn_3/beta:0 => (1024,) [MaskRCNN] INFO : [#0121] block_3f_conv_1/kernel:0 => (1, 1, 1024, 256) [MaskRCNN] INFO : [#0122] block_3f_bn_1/gamma:0 => (256,) [MaskRCNN] INFO : [#0123] block_3f_bn_1/beta:0 => (256,) [MaskRCNN] INFO : [#0124] block_3f_conv_2/kernel:0 => (3, 3, 256, 256) [MaskRCNN] INFO : [#0125] block_3f_bn_2/gamma:0 => (256,) [MaskRCNN] INFO : [#0126] block_3f_bn_2/beta:0 => (256,) [MaskRCNN] INFO : [#0127] block_3f_conv_3/kernel:0 => (1, 1, 256, 1024) [MaskRCNN] INFO : [#0128] block_3f_bn_3/gamma:0 => (1024,) [MaskRCNN] INFO : [#0129] block_3f_bn_3/beta:0 => (1024,) [MaskRCNN] INFO : [#0130] block_4a_conv_1/kernel:0 => (1, 1, 1024, 512) [MaskRCNN] INFO : [#0131] block_4a_bn_1/gamma:0 => (512,) [MaskRCNN] INFO : [#0132] block_4a_bn_1/beta:0 => (512,) [MaskRCNN] INFO : [#0133] block_4a_conv_2/kernel:0 => (3, 3, 512, 512) [MaskRCNN] INFO : [#0134] block_4a_bn_2/gamma:0 => (512,) [MaskRCNN] INFO : [#0135] block_4a_bn_2/beta:0 => (512,) [MaskRCNN] INFO : [#0136] block_4a_conv_3/kernel:0 => (1, 1, 512, 2048) [MaskRCNN] INFO : [#0137] block_4a_bn_3/gamma:0 => (2048,) [MaskRCNN] INFO : [#0138] block_4a_bn_3/beta:0 => (2048,) [MaskRCNN] INFO : [#0139] block_4a_conv_shortcut/kernel:0 => (1, 1, 1024, 2048) [MaskRCNN] INFO : [#0140] block_4a_bn_shortcut/gamma:0 => (2048,) [MaskRCNN] INFO : [#0141] block_4a_bn_shortcut/beta:0 => (2048,) [MaskRCNN] INFO : [#0142] block_4b_conv_1/kernel:0 => (1, 1, 2048, 512) [MaskRCNN] INFO : [#0143] block_4b_bn_1/gamma:0 => (512,) [MaskRCNN] INFO : [#0144] block_4b_bn_1/beta:0 => (512,) [MaskRCNN] INFO : [#0145] block_4b_conv_2/kernel:0 => (3, 3, 512, 512) [MaskRCNN] INFO : [#0146] block_4b_bn_2/gamma:0 => (512,) [MaskRCNN] INFO : [#0147] block_4b_bn_2/beta:0 => (512,) [MaskRCNN] INFO : [#0148] block_4b_conv_3/kernel:0 => (1, 1, 
512, 2048) [MaskRCNN] INFO : [#0149] block_4b_bn_3/gamma:0 => (2048,) [MaskRCNN] INFO : [#0150] block_4b_bn_3/beta:0 => (2048,) [MaskRCNN] INFO : [#0151] block_4c_conv_1/kernel:0 => (1, 1, 2048, 512) [MaskRCNN] INFO : [#0152] block_4c_bn_1/gamma:0 => (512,) [MaskRCNN] INFO : [#0153] block_4c_bn_1/beta:0 => (512,) [MaskRCNN] INFO : [#0154] block_4c_conv_2/kernel:0 => (3, 3, 512, 512) [MaskRCNN] INFO : [#0155] block_4c_bn_2/gamma:0 => (512,) [MaskRCNN] INFO : [#0156] block_4c_bn_2/beta:0 => (512,) [MaskRCNN] INFO : [#0157] block_4c_conv_3/kernel:0 => (1, 1, 512, 2048) [MaskRCNN] INFO : [#0158] block_4c_bn_3/gamma:0 => (2048,) [MaskRCNN] INFO : [#0159] block_4c_bn_3/beta:0 => (2048,) [MaskRCNN] INFO : [#0160] l2/kernel:0 => (1, 1, 256, 256) [MaskRCNN] INFO : [#0161] l2/bias:0 => (256,) [MaskRCNN] INFO : [#0162] l3/kernel:0 => (1, 1, 512, 256) [MaskRCNN] INFO : [#0163] l3/bias:0 => (256,) [MaskRCNN] INFO : [#0164] l4/kernel:0 => (1, 1, 1024, 256) [MaskRCNN] INFO : [#0165] l4/bias:0 => (256,) [MaskRCNN] INFO : [#0166] l5/kernel:0 => (1, 1, 2048, 256) [MaskRCNN] INFO : [#0167] l5/bias:0 => (256,) [MaskRCNN] INFO : [#0168] post_hoc_d2/kernel:0 => (3, 3, 256, 256) [MaskRCNN] INFO : [#0169] post_hoc_d2/bias:0 => (256,) [MaskRCNN] INFO : [#0170] post_hoc_d3/kernel:0 => (3, 3, 256, 256) [MaskRCNN] INFO : [#0171] post_hoc_d3/bias:0 => (256,) [MaskRCNN] INFO : [#0172] post_hoc_d4/kernel:0 => (3, 3, 256, 256) [MaskRCNN] INFO : [#0173] post_hoc_d4/bias:0 => (256,) [MaskRCNN] INFO : [#0174] post_hoc_d5/kernel:0 => (3, 3, 256, 256) [MaskRCNN] INFO : [#0175] post_hoc_d5/bias:0 => (256,) [MaskRCNN] INFO : [#0176] rpn/kernel:0 => (3, 3, 256, 256) [MaskRCNN] INFO : [#0177] rpn/bias:0 => (256,) [MaskRCNN] INFO : [#0178] rpn-class/kernel:0 => (1, 1, 256, 3) [MaskRCNN] INFO : [#0179] rpn-class/bias:0 => (3,) [MaskRCNN] INFO : [#0180] rpn-box/kernel:0 => (1, 1, 256, 12) [MaskRCNN] INFO : [#0181] rpn-box/bias:0 => (12,) [MaskRCNN] INFO : [#0182] fc6/kernel:0 => (12544, 1024) [MaskRCNN] INFO 
: [#0183] fc6/bias:0 => (1024,) [MaskRCNN] INFO : [#0184] fc7/kernel:0 => (1024, 1024) [MaskRCNN] INFO : [#0185] fc7/bias:0 => (1024,) [MaskRCNN] INFO : [#0186] class-predict/kernel:0 => (1024, 19) [MaskRCNN] INFO : [#0187] class-predict/bias:0 => (19,) [MaskRCNN] INFO : [#0188] box-predict/kernel:0 => (1024, 76) [MaskRCNN] INFO : [#0189] box-predict/bias:0 => (76,) [MaskRCNN] INFO : [#0190] mask-conv-l0/kernel:0 => (3, 3, 256, 256) [MaskRCNN] INFO : [#0191] mask-conv-l0/bias:0 => (256,) [MaskRCNN] INFO : [#0192] mask-conv-l1/kernel:0 => (3, 3, 256, 256) [MaskRCNN] INFO : [#0193] mask-conv-l1/bias:0 => (256,) [MaskRCNN] INFO : [#0194] mask-conv-l2/kernel:0 => (3, 3, 256, 256) [MaskRCNN] INFO : [#0195] mask-conv-l2/bias:0 => (256,) [MaskRCNN] INFO : [#0196] mask-conv-l3/kernel:0 => (3, 3, 256, 256) [MaskRCNN] INFO : [#0197] mask-conv-l3/bias:0 => (256,) [MaskRCNN] INFO : [#0198] conv5-mask/kernel:0 => (2, 2, 256, 256) [MaskRCNN] INFO : [#0199] conv5-mask/bias:0 => (256,) [MaskRCNN] INFO : [#0200] mask_fcn_logits/kernel:0 => (1, 1, 256, 19) [MaskRCNN] INFO : [#0201] mask_fcn_logits/bias:0 => (19,) [MaskRCNN] INFO : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% [MaskRCNN] INFO : # ============================================= # [MaskRCNN] INFO : Start Training [MaskRCNN] INFO : # %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% # [GPU 00] Restoring pretrained weights (265 Tensors) [MaskRCNN] INFO : Pretrained weights loaded with success... [MaskRCNN] INFO : Saving checkpoints for 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt. f099875a6032:174:700 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0> f099875a6032:174:700 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation f099875a6032:174:700 [0] NCCL INFO NET/IB : No device found. 
f099875a6032:174:700 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.14<0> f099875a6032:174:700 [0] NCCL INFO Using network Socket NCCL version 2.9.9+cuda11.3 f099875a6032:176:695 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0> f099875a6032:176:695 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation f099875a6032:176:695 [0] NCCL INFO NET/IB : No device found. f099875a6032:176:695 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.14<0> f099875a6032:176:695 [0] NCCL INFO Using network Socket f099875a6032:175:703 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0> f099875a6032:175:703 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation f099875a6032:175:703 [0] NCCL INFO NET/IB : No device found. f099875a6032:175:703 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.14<0> f099875a6032:175:703 [0] NCCL INFO Using network Socket f099875a6032:177:694 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0> f099875a6032:177:694 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation f099875a6032:177:694 [0] NCCL INFO NET/IB : No device found. 
f099875a6032:177:694 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.14<0> f099875a6032:177:694 [0] NCCL INFO Using network Socket f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000 f099875a6032:174:700 [0] NCCL INFO Channel 00/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Channel 01/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000 f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-2092d17be727dd49-1-0-1 (size 9637888) f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-cbca92963930bd4c-1-1-2 (size 9637888) f099875a6032:176:695 [0] NCCL INFO 
transport/shm.cc:100 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:177:694 [0] NCCL INFO Channel 00 : 3[ca000] -> 0[31000] via direct shared memory f099875a6032:177:694 [0] NCCL INFO Channel 01 : 3[ca000] -> 0[31000] via direct shared memory f099875a6032:174:700 [0] NCCL INFO Channel 00 : 0[31000] -> 1[4b000] via direct shared memory f099875a6032:174:700 [0] NCCL INFO Channel 01 : 0[31000] -> 1[4b000] via direct shared memory f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL 
INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] NCCL INFO Call to connect returned Connection 
refused, retrying f099875a6032:174:700 [0] NCCL INFO Call to connect returned Connection refused, retrying f099875a6032:177:694 [0] include/socket.h:406 NCCL WARN Connect to 127.0.0.1<55501> failed : Connection refused f099875a6032:177:694 [0] NCCL INFO bootstrap.cc:418 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:102 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:174:700 [0] include/socket.h:406 NCCL WARN Connect to 127.0.0.1<38407> failed : Connection refused f099875a6032:174:700 [0] NCCL INFO bootstrap.cc:418 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:103 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff f099875a6032:174:700 [0] NCCL INFO Channel 00/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Channel 01/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000 f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000 f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error 
while creating shared memory segment nccl-shm-recv-a3e13d13d6ddcceb-0-0-1 (size 9637888) f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-1e87390955df3bed-0-2-3 (size 9637888) f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:174:700 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-4f18fe2e28e6acee-0-1-2 (size 9637888) f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:174:700 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-41d5f6d8ab741cec-0-3-0 (size 9637888) f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:176:695 [0] 
NCCL INFO transport.cc:34 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000 f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000 f099875a6032:174:700 [0] NCCL INFO Channel 00/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Channel 01/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-1b53bb07d0f0e07e-0-1-2 (size 9637888) f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:84 
-> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-eac1f5e2fde96f7d-0-2-3 (size 9637888) f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:174:700 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:174:700 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-e10b3b2537e507c-0-3-0 (size 9637888) f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-701bf9ed7ee8007b-0-0-1 (size 9637888) f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:174:700 [0] 
NCCL INFO init.cc:916 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000 f099875a6032:174:700 [0] NCCL INFO Channel 00/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Channel 01/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000 f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-4f70ce1a41ae6589-0-2-3 (size 9637888) f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:174:700 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:174:700 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-72bf8be997434688-0-3-0 (size 9637888) f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to 
posix_fallocate failed : No space left on device f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-d4cad224c2acf687-0-0-1 (size 9637888) f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-8002933f14b5d68a-0-1-2 (size 9637888) f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 f099875a6032:176:695 [0] NCCL INFO Setting 
affinity for GPU 2 to ff,ffffc000,000fffff,fc000000 f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000 f099875a6032:174:700 [0] NCCL INFO Channel 00/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Channel 01/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-570d47364fcfa72-0-2-3 (size 9637888) f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-8acad87de5fb8b70-0-0-1 (size 9637888) f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-3602999838046b73-0-1-2 (size 
9637888) f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:174:700 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:174:700 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-28bf9242ba91db71-0-3-0 (size 9637888) f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000 f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000 f099875a6032:174:700 [0] NCCL INFO Channel 00/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Channel 01/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 
f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-7769607d24ae892c-0-1-2 (size 9637888) f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-46d79b5851a7182b-0-2-3 (size 9637888) f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-cc319f62d2a5a929-0-0-1 (size 9637888) f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:174:700 [0] 
include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:174:700 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-6a265927a73bf92a-0-3-0 (size 9637888) f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:174:700 [0] NCCL INFO Channel 00/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Channel 01/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000 f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000 f099875a6032:174:700 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:174:700 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-cb571eacf1e9f55e-0-3-0 (size 9637888) 
f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-2d6264e81d53a55d-0-0-1 (size 9637888) f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-d89a26026f5c8560-0-1-2 (size 9637888) f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-a80860dd9c55145f-0-2-3 (size 9637888) f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2 
f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff f099875a6032:174:700 [0] NCCL INFO Channel 00/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Channel 01/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000 f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000 f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-14610b51518e9b34-0-2-3 (size 9637888) f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:175:703 [0] NCCL 
INFO include/shm.h:41 -> 2 f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-99bb0f5bd28d2c32-0-0-1 (size 9637888) f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:174:700 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:174:700 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-37afc920a7237c33-0-3-0 (size 9637888) f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-44f2d07624960c35-0-1-2 (size 9637888) f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2 
f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000 f099875a6032:174:700 [0] NCCL INFO Channel 00/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Channel 01/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff f099875a6032:175:703 [0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000 f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-53be18d8f7231dc6-0-2-3 (size 9637888) f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-d9181ce37821aec4-0-0-1 (size 9637888) f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:175:703 [0] 
NCCL INFO transport.cc:84 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:174:700 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:174:700 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-770cd6a84cb7fec5-0-3-0 (size 9637888) f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-844fddfdca2a8ec7-0-1-2 (size 9637888) f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:176:695 [0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 f099875a6032:176:695 [0] NCCL INFO Setting affinity for GPU 2 to ff,ffffc000,000fffff,fc000000 f099875a6032:177:694 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 f099875a6032:177:694 [0] NCCL INFO Setting affinity for GPU 3 to ff,ffffc000,000fffff,fc000000 f099875a6032:175:703 [0] 
NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 f099875a6032:175:703 [0] NCCL INFO Setting affinity for GPU 1 to 3fff,fff00000,03ffffff f099875a6032:174:700 [0] NCCL INFO Channel 00/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Channel 01/02 : 0 1 2 3 f099875a6032:174:700 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 f099875a6032:174:700 [0] NCCL INFO Setting affinity for GPU 0 to 3fff,fff00000,03ffffff f099875a6032:175:703 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:175:703 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:175:703 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-52eb468ab391d306-0-0-1 (size 9637888) f099875a6032:175:703 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:175:703 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:175:703 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:177:694 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:177:694 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:177:694 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-cd91428032934208-0-2-3 (size 9637888) f099875a6032:177:694 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:177:694 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:177:694 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:174:700 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:174:700 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:174:700 
[0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-f0e0004f88282307-0-3-0 (size 9637888) f099875a6032:174:700 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:174:700 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:176:695 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device f099875a6032:176:695 [0] NCCL INFO include/shm.h:41 -> 2 f099875a6032:176:695 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-fe2307a5059ab309-0-1-2 (size 9637888) f099875a6032:176:695 [0] NCCL INFO transport/shm.cc:100 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:34 -> 2 f099875a6032:176:695 [0] NCCL INFO transport.cc:84 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:742 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:867 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:176:695 [0] NCCL INFO init.cc:916 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:903 -> 2 f099875a6032:174:700 [0] NCCL INFO init.cc:916 -> 2 Traceback (most recent call last): File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call return fn(*args) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn target_list, run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error [[{{node DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0}}]] During handling of the above exception, another exception occurred: Traceback (most recent call last): File 
"/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 222, in File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 218, in main File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 85, in run_executer File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 399, in train_and_eval File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train loss = self._train_model(input_fn, hooks, saving_listeners) File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model return self._train_model_default(input_fn, hooks, saving_listeners) File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default saving_listeners) File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss]) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run run_metadata=run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run run_metadata=run_metadata) File 
"/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run raise six.reraise(*original_exc_info) File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise raise value File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run return self._sess.run(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run run_metadata=run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run return self._sess.run(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run run_metadata_ptr) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run feed_dict_tensor, options, run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error [[node DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]] Original stack trace for 'DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0': File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 222, in File 
"/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 218, in main File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 85, in run_executer File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 399, in train_and_eval File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train loss = self._train_model(input_fn, hooks, saving_listeners) File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model return self._train_model_default(input_fn, hooks, saving_listeners) File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default features, labels, ModeKeys.TRAIN, self.config) File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn model_fn_results = self._model_fn(features=features, **kwargs) File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 686, in mask_rcnn_model_fn File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 628, in _model_fn File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 475, in compute_gradients 
avg_grads = self._allreduce_grads(grads, vars) File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 398, in allreduce_grads for grad in grads] File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 398, in for grad in grads] File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 241, in _allreduce_cond allreduce_fn, id_fn) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func return func(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1224, in cond orig_res_t, res_t = context_t.BuildCondBranch(true_fn) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1061, in BuildCondBranch original_result = fn() File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 235, in allreduce_fn return allreduce(tensor, *args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 123, in allreduce name=name) File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 121, in _allreduce ignore_name_scope=ignore_name_scope) File "", line 102, in horovod_allreduce File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper op_def=op_def) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func return func(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op attrs, op_def, compute_device) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal op_def=op_def) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__ self._traceback = 
tf_stack.extract_stack() Traceback (most recent call last): File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call return fn(*args) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn target_list, run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error [[{{node DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0}}]] During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 222, in File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 218, in main File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 85, in run_executer File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 399, in train_and_eval File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train loss = self._train_model(input_fn, hooks, saving_listeners) File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in 
_train_model return self._train_model_default(input_fn, hooks, saving_listeners) File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default saving_listeners) File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss]) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run run_metadata=run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run run_metadata=run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run raise six.reraise(*original_exc_info) File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise raise value File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run return self._sess.run(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run run_metadata=run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run return self._sess.run(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run run_metadata_ptr) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run feed_dict_tensor, options, run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call raise type(e)(node_def, op, message) 
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error [[node DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]] Original stack trace for 'DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0': File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 222, in File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 218, in main File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 85, in run_executer File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 399, in train_and_eval File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train loss = self._train_model(input_fn, hooks, saving_listeners) File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model return self._train_model_default(input_fn, hooks, saving_listeners) File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default features, labels, ModeKeys.TRAIN, self.config) File 
"/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn model_fn_results = self._model_fn(features=features, **kwargs) File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 686, in mask_rcnn_model_fn File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 628, in _model_fn File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 475, in compute_gradients avg_grads = self._allreduce_grads(grads, vars) File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 398, in allreduce_grads for grad in grads] File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 398, in for grad in grads] File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 241, in _allreduce_cond allreduce_fn, id_fn) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func return func(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1224, in cond orig_res_t, res_t = context_t.BuildCondBranch(true_fn) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1061, in BuildCondBranch original_result = fn() File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 235, in allreduce_fn return allreduce(tensor, *args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 123, in allreduce name=name) File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 121, in _allreduce 
ignore_name_scope=ignore_name_scope) File "", line 102, in horovod_allreduce File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper op_def=op_def) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func return func(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op attrs, op_def, compute_device) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal op_def=op_def) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__ self._traceback = tf_stack.extract_stack() Traceback (most recent call last): File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call return fn(*args) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn target_list, run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error [[{{node DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0}}]] During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 222, in File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 218, in 
main File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 85, in run_executer File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 399, in train_and_eval File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train loss = self._train_model(input_fn, hooks, saving_listeners) File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model return self._train_model_default(input_fn, hooks, saving_listeners) File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default saving_listeners) File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss]) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run run_metadata=run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run run_metadata=run_metadata) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run raise six.reraise(*original_exc_info) File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise raise value File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run return self._sess.run(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 
1418, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
  [[node DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0':
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 222, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 218, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 85, in run_executer
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 399, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 686, in mask_rcnn_model_fn
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 628, in _model_fn
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 475, in compute_gradients
    avg_grads = self._allreduce_grads(grads, vars)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 398, in allreduce_grads
    for grad in grads]
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 398, in <listcomp>
    for grad in grads]
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 241, in _allreduce_cond
    allreduce_fn, id_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1224, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1061, in BuildCondBranch
    original_result = fn()
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 235, in allreduce_fn
    return allreduce(tensor, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 123, in allreduce
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 121, in _allreduce
    ignore_name_scope=ignore_name_scope)
  File "<string>", line 102, in horovod_allreduce
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
  [[{{node DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 222, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 218, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 85, in run_executer
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 399, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
  [[node DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'DistributedMomentumOptimizer_Allreduce/cond_186/HorovodAllreduce_gradients_class_predict_BiasAdd_grad_tuple_control_dependency_1_0':
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 222, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 218, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 85, in run_executer
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 399, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 686, in mask_rcnn_model_fn
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 628, in _model_fn
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 475, in compute_gradients
    avg_grads = self._allreduce_grads(grads, vars)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 398, in allreduce_grads
    for grad in grads]
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 398, in <listcomp>
    for grad in grads]
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 241, in _allreduce_cond
    allreduce_fn, id_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1224, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1061, in BuildCondBranch
    original_result = fn()
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 235, in allreduce_fn
    return allreduce(tensor, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 123, in allreduce
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 121, in _allreduce
    ignore_name_scope=ignore_name_scope)
  File "<string>", line 102, in horovod_allreduce
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
[MaskRCNN] INFO : Training Performance Summary
[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
DLL 2022-06-09 08:37:38.494784 - : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
DLL 2022-06-09 08:37:38.495021 - : Training Performance Summary
DLL 2022-06-09 08:37:38.495066 - : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
DLL 2022-06-09 08:37:38.495115 - Average_throughput : -1.0 samples/sec
DLL 2022-06-09 08:37:38.495155 - Total processed steps : 1
DLL 2022-06-09 08:37:38.495211 - Total_processing_time : 0h 00m 00s
[MaskRCNN] INFO : Average throughput: -1.0 samples/sec
[MaskRCNN] INFO : Total processed steps: 1
[MaskRCNN] INFO : Total processing time: 0h 00m 00s
DLL 2022-06-09 08:37:38.495463 - : ==================== Metrics ====================
[MaskRCNN] INFO : ==================== Metrics ====================
[MaskRCNN] ERROR : Job finished with an uncaught exception: `FAILURE`