Cannot reshape a tensor with 25690112 elements to shape [256,256,14,14]

I have upgraded to the latest TAO version, nvidia-tao==0.1.24.
When I train with multiple GPUs, I get the following error:

Cannot reshape a tensor with 25690112 elements to shape [256,256,14,14] (12845056 elements) for 'mask_head_reshape_1/mask_head_reshape_1' (op: 'Reshape') with input shapes: [4,128,256,14,14], [4] and with input tensors computed as partial shapes: input[1] = [256,256,14,14].

My spec file is attached.
maskrcnn_retrain_resnet50.txt (2.1 KB)

The full error output is as follows.

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 254, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 250, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 237, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 88, in run_executer
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 418, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 699, in mask_rcnn_model_fn
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 533, in _model_fn
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 187, in build_model_graph
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/model_loader.py", line 104, in get_model_with_input
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/saving/model_config.py", line 92, in model_from_json
    return deserialize(config, custom_objects=custom_objects)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/serialization.py", line 105, in deserialize
    printable_module_name='layer')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/utils/generic_utils.py", line 191, in deserialize_keras_object
    list(custom_objects.items())))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/network.py", line 1076, in from_config
    process_node(layer, node_data)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/network.py", line 1034, in process_node
    layer(input_tensors, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 237, in wrapper
    raise e.ag_error_metadata.to_exception(e)
ValueError: in converted code:

    /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/layers/reshape_layer.py:25 call

    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/array_ops.py:131 reshape
        result = gen_array_ops.reshape(tensor, shape, name)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_array_ops.py:8115 reshape
        "Reshape", tensor=tensor, shape=shape, name=name)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py:794 _apply_op_helper
        op_def=op_def)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py:513 new_func
        return func(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:3357 create_op
        attrs, op_def, compute_device)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:3426 _create_op_internal
        op_def=op_def)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1770 __init__
        control_input_ops)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1610 _create_c_op
        raise ValueError(str(e))

    ValueError: Cannot reshape a tensor with 25690112 elements to shape [256,256,14,14] (12845056 elements) for 'mask_head_reshape_1/mask_head_reshape_1' (op: 'Reshape') with input shapes: [4,128,256,14,14], [4] and with input tensors computed as partial shapes: input[1] = [256,256,14,14].

--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[45090,1],1]
  Exit code:    1
--------------------------------------------------------------------------

2022-06-23 13:58:02,724 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

When batch_size is changed to 2, the error is removed, but I can only train with one GPU. Multi-GPU training doesn't work with the new TAO release.
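
For reference, the numbers in the error message line up with the batch size. A rough sanity check in Python (my own arithmetic, assuming the mask head flattens batch_size * rois_per_image into the leading 256; not taken from the TAO source):

incoming = 4 * 128 * 256 * 14 * 14          # 25690112 elements arriving with batch_size 4
expected = 256 * 256 * 14 * 14              # 12845056 elements that fit [256,256,14,14]
print(incoming // expected)                 # 2 -> exactly twice as many elements as the target shape holds
print(2 * 128 * 256 * 14 * 14 == expected)  # True -> with batch_size 2 the counts match, consistent with the error disappearing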

Do you mean the 22.05 release?

Yes, this is the latest release I upgraded to.

nvidia-tao==0.1.24

nvcr.io/nvidia/tao/tao-toolkit-tf                      v3.22.05-tf1.15.5-py3

My nvidia-smi displays all GPUs. Why can't I train with multiple GPUs? It used to work before.

Do you have any log from training with multiple GPUs?

I'll update soon, since I am currently training with a single GPU.

The full error logs are shown below.

For multi-GPU, change --gpus based on your machine.
2022-06-25 03:13:01,429 [INFO] root: Registry: ['nvcr.io']
2022-06-25 03:13:01,676 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
2022-06-25 03:13:01,809 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/sysadmin/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet18.txt
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet18.txt
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet18.txt
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet18.txt
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 254, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 154, in main
AssertionError: num_examples_per_epoch must be specified.         It should be the total number of images in the training set divided by the number of GPUs.
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 254, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 154, in main
AssertionError: num_examples_per_epoch must be specified.         It should be the total number of images in the training set divided by the number of GPUs.
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 254, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 154, in main
AssertionError: num_examples_per_epoch must be specified.         It should be the total number of images in the training set divided by the number of GPUs.
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 254, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 154, in main
AssertionError: num_examples_per_epoch must be specified.         It should be the total number of images in the training set divided by the number of GPUs.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[47769,1],2]
  Exit code:    1
--------------------------------------------------------------------------
2022-06-25 03:13:14,201 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Please set num_examples_per_epoch in your spec file.
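
For illustration only (the numbers are hypothetical, and the exact placement of the fields should be checked against the sample MaskRCNN spec): with 10,000 training images on 4 GPUs, the spec would contain lines like

num_examples_per_epoch: 2500    # total training images (10000, hypothetical) divided by 4 GPUs
train_batch_size: 2

Adjust the values to your own dataset and GPU count.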

I set it, but the error is still the same.
Please see the log file and spec file.
log (68.3 KB)
maskrcnn_train_resnet18.txt (2.1 KB)

Training with one GPU is fine. The problem is training with 4 GPUs.

From the latest log, it is not the same error.

[GPU 00] Restoring pretrained weights (105 Tensors)
[MaskRCNN] INFO    : Pretrained weights loaded with success...
    
[MaskRCNN] INFO    : Saving checkpoints for 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node ac48cf28eceb exited on signal 9 (Killed).
--------------------------------------------------------------------------
2022-06-25 08:42:31,024 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Can you open a terminal instead of running in the notebook? Please also use a new output folder. See below.

$ tao mask_rcnn train -e $SPECS_DIR/maskrcnn_train_resnet18.txt -d $USER_EXPERIMENT_DIR/experiment_dir_unpruned_new -k $KEY --gpus 4

I haven't tried running from a terminal before.
Am I supposed to run it through the tao docker?
When I run the following command from the terminal,

(launcher) (base) sysadmin@workstation:~/Nyan/cv_samples_v1.3.0$ tao mask_rcnn train -e /home/sysadmin/Nyan/cv_samples_v1.3.0/mask_rcnn/specs/maskrcnn_train_resnet18.txt -d /home/sysadmin/Nyan/cv_samples_v1.3.0/mask_rcnn/experiment_dir_unpruned -k nvidia_tlt --gpus 4

the error is that it can't find the spec file.

Loading specification from /home/sysadmin/Nyan/cv_samples_v1.3.0/mask_rcnn/specs/maskrcnn_train_resnet18.txt
 File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 254, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 104, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/spec_loader.py", line 53, in load_experiment_spec
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/spec_loader.py", line 36, in load_proto
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/spec_loader.py", line 30, in _load_from_file
FileNotFoundError: [Errno 2] No such file or directory: '/home/sysadmin/Nyan/cv_samples_v1.3.0/mask_rcnn/specs/maskrcnn_train_resnet18.txt'

What should I do?
The path is correct.

Can you share your ~/.tao_mounts.json?

Did you ever edit the ~/.tao_mounts.json according to the TAO Toolkit Launcher — TAO Toolkit 3.22.05 documentation?

Sorry, I misunderstood. Yes, the file shows:

{
    "Mounts": [
        {
            "source": "/home/sysadmin/Nyan/cv_samples_v1.3.0",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/sysadmin/Nyan/cv_samples_v1.3.0/mask_rcnn/specs",
            "destination": "/workspace/tao-experiments/mask_rcnn/specs"
        }
    ],
    "DockerOptions": {
        "shm_size": "32G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        }
    }
}

So, please modify the above command. Note that ~/.tao_mounts.json maps your local files into the docker; when you type the tao command, every path you pass must be the corresponding path inside the docker.
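
With the mounts above, for example, the spec file at /home/sysadmin/Nyan/cv_samples_v1.3.0/mask_rcnn/specs/maskrcnn_train_resnet18.txt is visible inside the container as /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet18.txt, so the command becomes: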

(launcher) (base) sysadmin@workstation:~/Nyan/cv_samples_v1.3.0$ tao mask_rcnn train -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet18.txt -d /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned_new -k nvidia_tlt --gpus 4

I changed to this command:
tao mask_rcnn train -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet18.txt -d /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned -k nvidia_tlt --gpus 4

Running from the terminal, the error is as follows.

INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[GPU 00] Restoring pretrained weights (105 Tensors)
[MaskRCNN] INFO    : Pretrained weights loaded with success...

[MaskRCNN] INFO    : Saving checkpoints for 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node 2ba4e56ebcdf exited on signal 9 (Killed).
--------------------------------------------------------------------------
2022-06-25 10:20:05,764 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Can you try experiment_dir_unpruned_new?

I cleaned everything in the folder.

To debug further, can you try running inside the docker?
$ tao mask_rcnn run /bin/bash

then run with 2 GPUs:
# mask_rcnn train -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet18.txt -d /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned_2gpu -k nvidia_tlt --gpus 2

and then try with 4 GPUs:
# mask_rcnn train -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet18.txt -d /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned_4gpu -k nvidia_tlt --gpus 4

2 GPUs worked.
4 GPUs didn't.