TLT MaskRCNN training with the Mapillary Vistas dataset failed with CUDA_ERROR_OUT_OF_MEMORY: out of memory

It is not necessary. I just used a dummy caption file.
Yes, you can comment out the code related to captions in create_coco_tf_record.py.
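
For reference, a minimal sketch of how such a dummy captions file could be built, assuming create_coco_tf_record.py only needs a parseable COCO-style captions JSON whose image list matches the instance annotations (the helper name and file paths are placeholders, not part of the tool):

```python
import json

# Hypothetical helper: write a dummy COCO-style captions JSON so that
# create_coco_tf_record.py can run even though Vistas has no captions.
# The image list is copied from the instance-annotation JSON so the two
# files stay consistent.
def write_dummy_captions(instances_json, out_json):
    with open(instances_json) as f:
        instances = json.load(f)
    dummy = {
        "info": instances.get("info", {}),
        "licenses": instances.get("licenses", []),
        "images": instances["images"],
        # One empty caption per image keeps any caption parsing happy.
        "annotations": [
            {"id": i, "image_id": img["id"], "caption": ""}
            for i, img in enumerate(instances["images"])
        ],
    }
    with open(out_json, "w") as f:
        json.dump(dummy, f)

# Example (placeholder paths):
# write_dummy_captions("instances_random500_train.json", "captions_dummy.json")
```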

One more question. When I succeeded in training with your tfrecords and json, I set my spec like this:

data_config{

training_file_pattern: "/workspace/tlt-experiments/mapillary/result0525_images_500_resize/train*.tfrecord"
validation_file_pattern: "/workspace/tlt-experiments/mapillary/tf_resized/val*.tfrecord"
val_json_file: "/workspace/tlt-experiments/mapillary/annotations/instances_random500_shape_validation2020_resize.json"

}

I thought instances_random500_shape_validation2020_resize.json was the validation annotations file, since that is what val_json_file should be set to. But judging from your tfrecord generation command, instances_random500_shape_validation2020_resize.json is the training annotations file?

I just generated tfrecords and their corresponding json file to check the OOM issue. It does not matter whether they are used as the training or the validation set.

As you may know, I picked 1000 random images from Vistas for training and 500 images for validation, and followed the same steps as you did to get the 1/8 resized tfrecords and json. If I train with the 1000 images, I get the error messages below; I am not sure whether it is still memory related. But if I train with the 500 validation images, everything is OK. So is it related to the amount of data?

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_15647}} Input to reshape is a tensor with 3525472 values, but the requested shape has 2691200
         [[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
         [[IteratorGetNext]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 196, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 192, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 91, in run_executer
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 394, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError:  Input to reshape is a tensor with 3525472 values, but the requested shape has 2691200
         [[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
         [[IteratorGetNext]]
[MaskRCNN] ERROR   : Job finished with an uncaught exception: `FAILURE`
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
         [[{{node DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_178_0}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 196, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 192, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 91, in run_executer
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 394, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
         [[node DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_178_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_178_0':
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 196, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 192, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 91, in run_executer
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 394, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 548, in mask_rcnn_model_fn
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 493, in _model_fn
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 256, in compute_gradients
    avg_grads = self._allreduce_grads(grads)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 210, in allreduce_grads
    for grad in grads]
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 210, in <listcomp>
    for grad in grads]
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 80, in allreduce
    summed_tensor_compressed = _allreduce(tensor_compressed)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 86, in _allreduce
    return MPI_LIB.horovod_allreduce(tensor, name=name)
  File "<string>", line 80, in horovod_allreduce
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
         [[{{node DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_178_0}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 196, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 192, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 91, in run_executer
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 394, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
         [[node DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_178_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_178_0':
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 196, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 192, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 91, in run_executer
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 394, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 548, in mask_rcnn_model_fn
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 493, in _model_fn
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 256, in compute_gradients
    avg_grads = self._allreduce_grads(grads)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 210, in allreduce_grads
    for grad in grads]
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 210, in <listcomp>
    for grad in grads]
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 80, in allreduce
    summed_tensor_compressed = _allreduce(tensor_compressed)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 86, in _allreduce
    return MPI_LIB.horovod_allreduce(tensor, name=name)
  File "<string>", line 80, in horovod_allreduce
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[47062,1],0]
  Exit code:    1
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/bin/mask_rcnn", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/entrypoint/mask_rcnn.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-06-02 11:21:09,128 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Can you double-check your tfrecord files for the 1000 images? Are they mixed up with the tfrecord files of the 500 images?
BTW, can you share the 1000 images along with their json file?

I think I’ve figured out what the problem is. It’s about the size of a single tfrecord file. Your random 500 images’ tfrecords were generated with the default num_shards in create_coco_tf_record.py, which is 256. I also generated my 1000 images with the default num_shards, which made each tfrecord file about double the size.

I increased the num_shards in create_coco_tf_record.py to 512 and re-generated the tfrecords. There were no error messages any more.

My question is: a single tfrecord file of about ~300 KB lets the training go ahead now, but a single tfrecord file in the COCO dataset is ~80 MB. Why does that happen?
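
For reference, a quick way to compare per-shard file size and record count between two tfrecord sets (a sketch only; paths are placeholders and it assumes the TF 1.x record iterator available in the TLT container):

```python
import glob
import os

import tensorflow as tf

# Sketch: report shard count, average file size, and average record count
# for a few tfrecord patterns, to see how num_shards affects shard size.
patterns = [
    "tf_resized/val*.tfrecord",              # placeholder paths
    "random_1000/tfrecords/train*.tfrecord",
]
for pattern in patterns:
    files = sorted(glob.glob(pattern))
    if not files:
        continue
    sizes = [os.path.getsize(f) for f in files]
    counts = [sum(1 for _ in tf.compat.v1.io.tf_record_iterator(f)) for f in files]
    print("%s: %d shards, avg %.1f KB, avg %.1f records per shard"
          % (pattern, len(files),
             sum(sizes) / len(files) / 1024.0,
             sum(counts) / float(len(files))))
```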

Can you set num_shards back to the default of 256, generate tfrecords for the 1000 images, and then retry?

Sorry, my mistake. The num_shards change didn’t help with my 1000-image training. It was the training_file_pattern in the spec, which I had set to val*.tfrecord (generated from the 500 val images with num_shards 256), that made the training succeed.

Here’s the 1/8 resized 1000 images:
https://drive.google.com/file/d/1ymqOKKFN3u8qmHTYlqyBumIAVqXOeMck/view?usp=sharing

Json:
https://drive.google.com/file/d/16WpE_Pi0M_dnPtp_UmDvMfR4-l3fNtZH/view?usp=sharing

Generated tfrecords:
https://drive.google.com/file/d/1ocz7NADPwkXQPaAqECCcirv8OLFvQ8x2/view?usp=sharing

Hi,

It seems the problem comes from the shape mismatch. May I know which parameter I could set to increase the requested shape?

tensorflow.python.framework.errors_impl.InvalidArgumentError:  Input to reshape is a tensor with 3525472 values, but the requested shape has 2691200
         [[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
         [[IteratorGetNext]]

What is your training spec? The mismatch may come from the dataset. We need to inspect it.

My spec:

    seed: 123
    use_amp: False
    warmup_steps: 1000
    checkpoint: "/workspace/tlt-experiments/maskrcnn/pretrained_resnet50/tlt_instance_segmentation_vresnet50/resnet50.hdf5"
    learning_rate_steps: "[10000, 15000, 20000]"
    learning_rate_decay_levels: "[0.1, 0.02, 0.01]"
    total_steps: 25000
    train_batch_size: 1
    eval_batch_size: 1
    num_steps_per_eval: 5000
    momentum: 0.9
    l2_weight_decay: 0.0001
    warmup_learning_rate: 0.0001
    init_learning_rate: 0.01

    data_config{
        image_size: "(128, 128)"#"(832, 1344)"
        augment_input_data: True
        eval_samples: 500
        training_file_pattern: "/workspace/tlt-experiments/mapillary/tf_resized/train*.tfrecord"
        validation_file_pattern: "/workspace/tlt-experiments/mapillary/tf_resized/val*.tfrecord"
        val_json_file: "/workspace/tlt-experiments/mapillary/annotations/instances_random500_shape_validation2020.json"

        # dataset specific parameters
        num_classes: 124
        skip_crowd_during_training: True
    }

    maskrcnn_config {
        nlayers: 50
        arch: "resnet"
        freeze_bn: True
        freeze_blocks: "[0,1]"
        gt_mask_size: 112
            
        # Region Proposal Network
        rpn_positive_overlap: 0.7
        rpn_negative_overlap: 0.3
        rpn_batch_size_per_im: 256
        rpn_fg_fraction: 0.5
        rpn_min_size: 0.

        # Proposal layer.
        batch_size_per_im: 512
        fg_fraction: 0.25
        fg_thresh: 0.5
        bg_thresh_hi: 0.5
        bg_thresh_lo: 0.

        # Faster-RCNN heads.
        fast_rcnn_mlp_head_dim: 1024
        bbox_reg_weights: "(10., 10., 5., 5.)"

        # Mask-RCNN heads.
        include_mask: True
        mrcnn_resolution: 28

        # training
        train_rpn_pre_nms_topn: 2000
        train_rpn_post_nms_topn: 1000
        train_rpn_nms_threshold: 0.7

        # evaluation
        test_detections_per_image: 100
        test_nms: 0.5
        test_rpn_pre_nms_topn: 1000
        test_rpn_post_nms_topn: 1000
        test_rpn_nms_thresh: 0.7

        # model architecture
        min_level: 2
        max_level: 6
        num_scales: 1
        aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
        anchor_scale: 8

        # localization loss
        rpn_box_loss_weight: 1.0
        fast_rcnn_box_loss_weight: 1.0
        mrcnn_weight_loss_mask: 1.0
    }

Could you also try to verify the below too? (See the subset-building sketch after this list.)
500 images: no issue.
750 images: check whether the mismatching issue appears. If yes, how about 600 images, 550 images, …?
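
A possible way to build such subsets is to slice the COCO-format instances json and re-run tfrecord generation on it (a sketch; the helper name and file names are placeholders):

```python
import json
import random

# Sketch: build a COCO-format subset json with n randomly picked images so
# that 750/600/550-image tfrecord sets can be regenerated for the test above.
def make_subset(instances_json, out_json, n, seed=123):
    with open(instances_json) as f:
        coco = json.load(f)
    random.seed(seed)
    images = random.sample(coco["images"], n)
    keep = {img["id"] for img in images}
    subset = dict(coco)
    subset["images"] = images
    subset["annotations"] = [a for a in coco["annotations"] if a["image_id"] in keep]
    with open(out_json, "w") as f:
        json.dump(subset, f)

# make_subset("instances_random1000_train_resize.json", "instances_subset750.json", 750)
```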

I ran experiments on your tfrecords.
No issue when I tried
“train-0000*-of-00256.tfrecord” or
“train-0001*-of-00256.tfrecord” or
“train-0002*-of-00256.tfrecord” or
“train-0003*-of-00256.tfrecord”.

But when I tried “train-0004*-of-00256.tfrecord” (10 tfrecord files in total), the mismatching issue happened:

training_file_pattern: "/workspace/demo_3.0/maskrcnn_cvat/tfrecords_OOM/random_1000/tfrecords/train-0004*-of-00256.tfrecord"

So please use the above approach to narrow down the issue. There should be something wrong in some of the tfrecord files.
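
As a complement to narrowing it down by training on shard subsets, the shards can also be inspected offline: decode each record and check that the encoded image and every encoded instance mask agree with the stored height/width. This is only a sketch; the feature keys below (image/encoded, image/height, image/width, image/object/mask) follow the usual COCO-style tfrecord layout and may differ in your create_coco_tf_record.py.

```python
import glob
import io

import tensorflow as tf
from PIL import Image

def check_shard(path):
    """Return (record_index, field, actual_size, expected_size) for bad records."""
    bad = []
    for i, rec in enumerate(tf.compat.v1.io.tf_record_iterator(path)):
        ex = tf.train.Example()
        ex.ParseFromString(rec)
        feat = ex.features.feature
        h = feat["image/height"].int64_list.value[0]
        w = feat["image/width"].int64_list.value[0]
        img = Image.open(io.BytesIO(feat["image/encoded"].bytes_list.value[0]))
        if img.size != (w, h):
            bad.append((i, "image", img.size, (w, h)))
        # PNG-encoded instance masks, one entry per object (assumed key).
        for m in feat["image/object/mask"].bytes_list.value:
            mask = Image.open(io.BytesIO(m))
            if mask.size != (w, h):
                bad.append((i, "mask", mask.size, (w, h)))
    return bad

for shard in sorted(glob.glob("train-0004*-of-00256.tfrecord")):  # placeholder pattern
    problems = check_shard(shard)
    if problems:
        print(shard, problems[:3])
```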

I’ve tried randomly picking 750 images from my 1000 images, then 500 images from those 750, and even 200 images from the 500; all of them had the mismatching issue…
I also tried picking another 2 sets of 1000 images randomly from Mapillary Vistas’s 18000 training images, and they also failed with the mismatching issue…

Please see my experiments above: “/demo_3.0/maskrcnn_cvat/tfrecords_OOM/random_1000/tfrecords/train-0004*-of-00256.tfrecord” has the mismatching issue.

No issue when I tried
“train-0000*-of-00256.tfrecord” or
“train-0001*-of-00256.tfrecord” or
“train-0002*-of-00256.tfrecord” or
“train-0003*-of-00256.tfrecord”.

Is there something wrong with the images in train-0004*-of-00256.tfrecord? But all the tfrecords were generated in the same way, and the 500 val images that show no issue also went through the same process.

Firstly, please check if you can get the same result as mine.

I tried your experiments as above (with --gpus 1 on the command line), using
training_file_pattern: "/workspace/tlt-experiments/mapillary/tf_resized/train-0001*-of-00256.tfrecord"
or
training_file_pattern: "/workspace/tlt-experiments/mapillary/tf_resized/train-0004*-of-00256.tfrecord"
Instead of the mismatching issue, I got the following error:

[MaskRCNN] INFO    : # ============================================= #
[MaskRCNN] INFO    :                  Start Training
[MaskRCNN] INFO    : # %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% #

[GPU 00] Restoring pretrained weights (307 Tensors) from: /tmp/tmpamkb67yb/model.ckpt-5000
[MaskRCNN] INFO    : Pretrained weights loaded with success...

[MaskRCNN] INFO    : Saving checkpoints for 5000 into /workspace/tlt-experiments/maskrcnn/experiment_dir_unpruned/model.step-5000.tlt.
[MaskRCNN] INFO    : timestamp: 1623727825.9933314
[MaskRCNN] INFO    : iteration: 5005
DLL 2021-06-15 03:30:25.994966 -  iteration : 5005
[MaskRCNN] INFO    : throughput: 1.5 samples/sec
DLL 2021-06-15 03:30:25.995717 - Iteration: 5005  throughput : 1.534986104559237
[MaskRCNN] INFO    : ==================== Metrics =====================
[MaskRCNN] INFO    : FastRCNN box loss: 0.25521
DLL 2021-06-15 03:30:25.997679 - Iteration: 5005  FastRCNN box loss : 0.25521
[MaskRCNN] INFO    : FastRCNN class loss: 1.43679
DLL 2021-06-15 03:30:25.998058 - Iteration: 5005  FastRCNN class loss : 1.43679
[MaskRCNN] INFO    : FastRCNN total loss: 1.692
DLL 2021-06-15 03:30:25.998390 - Iteration: 5005  FastRCNN total loss : 1.692
[MaskRCNN] INFO    : L2 loss: 2.05043
DLL 2021-06-15 03:30:25.998713 - Iteration: 5005  L2 loss : 2.05043
[MaskRCNN] INFO    : Learning rate: 0.01
DLL 2021-06-15 03:30:25.999047 - Iteration: 5005  Learning rate : 0.01
[MaskRCNN] INFO    : Mask loss: 1.20826
DLL 2021-06-15 03:30:25.999371 - Iteration: 5005  Mask loss : 1.20826
[MaskRCNN] INFO    : RPN box loss: 0.13479
DLL 2021-06-15 03:30:25.999682 - Iteration: 5005  RPN box loss : 0.13479
[MaskRCNN] INFO    : RPN score loss: 1.17163
DLL 2021-06-15 03:30:26 - Iteration: 5005  RPN score loss : 1.17163
[MaskRCNN] INFO    : RPN total loss: 1.30641
DLL 2021-06-15 03:30:26.000312 - Iteration: 5005  RPN total loss : 1.30641
[MaskRCNN] INFO    : Total loss: 6.25711
DLL 2021-06-15 03:30:26.000659 - Iteration: 5005  Total loss : 6.25711

[MaskRCNN] INFO    : timestamp: 1623727828.1025558
[MaskRCNN] INFO    : iteration: 5010
DLL 2021-06-15 03:30:28.103584 -  iteration : 5010
[MaskRCNN] INFO    : throughput: 2.0 samples/sec
DLL 2021-06-15 03:30:28.104061 - Iteration: 5010  throughput : 1.9517826742772044
[MaskRCNN] INFO    : ==================== Metrics =====================
[MaskRCNN] INFO    : FastRCNN box loss: 53.14843
DLL 2021-06-15 03:30:28.105638 - Iteration: 5010  FastRCNN box loss : 53.14843
[MaskRCNN] INFO    : FastRCNN class loss: 1403.56396
DLL 2021-06-15 03:30:28.105985 - Iteration: 5010  FastRCNN class loss : 1403.56396
[MaskRCNN] INFO    : FastRCNN total loss: 1456.7124
DLL 2021-06-15 03:30:28.106302 - Iteration: 5010  FastRCNN total loss : 1456.7124
[MaskRCNN] INFO    : L2 loss: 2.05362
DLL 2021-06-15 03:30:28.106613 - Iteration: 5010  L2 loss : 2.05362
[MaskRCNN] INFO    : Learning rate: 0.01
DLL 2021-06-15 03:30:28.106922 - Iteration: 5010  Learning rate : 0.01
[MaskRCNN] INFO    : Mask loss: 3.75593
DLL 2021-06-15 03:30:28.107217 - Iteration: 5010  Mask loss : 3.75593
[MaskRCNN] INFO    : RPN box loss: 1.09253
DLL 2021-06-15 03:30:28.107506 - Iteration: 5010  RPN box loss : 1.09253
[MaskRCNN] INFO    : RPN score loss: 79.7173
DLL 2021-06-15 03:30:28.107788 - Iteration: 5010  RPN score loss : 79.7173
[MaskRCNN] INFO    : RPN total loss: 80.80984
DLL 2021-06-15 03:30:28.108069 - Iteration: 5010  RPN total loss : 80.80984
[MaskRCNN] INFO    : Total loss: 1543.33179
DLL 2021-06-15 03:30:28.108349 - Iteration: 5010  Total loss : 1543.33179

ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 196, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 192, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 91, in run_executer
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 394, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1426, in run
    run_metadata=run_metadata))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/basic_session_run_hooks.py", line 761, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.

[MaskRCNN] INFO    : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
[MaskRCNN] INFO    :           Training Performance Summary
[MaskRCNN] INFO    : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
DLL 2021-06-15 03:30:33.191428 -   : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
DLL 2021-06-15 03:30:33.191556 -   :           Training Performance Summary
DLL 2021-06-15 03:30:33.191611 -   : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #

DLL 2021-06-15 03:30:33.191688 -  Average_throughput : 2.1 samples/sec
DLL 2021-06-15 03:30:33.191740 -  Total processed steps : 5012
DLL 2021-06-15 03:30:33.191797 -  Total_processing_time : 0h 00m 00s
[MaskRCNN] INFO    : Average throughput: 2.1 samples/sec
[MaskRCNN] INFO    : Total processed steps: 5012
[MaskRCNN] INFO    : Total processing time: 0h 00m 00s
DLL 2021-06-15 03:30:33.192110 -   : ==================== Metrics ====================
[MaskRCNN] INFO    : ==================== Metrics ====================
[MaskRCNN] INFO    : FastRCNN box loss: 53.14843
DLL 2021-06-15 03:30:33.192554 -  FastRCNN box loss : 53.14843
[MaskRCNN] INFO    : FastRCNN class loss: 1403.56396
DLL 2021-06-15 03:30:33.192733 -  FastRCNN class loss : 1403.56396
[MaskRCNN] INFO    : FastRCNN total loss: 1456.7124
DLL 2021-06-15 03:30:33.192922 -  FastRCNN total loss : 1456.7124
[MaskRCNN] INFO    : L2 loss: 2.05362
DLL 2021-06-15 03:30:33.193088 -  L2 loss : 2.05362
[MaskRCNN] INFO    : Learning rate: 0.01
DLL 2021-06-15 03:30:33.193252 -  Learning rate : 0.01
[MaskRCNN] INFO    : Mask loss: 3.75593
DLL 2021-06-15 03:30:33.193422 -  Mask loss : 3.75593
[MaskRCNN] INFO    : RPN box loss: 1.09253
DLL 2021-06-15 03:30:33.193582 -  RPN box loss : 1.09253
[MaskRCNN] INFO    : RPN score loss: 79.7173
DLL 2021-06-15 03:30:33.193756 -  RPN score loss : 79.7173
[MaskRCNN] INFO    : RPN total loss: 80.80984
DLL 2021-06-15 03:30:33.193930 -  RPN total loss : 80.80984
[MaskRCNN] INFO    : Total loss: 1543.33179
DLL 2021-06-15 03:30:33.194088 -  Total loss : 1543.33179

[MaskRCNN] ERROR   : Job finished with an uncaught exception: `FAILURE`
Traceback (most recent call last):
  File "/usr/local/bin/mask_rcnn", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/entrypoint/mask_rcnn.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-06-15 11:30:38,341 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

That is a NaN error. Let us focus on the mismatching error first.
Please continue to test the other tfrecord files.

I’ve tested all the tfrecords. The following tfrecords have the mismatching issue on my side:
train-0013*-of-00256.tfrecord
train-0022*-of-00256.tfrecord
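
If it helps, the image filenames inside those shards can be dumped so the original Vistas images can be inspected directly (a sketch; it assumes the standard image/filename feature key, adjust if your records store it differently):

```python
import glob

import tensorflow as tf

# Sketch: print the image filenames stored in the problematic shards.
for pattern in ["train-0013*-of-00256.tfrecord", "train-0022*-of-00256.tfrecord"]:
    for shard in sorted(glob.glob(pattern)):
        for rec in tf.compat.v1.io.tf_record_iterator(shard):
            ex = tf.train.Example()
            ex.ParseFromString(rec)
            names = ex.features.feature["image/filename"].bytes_list.value
            print(shard, names[0].decode() if names else "<no filename>")
```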