It is not necessary. I just use a dummy caption file.
Yes, you can comment out the code related to captions in create_coco_tf_record.py.
One more question. When I succeeded in training with your tfrecords and json, I set my spec like:
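For reference, a minimal sketch of such a dummy captions file is below. The file name is arbitrary, and it assumes the converter only needs the standard COCO-style top-level keys to be present; this is an illustration, not the converter's documented requirement.

import json

# Minimal dummy COCO-style captions file (hypothetical name), assuming the
# converter only needs the standard top-level keys to exist.
dummy_captions = {
    "info": {"description": "dummy captions"},
    "images": [],        # can stay empty if captions are never actually read
    "annotations": [],   # real entries would hold {"image_id", "id", "caption"}
    "licenses": []
}

with open("dummy_captions.json", "w") as f:
    json.dump(dummy_captions, f)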
data_config{
…
training_file_pattern: "/workspace/tlt-experiments/mapillary/result0525_images_500_resize/train*.tfrecord"
validation_file_pattern: "/workspace/tlt-experiments/mapillary/tf_resized/val*.tfrecord"
val_json_file: "/workspace/tlt-experiments/mapillary/annotations/instances_random500_shape_validation2020_resize.json"
…
}
I thought instances_random500_shape_validation2020_resize.json was the validation annotations file, which is what val_json_file should be set to. But judging from your tfrecords generation command, is instances_random500_shape_validation2020_resize.json actually the training annotations file?
I just generated those tfrecords and their corresponding json file to check the OOM issue; it does not matter whether they are used as the training files or the validation files.
As you may know, I've picked 1000 random images from Vistas for training and 500 images for validation. I followed the same steps as you did to get the 1/8-resized tfrecords and json. If I trained with the 1000 images, I got the error messages below. I'm not sure whether it is still related to memory. But if I trained with those 500 validation images, everything was OK. So is it related to the amount of data?
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_15647}} Input to reshape is a tensor with 3525472 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[IteratorGetNext]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 196, in <module>
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 192, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 91, in run_executer
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 394, in train_and_eval
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 3525472 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[IteratorGetNext]]
[MaskRCNN] ERROR : Job finished with an uncaught exception: `FAILURE`
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
[[{{node DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_178_0}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 196, in <module>
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 192, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 91, in run_executer
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 394, in train_and_eval
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
[[node DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_178_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Original stack trace for 'DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_178_0':
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 196, in <module>
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 192, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 91, in run_executer
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 394, in train_and_eval
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default
features, labels, ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 548, in mask_rcnn_model_fn
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 493, in _model_fn
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 256, in compute_gradients
avg_grads = self._allreduce_grads(grads)
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 210, in allreduce_grads
for grad in grads]
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 210, in <listcomp>
for grad in grads]
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 80, in allreduce
summed_tensor_compressed = _allreduce(tensor_compressed)
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 86, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File "<string>", line 80, in horovod_allreduce
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
[[{{node DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_178_0}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 196, in <module>
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 192, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 91, in run_executer
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 394, in train_and_eval
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
[[node DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_178_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Original stack trace for 'DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_AddN_178_0':
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 196, in <module>
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 192, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 91, in run_executer
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 394, in train_and_eval
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default
features, labels, ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 548, in mask_rcnn_model_fn
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/models/mask_rcnn_model.py", line 493, in _model_fn
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 256, in compute_gradients
avg_grads = self._allreduce_grads(grads)
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 210, in allreduce_grads
for grad in grads]
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 210, in <listcomp>
for grad in grads]
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 80, in allreduce
summed_tensor_compressed = _allreduce(tensor_compressed)
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 86, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File "<string>", line 80, in horovod_allreduce
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[47062,1],0]
Exit code: 1
--------------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/bin/mask_rcnn", line 8, in <module>
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/entrypoint/mask_rcnn.py", line 12, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-06-02 11:21:09,128 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
Can you double-check your tfrecords files for the 1000 images? Are they mixed up with the tfrecords files of the 500 images?
BTW, can you share the 1000 images along with their json file?
I think I've figured out what the problem is. It's about the size of a single tfrecord file. Your random 500 images' tfrecords were generated with the default num_shards in create_coco_tf_record.py, which is 256. I also generated my 1000 images with the default num_shards, which made each tfrecord file about double the size.
I increased num_shards in create_coco_tf_record.py to 512 and re-generated the tfrecords. There were no error messages any more.
My question is: a single tfrecord file of about ~300 KB lets the training go ahead now, but a single tfrecord file in the COCO dataset is ~80 MB. Why is that?
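For what it's worth, a quick generic way to compare per-shard sizes between the two runs is simply to list the shard files; the directory names below are placeholders.

import glob
import os

# List tfrecord shards and their sizes (directories and patterns are placeholders).
for pattern in ["tf_resized_256/train-*.tfrecord", "tf_resized_512/train-*.tfrecord"]:
    files = sorted(glob.glob(pattern))
    if not files:
        continue
    total = sum(os.path.getsize(f) for f in files)
    print(f"{pattern}: {len(files)} shards, "
          f"avg {total / len(files) / 1024:.1f} KB, total {total / 1024 / 1024:.1f} MB")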
Can you set num_shards back to the default 256, generate tfrecords for the 1000 images, and then retry?
Sorry, my mistake. num_shards didn't help with my 1000-image training. It was the training_file_pattern in my spec, which I had set to val*.tfrecord (generated from the 500 validation images with num_shards 256), that made the training succeed.
Here are the 1/8-resized 1000 images:
https://drive.google.com/file/d/1ymqOKKFN3u8qmHTYlqyBumIAVqXOeMck/view?usp=sharing
Json:
https://drive.google.com/file/d/16WpE_Pi0M_dnPtp_UmDvMfR4-l3fNtZH/view?usp=sharing
Generated tfrecords:
https://drive.google.com/file/d/1ocz7NADPwkXQPaAqECCcirv8OLFvQ8x2/view?usp=sharing
Hi,
It seems the problem comes from the shape mismatch. May I know which parameter I could set to increase the requested shape?
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 3525472 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[IteratorGetNext]]
What is your training spec? The mismatch may result from the dataset. We need to inspect it.
My spec:
seed: 123
use_amp: False
warmup_steps: 1000
checkpoint: "/workspace/tlt-experiments/maskrcnn/pretrained_resnet50/tlt_instance_segmentation_vresnet50/resnet50.hdf5"
learning_rate_steps: "[10000, 15000, 20000]"
learning_rate_decay_levels: "[0.1, 0.02, 0.01]"
total_steps: 25000
train_batch_size: 1
eval_batch_size: 1
num_steps_per_eval: 5000
momentum: 0.9
l2_weight_decay: 0.0001
warmup_learning_rate: 0.0001
init_learning_rate: 0.01
data_config{
image_size: "(128, 128)"#"(832, 1344)"
augment_input_data: True
eval_samples: 500
training_file_pattern: "/workspace/tlt-experiments/mapillary/tf_resized/train*.tfrecord"
validation_file_pattern: "/workspace/tlt-experiments/mapillary/tf_resized/val*.tfrecord"
val_json_file: "/workspace/tlt-experiments/mapillary/annotations/instances_random500_shape_validation2020.json"
# dataset specific parameters
num_classes: 124
skip_crowd_during_training: True
}
maskrcnn_config {
nlayers: 50
arch: "resnet"
freeze_bn: True
freeze_blocks: "[0,1]"
gt_mask_size: 112
# Region Proposal Network
rpn_positive_overlap: 0.7
rpn_negative_overlap: 0.3
rpn_batch_size_per_im: 256
rpn_fg_fraction: 0.5
rpn_min_size: 0.
# Proposal layer.
batch_size_per_im: 512
fg_fraction: 0.25
fg_thresh: 0.5
bg_thresh_hi: 0.5
bg_thresh_lo: 0.
# Faster-RCNN heads.
fast_rcnn_mlp_head_dim: 1024
bbox_reg_weights: "(10., 10., 5., 5.)"
# Mask-RCNN heads.
include_mask: True
mrcnn_resolution: 28
# training
train_rpn_pre_nms_topn: 2000
train_rpn_post_nms_topn: 1000
train_rpn_nms_threshold: 0.7
# evaluation
test_detections_per_image: 100
test_nms: 0.5
test_rpn_pre_nms_topn: 1000
test_rpn_post_nms_topn: 1000
test_rpn_nms_thresh: 0.7
# model architecture
min_level: 2
max_level: 6
num_scales: 1
aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
anchor_scale: 8
# localization loss
rpn_box_loss_weight: 1.0
fast_rcnn_box_loss_weight: 1.0
mrcnn_weight_loss_mask: 1.0
}
Could you also try to verify the cases below as well?
500 images: no issue
750 images: check whether the mismatching issue appears. If yes, how about 600 images, 550 images, …? (A small helper sketch for building such random subsets follows below.)
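To make those subset experiments easy to repeat, here is a rough helper sketch that samples N images from a COCO-style annotation json and keeps only their annotations. The file names are placeholders, and it assumes the standard COCO keys (images, annotations, categories).

import json
import random

def sample_coco_subset(ann_path, out_path, n, seed=123):
    """Sample n images (and their annotations) from a COCO-style json."""
    with open(ann_path) as f:
        coco = json.load(f)
    images = random.Random(seed).sample(coco["images"], n)
    keep_ids = {img["id"] for img in images}
    subset = {
        "info": coco.get("info", {}),
        "licenses": coco.get("licenses", []),
        "categories": coco["categories"],
        "images": images,
        "annotations": [a for a in coco["annotations"] if a["image_id"] in keep_ids],
    }
    with open(out_path, "w") as f:
        json.dump(subset, f)

# Example usage (paths are placeholders):
# sample_coco_subset("instances_random1000_train.json", "instances_random750_train.json", 750)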
I ran experiments on your tfrecords.
No issue when I tried
“train-0000*-of-00256.tfrecord” or
“train-0001*-of-00256.tfrecord” or
“train-0002*-of-00256.tfrecord” or
“train-0003*-of-00256.tfrecord”.
But when I tried “train-0004*-of-00256.tfrecord” (10 tfrecords files in total), the mismatching issue happened.
training_file_pattern: "/workspace/demo_3.0/maskrcnn_cvat/tfrecords_OOM/random_1000/tfrecords/train-0004*-of-00256.tfrecord"
So please use the same approach to narrow down the issue. There should be something wrong with some of the tfrecords files.
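As a complement to bisecting by shard pattern, a rough sketch like the one below can dump what each suspect shard contains. It assumes the usual create_coco_tf_record.py feature keys (image/filename, image/height, image/width, image/object/mask), which may not exactly match your converter; the directory is a placeholder.

import glob

import tensorflow as tf

# Dump basic per-record info from a suspect shard pattern (path is a placeholder).
for path in sorted(glob.glob("tfrecords/train-0004*-of-00256.tfrecord")):
    for i, record in enumerate(tf.compat.v1.io.tf_record_iterator(path)):
        example = tf.train.Example()
        example.ParseFromString(record)
        feat = example.features.feature
        filename = feat["image/filename"].bytes_list.value[0].decode()
        height = feat["image/height"].int64_list.value[0]
        width = feat["image/width"].int64_list.value[0]
        n_masks = len(feat["image/object/mask"].bytes_list.value)
        print(f"{path} record {i}: {filename}, {height}x{width}, {n_masks} masks")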
I tried randomly picking 750 images from my 1000 images, then 500 images from those 750, and even 200 images from the 500; all of them had the mismatching issue…
I also tried picking another 2 sets of 1000 images randomly from Mapillary Vistas's 18000 training images; they also failed with the mismatching issue…
Please see my experiments above: /demo_3.0/maskrcnn_cvat/tfrecords_OOM/random_1000/tfrecords/train-0004*-of-00256.tfrecord has the mismatching issue.
No issue when I tried
“train-0000*-of-00256.tfrecord” or
“train-0001*-of-00256.tfrecord” or
“train-0002*-of-00256.tfrecord” or
“train-0003*-of-00256.tfrecord”.
Is there something wrong with the images in train-0004*-of-00256.tfrecord? But all the tfrecords were generated in the same way, and the 500 validation images that had no issue also went through the same steps to get their tfrecords.
First, please check whether you can get the same result as mine.
I tried your experiments as above (with --gpus 1 on the command line),
training_file_pattern: "/workspace/tlt-experiments/mapillary/tf_resized/train-0001*-of-00256.tfrecord"
or
training_file_pattern: "/workspace/tlt-experiments/mapillary/tf_resized/train-0004*-of-00256.tfrecord"
Instead of the mismatching issue, I got the following error:
[MaskRCNN] INFO : # ============================================= #
[MaskRCNN] INFO : Start Training
[MaskRCNN] INFO : # %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% #
[GPU 00] Restoring pretrained weights (307 Tensors) from: /tmp/tmpamkb67yb/model.ckpt-5000
[MaskRCNN] INFO : Pretrained weights loaded with success...
[MaskRCNN] INFO : Saving checkpoints for 5000 into /workspace/tlt-experiments/maskrcnn/experiment_dir_unpruned/model.step-5000.tlt.
[MaskRCNN] INFO : timestamp: 1623727825.9933314
[MaskRCNN] INFO : iteration: 5005
DLL 2021-06-15 03:30:25.994966 - iteration : 5005
[MaskRCNN] INFO : throughput: 1.5 samples/sec
DLL 2021-06-15 03:30:25.995717 - Iteration: 5005 throughput : 1.534986104559237
[MaskRCNN] INFO : ==================== Metrics =====================
[MaskRCNN] INFO : FastRCNN box loss: 0.25521
DLL 2021-06-15 03:30:25.997679 - Iteration: 5005 FastRCNN box loss : 0.25521
[MaskRCNN] INFO : FastRCNN class loss: 1.43679
DLL 2021-06-15 03:30:25.998058 - Iteration: 5005 FastRCNN class loss : 1.43679
[MaskRCNN] INFO : FastRCNN total loss: 1.692
DLL 2021-06-15 03:30:25.998390 - Iteration: 5005 FastRCNN total loss : 1.692
[MaskRCNN] INFO : L2 loss: 2.05043
DLL 2021-06-15 03:30:25.998713 - Iteration: 5005 L2 loss : 2.05043
[MaskRCNN] INFO : Learning rate: 0.01
DLL 2021-06-15 03:30:25.999047 - Iteration: 5005 Learning rate : 0.01
[MaskRCNN] INFO : Mask loss: 1.20826
DLL 2021-06-15 03:30:25.999371 - Iteration: 5005 Mask loss : 1.20826
[MaskRCNN] INFO : RPN box loss: 0.13479
DLL 2021-06-15 03:30:25.999682 - Iteration: 5005 RPN box loss : 0.13479
[MaskRCNN] INFO : RPN score loss: 1.17163
DLL 2021-06-15 03:30:26 - Iteration: 5005 RPN score loss : 1.17163
[MaskRCNN] INFO : RPN total loss: 1.30641
DLL 2021-06-15 03:30:26.000312 - Iteration: 5005 RPN total loss : 1.30641
[MaskRCNN] INFO : Total loss: 6.25711
DLL 2021-06-15 03:30:26.000659 - Iteration: 5005 Total loss : 6.25711
[MaskRCNN] INFO : timestamp: 1623727828.1025558
[MaskRCNN] INFO : iteration: 5010
DLL 2021-06-15 03:30:28.103584 - iteration : 5010
[MaskRCNN] INFO : throughput: 2.0 samples/sec
DLL 2021-06-15 03:30:28.104061 - Iteration: 5010 throughput : 1.9517826742772044
[MaskRCNN] INFO : ==================== Metrics =====================
[MaskRCNN] INFO : FastRCNN box loss: 53.14843
DLL 2021-06-15 03:30:28.105638 - Iteration: 5010 FastRCNN box loss : 53.14843
[MaskRCNN] INFO : FastRCNN class loss: 1403.56396
DLL 2021-06-15 03:30:28.105985 - Iteration: 5010 FastRCNN class loss : 1403.56396
[MaskRCNN] INFO : FastRCNN total loss: 1456.7124
DLL 2021-06-15 03:30:28.106302 - Iteration: 5010 FastRCNN total loss : 1456.7124
[MaskRCNN] INFO : L2 loss: 2.05362
DLL 2021-06-15 03:30:28.106613 - Iteration: 5010 L2 loss : 2.05362
[MaskRCNN] INFO : Learning rate: 0.01
DLL 2021-06-15 03:30:28.106922 - Iteration: 5010 Learning rate : 0.01
[MaskRCNN] INFO : Mask loss: 3.75593
DLL 2021-06-15 03:30:28.107217 - Iteration: 5010 Mask loss : 3.75593
[MaskRCNN] INFO : RPN box loss: 1.09253
DLL 2021-06-15 03:30:28.107506 - Iteration: 5010 RPN box loss : 1.09253
[MaskRCNN] INFO : RPN score loss: 79.7173
DLL 2021-06-15 03:30:28.107788 - Iteration: 5010 RPN score loss : 79.7173
[MaskRCNN] INFO : RPN total loss: 80.80984
DLL 2021-06-15 03:30:28.108069 - Iteration: 5010 RPN total loss : 80.80984
[MaskRCNN] INFO : Total loss: 1543.33179
DLL 2021-06-15 03:30:28.108349 - Iteration: 5010 Total loss : 1543.33179
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 196, in <module>
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 192, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 91, in run_executer
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 394, in train_and_eval
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1426, in run
run_metadata=run_metadata))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/basic_session_run_hooks.py", line 761, in after_run
raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
[MaskRCNN] INFO : Training Performance Summary
[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
DLL 2021-06-15 03:30:33.191428 - : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
DLL 2021-06-15 03:30:33.191556 - : Training Performance Summary
DLL 2021-06-15 03:30:33.191611 - : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
DLL 2021-06-15 03:30:33.191688 - Average_throughput : 2.1 samples/sec
DLL 2021-06-15 03:30:33.191740 - Total processed steps : 5012
DLL 2021-06-15 03:30:33.191797 - Total_processing_time : 0h 00m 00s
[MaskRCNN] INFO : Average throughput: 2.1 samples/sec
[MaskRCNN] INFO : Total processed steps: 5012
[MaskRCNN] INFO : Total processing time: 0h 00m 00s
DLL 2021-06-15 03:30:33.192110 - : ==================== Metrics ====================
[MaskRCNN] INFO : ==================== Metrics ====================
[MaskRCNN] INFO : FastRCNN box loss: 53.14843
DLL 2021-06-15 03:30:33.192554 - FastRCNN box loss : 53.14843
[MaskRCNN] INFO : FastRCNN class loss: 1403.56396
DLL 2021-06-15 03:30:33.192733 - FastRCNN class loss : 1403.56396
[MaskRCNN] INFO : FastRCNN total loss: 1456.7124
DLL 2021-06-15 03:30:33.192922 - FastRCNN total loss : 1456.7124
[MaskRCNN] INFO : L2 loss: 2.05362
DLL 2021-06-15 03:30:33.193088 - L2 loss : 2.05362
[MaskRCNN] INFO : Learning rate: 0.01
DLL 2021-06-15 03:30:33.193252 - Learning rate : 0.01
[MaskRCNN] INFO : Mask loss: 3.75593
DLL 2021-06-15 03:30:33.193422 - Mask loss : 3.75593
[MaskRCNN] INFO : RPN box loss: 1.09253
DLL 2021-06-15 03:30:33.193582 - RPN box loss : 1.09253
[MaskRCNN] INFO : RPN score loss: 79.7173
DLL 2021-06-15 03:30:33.193756 - RPN score loss : 79.7173
[MaskRCNN] INFO : RPN total loss: 80.80984
DLL 2021-06-15 03:30:33.193930 - RPN total loss : 80.80984
[MaskRCNN] INFO : Total loss: 1543.33179
DLL 2021-06-15 03:30:33.194088 - Total loss : 1543.33179
[MaskRCNN] ERROR : Job finished with an uncaught exception: `FAILURE`
Traceback (most recent call last):
File "/usr/local/bin/mask_rcnn", line 8, in <module>
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/entrypoint/mask_rcnn.py", line 12, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-06-15 11:30:38,341 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
That is a NaN error. Let us focus on the mismatching error first.
Please continue to test other tfrecords files.
I've tested all the tfrecords. The following tfrecords have the mismatching issue on my side:
train-0013*-of-00256.tfrecord
train-0022*-of-00256.tfrecord
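In case it helps narrow things down further, here is a hedged follow-up sketch that checks, record by record in the shards listed above, whether every encoded instance mask decodes to the same height and width as its image. It again assumes PNG-encoded masks under image/object/mask as in the standard converter, so adjust the keys if your generation script stores them differently; the directory is a placeholder.

import glob
import io

import numpy as np
from PIL import Image
import tensorflow as tf

# Compare each decoded instance mask against the stored image size, record by record.
# Assumes PNG-encoded masks under "image/object/mask" plus the usual "image/height"
# and "image/width" keys; adjust if your converter differs.
patterns = ["tfrecords/train-0013*-of-00256.tfrecord",
            "tfrecords/train-0022*-of-00256.tfrecord"]
for pattern in patterns:
    for path in sorted(glob.glob(pattern)):
        for i, record in enumerate(tf.compat.v1.io.tf_record_iterator(path)):
            example = tf.train.Example()
            example.ParseFromString(record)
            feat = example.features.feature
            height = feat["image/height"].int64_list.value[0]
            width = feat["image/width"].int64_list.value[0]
            for j, png in enumerate(feat["image/object/mask"].bytes_list.value):
                mask = np.array(Image.open(io.BytesIO(png)))
                if mask.shape[:2] != (height, width):
                    print(f"{path} record {i} mask {j}: mask {mask.shape[:2]} "
                          f"vs image ({height}, {width})")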