I always get this error when training reaches iteration 1885. How can I check which image is causing it?
(0) Invalid argument: Input to reshape is a tensor with 2879584 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[IteratorGetNext]]
(1) Invalid argument: Input to reshape is a tensor with 2879584 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[IteratorGetNext]]
[[IteratorGetNext/_3959]]
I now have 23 classes, using the v2.0 dataset. The spec file is attached.
maskrcnn_train_resnet50.txt (2.2 KB)
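A quick arithmetic check on the two element counts in the error may narrow things down. The sketch below assumes `gt_mask_size = 112` and that each cropped ground-truth mask is padded to `(gt_mask_size + 4)` pixels per side. Both are assumptions, not values read from the attached spec file. Under those assumptions, both counts divide evenly into whole masks, which would suggest one sample carries more instance masks than the reshape expects (possibly the `max_num_instances` setting).

```python
# Decompose the two element counts reported by the reshape error.
# ASSUMPTION: gt_mask_size = 112 and each cropped mask is padded by 4 pixels,
# giving (112 + 4) ** 2 values per mask. Neither value is taken from the spec file.
gt_mask_size = 112
values_per_mask = (gt_mask_size + 4) ** 2

actual_values = 2879584     # "Input to reshape is a tensor with 2879584 values"
requested_values = 2691200  # "but the requested shape has 2691200"

# divmod returns (quotient, remainder); a zero remainder means a whole number of masks.
actual_masks, actual_rem = divmod(actual_values, values_per_mask)
requested_masks, requested_rem = divmod(requested_values, values_per_mask)

print("values per mask:", values_per_mask)
print("masks in input tensor:", actual_masks, "remainder:", actual_rem)
print("masks in requested shape:", requested_masks, "remainder:", requested_rem)
```

If the assumptions hold, this reads as 214 masks arriving where 200 were requested, so the sample to look for would be an image with 214 annotated instances.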
More of the error log follows.
[MaskRCNN] INFO : RPN total loss: 0.18269
DLL 2022-06-15 04:35:52.187695 - Iteration: 1885 RPN total loss : 0.18269
[MaskRCNN] INFO : Total loss: 2.25939
DLL 2022-06-15 04:35:52.188027 - Iteration: 1885 Total loss : 2.25939
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_15633}} Input to reshape is a tensor with 2879584 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[IteratorGetNext]]
(1) Invalid argument: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_15633}} Input to reshape is a tensor with 2879584 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[IteratorGetNext]]
[[IteratorGetNext/_3959]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 222, in <module>
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 218, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 85, in run_executer
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 399, in train_and_eval
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Input to reshape is a tensor with 2879584 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[IteratorGetNext]]
(1) Invalid argument: Input to reshape is a tensor with 2879584 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[IteratorGetNext]]
[[IteratorGetNext/_3959]]
0 successful operations.
0 derived errors ignored.
[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
[MaskRCNN] INFO : Training Performance Summary
[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
DLL 2022-06-15 04:35:57.222443 - : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
DLL 2022-06-15 04:35:57.222565 - : Training Performance Summary
DLL 2022-06-15 04:35:57.222612 - : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
DLL 2022-06-15 04:35:57.222660 - Average_throughput : 7.1 samples/sec
DLL 2022-06-15 04:35:57.222697 - Total processed steps : 1890
DLL 2022-06-15 04:35:57.222745 - Total_processing_time : 0h 00m 00s
[MaskRCNN] INFO : Average throughput: 7.1 samples/sec
[MaskRCNN] INFO : Total processed steps: 1890
[MaskRCNN] INFO : Total processing time: 0h 00m 00s
DLL 2022-06-15 04:35:57.222996 - : ==================== Metrics ====================
[MaskRCNN] INFO : ==================== Metrics ====================
[MaskRCNN] INFO : FastRCNN box loss: 0.21846
DLL 2022-06-15 04:35:57.223386 - FastRCNN box loss : 0.21846
[MaskRCNN] INFO : FastRCNN class loss: 0.39607
DLL 2022-06-15 04:35:57.223522 - FastRCNN class loss : 0.39607
[MaskRCNN] INFO : FastRCNN total loss: 0.61454
DLL 2022-06-15 04:35:57.223641 - FastRCNN total loss : 0.61454
[MaskRCNN] INFO : L1 loss: 0.0000e+00
DLL 2022-06-15 04:35:57.223747 - L1 loss : 0.0000e+00
[MaskRCNN] INFO : L2 loss: 1.09823
DLL 2022-06-15 04:35:57.223883 - L2 loss : 1.09823
[MaskRCNN] INFO : Learning rate: 0.02
DLL 2022-06-15 04:35:57.224003 - Learning rate : 0.02
[MaskRCNN] INFO : Mask loss: 0.36394
DLL 2022-06-15 04:35:57.224126 - Mask loss : 0.36394
[MaskRCNN] INFO : RPN box loss: 0.06473
DLL 2022-06-15 04:35:57.224247 - RPN box loss : 0.06473
[MaskRCNN] INFO : RPN score loss: 0.11796
DLL 2022-06-15 04:35:57.224368 - RPN score loss : 0.11796
[MaskRCNN] INFO : RPN total loss: 0.18269
DLL 2022-06-15 04:35:57.224485 - RPN total loss : 0.18269
[MaskRCNN] INFO : Total loss: 2.25939
DLL 2022-06-15 04:35:57.224608 - Total loss : 2.25939
[MaskRCNN] ERROR : Job finished with an uncaught exception: `FAILURE`
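One way to look for the offending image, sketched under assumptions: the annotations are in COCO JSON format (top-level `images` and `annotations` lists), and the failure is triggered by an image with more annotated instances than the configured limit. The function names and the default limit of 200 below are hypothetical and should be replaced with the value from the spec file.

```python
import json
from collections import Counter


def find_crowded_images(coco, max_num_instances=200):
    """Return (file_name, count) pairs for images with more annotations than the limit.

    `coco` is a dict in COCO annotation format. The default limit of 200 is an
    assumption; use the value configured in the training spec instead.
    """
    # Count annotations per image id.
    counts = Counter(ann["image_id"] for ann in coco.get("annotations", []))
    # Map image ids to file names so the report is readable.
    names = {img["id"]: img.get("file_name", str(img["id"]))
             for img in coco.get("images", [])}
    flagged = [(names.get(image_id, str(image_id)), n)
               for image_id, n in counts.items() if n > max_num_instances]
    # Most crowded images first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


def find_crowded_images_in_file(path, max_num_instances=200):
    """Load a COCO annotation JSON file from `path` and run the same check."""
    with open(path, "r", encoding="utf-8") as f:
        return find_crowded_images(json.load(f), max_num_instances)


# Tiny in-memory example: image 1 has 3 annotations, image 2 has 1.
demo = {
    "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
    "annotations": [{"image_id": 1}] * 3 + [{"image_id": 2}],
}
print(find_crowded_images(demo, max_num_instances=2))  # [('a.jpg', 3)]
```

On a real dataset, `find_crowded_images_in_file` would be called with the path to the training annotation file. Any image it reports could then be removed, have its annotations reduced, or be accommodated by raising the limit in the spec, depending on which fits the dataset. Note that the count here includes every annotation; if the data conversion step filters some annotations out, the numbers seen at training time may be lower.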