ValueError: Total size of new array must be unchanged for box_head/class-predict/kernel lh_shape: [(1024, 1)], rh_shape: [(1024, 2)]

Hi,
I am getting the error below when I run the “TLT MaskRCNN example usecase” on a custom dataset. My dataset consists of 545x800 images with polygon annotations. Interestingly, when I run tlt-train on the COCO dataset there is no error and training runs smoothly on two GPUs, but pointing it at the custom train.record data triggers the error.

For multi-GPU, change --gpus based on your machine.
2021-03-12 19:57:46.485641: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:57:46.502109: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[MaskRCNN] INFO : Loading weights from /workspace/server/tlt-experiments/maskrcnn/experiment_dir_unpruned/model.step-0.tlt
[MaskRCNN] INFO : Loading weights from /workspace/server/tlt-experiments/maskrcnn/experiment_dir_unpruned/model.step-0.tlt
[MaskRCNN] INFO : Horovod successfully initialized …
[MaskRCNN] INFO : Create EncryptCheckpointSaverHook.

[MaskRCNN] INFO : =================================
[MaskRCNN] INFO : Start training cycle 01
[MaskRCNN] INFO : =================================

[MaskRCNN] INFO : Using Dataset Sharding with Horovod
2021-03-12 19:57:57.070815: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-03-12 19:57:57.129558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0a:00.0
2021-03-12 19:57:57.129606: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:57:57.131500: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-12 19:57:57.132423: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-12 19:57:57.133049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-12 19:57:57.135068: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-12 19:57:57.136282: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-12 19:57:57.143905: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-12 19:57:57.147315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-03-12 19:57:57.324039: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-03-12 19:57:57.354097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:41:00.0
2021-03-12 19:57:57.354165: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:57:57.355455: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-12 19:57:57.356925: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-12 19:57:57.357271: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-12 19:57:57.359074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-12 19:57:57.360853: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-12 19:57:57.365959: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-12 19:57:57.370115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_2/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_3/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_4/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_5/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_6/
2021-03-12 19:58:00.626653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0a:00.0
2021-03-12 19:58:00.626735: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:58:00.626825: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-12 19:58:00.626867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-12 19:58:00.626906: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-12 19:58:00.626943: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-12 19:58:00.626981: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-12 19:58:00.627018: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-12 19:58:00.630744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-03-12 19:58:00.630799: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:58:01.020046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:41:00.0
2021-03-12 19:58:01.020159: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:58:01.020306: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-12 19:58:01.020345: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-12 19:58:01.020381: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-12 19:58:01.020414: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-12 19:58:01.020447: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-12 19:58:01.020480: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-12 19:58:01.024365: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-03-12 19:58:01.024439: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:58:01.063165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-12 19:58:01.063231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-03-12 19:58:01.063242: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-03-12 19:58:01.070051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22514 MB memory) → physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:0a:00.0, compute capability: 7.5)
2021-03-12 19:58:01.425943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-12 19:58:01.426022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-03-12 19:58:01.426030: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-03-12 19:58:01.429791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21801 MB memory) → physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:41:00.0, compute capability: 7.5)
Parsing Inputs…
[MaskRCNN] INFO : [Training Compute Statistics] 372.7 GFLOPS/image
Using TensorFlow backend.
4 ops no flops stats due to incomplete shapes.

Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 8, in <module>
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 58, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py”, line 187, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py”, line 90, in run_executer
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/distributed_executer.py”, line 393, in train_and_eval
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1195, in _train_model_default
saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1490, in _train_with_estimator_spec
log_step_count_steps=log_step_count_steps) as mon_sess:
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 584, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1014, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 713, in __init__
h.begin()
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/hooks/pretrained_restore_hook.py”, line 209, in begin
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/hooks/pretrained_restore_hook.py”, line 113, in assign_from_checkpoint
ValueError: Total size of new array must be unchanged for box_head/class-predict/kernel lh_shape: [(1024, 1)], rh_shape: [(1024, 2)]
[MaskRCNN] ERROR : Job finished with an uncaught exception: FAILURE
2021-03-12 19:58:08.250839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:41:00.0
2021-03-12 19:58:08.250934: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:58:08.251019: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-12 19:58:08.251035: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-12 19:58:08.251049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-12 19:58:08.251061: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-12 19:58:08.251072: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-12 19:58:08.251085: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-12 19:58:08.252041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-03-12 19:58:08.252090: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-12 19:58:08.252097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-03-12 19:58:08.252102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-03-12 19:58:08.253112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21801 MB memory) → physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:41:00.0, compute capability: 7.5)

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[3548,1],0]
Exit code: 1

Can you share more info about your own dataset?
How did you generate the train.record, etc?

I converted the mask annotations to COCO JSON format using pycococreator, then used the script at this link to convert them to TFRecords. I couldn’t use ‘create_coco_tf_record.py’ from the “TLT MaskRCNN example usecase” because it asks for a caption_annotation_file when converting COCO JSON to TFRecords. The annotations are all masks (polygons), and there are no captions.

Can you try following the COCO format when generating your own dataset?
TLT MaskRCNN only accepts the COCO detection label format.
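
As a rough illustration of what "COCO detection label format" implies for the JSON file, here is a minimal sanity check. It is only a sketch: `check_coco_detection` and the `sample` data are hypothetical names made up for this example, not part of TLT or pycococreator, and the key lists cover only the commonly used COCO detection fields.

```python
# Hypothetical helper: checks that a COCO-style dict has the usual
# detection/segmentation fields. Not part of TLT.
REQUIRED_TOP_LEVEL = ("images", "annotations", "categories")
REQUIRED_ANNOTATION = ("image_id", "category_id", "bbox", "segmentation")


def check_coco_detection(coco):
    """Return a list of problems found; an empty list means none were spotted."""
    problems = [f"missing top-level key: {k}" for k in REQUIRED_TOP_LEVEL if k not in coco]
    if problems:
        return problems

    image_ids = {image["id"] for image in coco["images"]}
    category_ids = {category["id"] for category in coco["categories"]}

    for index, ann in enumerate(coco["annotations"]):
        missing = [k for k in REQUIRED_ANNOTATION if k not in ann]
        if missing:
            problems.append(f"annotation {index}: missing {', '.join(missing)}")
            continue
        if ann["image_id"] not in image_ids:
            problems.append(f"annotation {index}: unknown image_id {ann['image_id']}")
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {index}: unknown category_id {ann['category_id']}")
    return problems


# Made-up sample with one image, one category and one polygon annotation.
sample = {
    "images": [{"id": 1, "file_name": "0001.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "object"}],
    "annotations": [{
        "id": 1, "image_id": 1, "category_id": 1,
        "bbox": [10.0, 20.0, 100.0, 50.0],  # [x, y, width, height]
        "segmentation": [[10, 20, 110, 20, 110, 70, 10, 70]],  # polygon
        "area": 5000.0, "iscrowd": 0,
    }],
}
print(check_coco_detection(sample))
```

To run it on a real file, load the annotations with `json.load` and pass the resulting dict to the function. A clean result only means the basic structure looks right; it does not guarantee that the TFRecord conversion will succeed.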

#### MaskRCNN

  • Input size : C * W * H (where C = 3, W >= 128, H >= 128, and W and H are multiples of 32)
  • Image format : JPG
  • Label format : COCO detection
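
The input size rule above can be checked with a few lines of arithmetic. This is a sketch only: both helper names are hypothetical, and the thread does not say whether TLT resizes or pads images to the configured size, so treat the rounded values as candidate settings to verify against the documentation.

```python
def is_valid_input_dim(value, minimum=128, multiple=32):
    """True if a width or height meets the stated rule (>= 128, multiple of 32)."""
    return value >= minimum and value % multiple == 0


def round_up_to_multiple(value, multiple=32, minimum=128):
    """Smallest dimension >= value that satisfies the stated rule."""
    rounded = -(-value // multiple) * multiple  # ceiling division, then scale back
    return max(rounded, minimum)


# The two image dimensions mentioned in the original post.
for dim in (545, 800):
    print(dim, is_valid_input_dim(dim), round_up_to_multiple(dim))
```

Of the two dimensions mentioned in the question, 800 already satisfies the rule, while 545 does not and rounds up to 576.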

I have solved the problem, and the model now trains without any issue.
For anyone who gets the same error: set the number of classes in the config file equal to the number of classes in your dataset, without counting background as an extra class. I was confused because in the Mask RCNN repository, if we have for example two classes, we need to set num_classes = 3 (number of classes + background), but apparently not in TLT! I hope the documentation for the config files will be updated.
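
To make the mismatch in the error message concrete, here is a small sketch. It assumes the last dimension of the box_head/class-predict/kernel equals the class count used to build the model, which is what the shapes (1024, 1) and (1024, 2) in the error suggest. All three helpers are hypothetical and not part of TLT, and which of the two shapes came from the checkpoint is not stated in the log.

```python
def count_categories(coco):
    """Number of distinct category ids in a COCO-style annotation dict."""
    return len({category["id"] for category in coco["categories"]})


def class_kernel_shape(num_classes, hidden_units=1024):
    """Assumed shape of the class-prediction kernel for a given class count."""
    return (hidden_units, num_classes)


def shapes_compatible(config_num_classes, checkpoint_kernel_shape):
    """True if the configured class count matches a checkpoint kernel shape."""
    hidden_units = checkpoint_kernel_shape[0]
    return class_kernel_shape(config_num_classes, hidden_units) == tuple(checkpoint_kernel_shape)


# Made-up dataset with two categories.
coco = {"categories": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]}
num_classes = count_categories(coco)

# The two shapes reported in the error message.
print(shapes_compatible(num_classes, (1024, 2)))  # class counts agree
print(shapes_compatible(num_classes, (1024, 1)))  # class counts disagree
```

Whether the value that goes into the config file should include a background class is exactly the point of confusion in this thread, so check the computed count against the current TLT documentation before relying on it.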

Thanks a lot! I will sync with the internal team to describe num_classes in more detail on the MaskRCNN page of the Transfer Learning Toolkit 3.0 documentation.