ValueError: Total size of new array must be unchanged for box_head/class-predict/kernel lh_shape: [(1024, 1)], rh_shape: [(1024, 2)]

mohsen.zardadi · March 12, 2021, 11:30pm

Hi,
I am getting the below error when I run “TLT MaskRCNN example usecase” on a custom-made dataset. My dataset includes images with size 545x800 and polygon annotations. Interestingly when I run tlt-train on the COCO dataset, there is no error and it runs smoothly with two gpus but pointing to the custom-made train.record data will create the issue.

For multi-GPU, change --gpus based on your machine.
2021-03-12 19:57:46.485641: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:57:46.502109: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[MaskRCNN] INFO : Loading weights from /workspace/server/tlt-experiments/maskrcnn/experiment_dir_unpruned/model.step-0.tlt
[MaskRCNN] INFO : Loading weights from /workspace/server/tlt-experiments/maskrcnn/experiment_dir_unpruned/model.step-0.tlt
[MaskRCNN] INFO : Horovod successfully initialized …
[MaskRCNN] INFO : Create EncryptCheckpointSaverHook.

[MaskRCNN] INFO : =================================
[MaskRCNN] INFO : Start training cycle 01
[MaskRCNN] INFO : =================================

[MaskRCNN] INFO : Using Dataset Sharding with Horovod
2021-03-12 19:57:57.070815: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-03-12 19:57:57.129558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0a:00.0
2021-03-12 19:57:57.129606: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:57:57.131500: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-12 19:57:57.132423: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-12 19:57:57.133049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-12 19:57:57.135068: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-12 19:57:57.136282: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-12 19:57:57.143905: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-12 19:57:57.147315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-03-12 19:57:57.324039: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-03-12 19:57:57.354097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:41:00.0
2021-03-12 19:57:57.354165: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:57:57.355455: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-12 19:57:57.356925: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-12 19:57:57.357271: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-12 19:57:57.359074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-12 19:57:57.360853: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-12 19:57:57.365959: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-12 19:57:57.370115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_2/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_3/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_4/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_5/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_6/
2021-03-12 19:58:00.626653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0a:00.0
2021-03-12 19:58:00.626735: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:58:00.626825: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-12 19:58:00.626867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-12 19:58:00.626906: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-12 19:58:00.626943: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-12 19:58:00.626981: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-12 19:58:00.627018: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-12 19:58:00.630744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-03-12 19:58:00.630799: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:58:01.020046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:41:00.0
2021-03-12 19:58:01.020159: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:58:01.020306: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-12 19:58:01.020345: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-12 19:58:01.020381: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-12 19:58:01.020414: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-12 19:58:01.020447: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-12 19:58:01.020480: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-12 19:58:01.024365: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-03-12 19:58:01.024439: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:58:01.063165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-12 19:58:01.063231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-03-12 19:58:01.063242: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-03-12 19:58:01.070051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22514 MB memory) → physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:0a:00.0, compute capability: 7.5)
2021-03-12 19:58:01.425943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-12 19:58:01.426022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-03-12 19:58:01.426030: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-03-12 19:58:01.429791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21801 MB memory) → physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:41:00.0, compute capability: 7.5)
Parsing Inputs…
[MaskRCNN] INFO : [Training Compute Statistics] 372.7 GFLOPS/image
Using TensorFlow backend.
4 ops no flops stats due to incomplete shapes.

Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 58, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py”, line 187, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py”, line 90, in run_executer
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/distributed_executer.py”, line 393, in train_and_eval
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1195, in _train_model_default
saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1490, in _train_with_estimator_spec
log_step_count_steps=log_step_count_steps) as mon_sess:
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 584, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1014, in init
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 713, in init
h.begin()
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/hooks/pretrained_restore_hook.py”, line 209, in begin
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/hooks/pretrained_restore_hook.py”, line 113, in assign_from_checkpoint
ValueError: Total size of new array must be unchanged for box_head/class-predict/kernel lh_shape: [(1024, 1)], rh_shape: [(1024, 2)]
[MaskRCNN] ERROR : Job finished with an uncaught exception: FAILURE
2021-03-12 19:58:08.250839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:41:00.0
2021-03-12 19:58:08.250934: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:58:08.251019: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-12 19:58:08.251035: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-12 19:58:08.251049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-12 19:58:08.251061: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-12 19:58:08.251072: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-12 19:58:08.251085: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-12 19:58:08.252041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-03-12 19:58:08.252090: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-12 19:58:08.252097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-03-12 19:58:08.252102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-03-12 19:58:08.253112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device

(/job:localhost/replica:0/task:0/device:GPU:0 with 21801 MB memory) → physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:41:00.0, compute capability: 7.5)

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[3548,1],0]
Exit code: 1

Morganh · March 13, 2021, 5:29am

Can you share more info about your own dataset?
How did you generate the train.record, etc?

mohsen.zardadi · March 13, 2021, 7:40am

I converted the mask annotations into coco json format using pycococreator and then using this link to convert into tfrecords. I couldn’t use ‘create_coco_tf_record.py’ from “TLT MaskRCNN example usecase”, because it asks for caption_annotation_file to convert coco json to TFRecords. The annotations are all masks (polygons) and there is no caption.

Morganh · March 15, 2021, 6:38am

Can you try to follow COCO format to generate your own dataset?
TLT MaskRCNN only accepts COCO detection label format.

#### MaskRCNN

Input size : C * W * H (where C = 3, W > =128, H >=128 and W, H are multiples of 32)

Image format : JPG

Label format : COCO detection

mohsen.zardadi · April 2, 2021, 5:03am

I have solved the problem and the model is training without any issue.
for people who get the same error, you need to update the number of classes in the config file equal to the number of classes in your dataset without considering background as another class. I was confused because in the Mask RCNN repository if we have for example two classes, we need to add num_classes = 3 (number of classes + background) but apparently not in tlt! I hope documentation for config files will be updated.

Morganh · April 2, 2021, 3:41pm

Thanks a lot! I will sync with internal team to describe more details about num_classes in MaskRCNN — Transfer Learning Toolkit 3.0 documentation

Morganh · April 23, 2021, 7:53am

Please note that for the number of classes, if there are N categories in the annotation, num_classes should be N+1 (background class).

Topic		Replies	Views
Requested more than 0 entries, but params is empty. Params shape: [0,1920,1080] TAO Toolkit	23	1839	March 1, 2022
Permission denied: 'mrcnn_log.json' while converting data into tfrecords TAO Toolkit	9	902	August 16, 2022
Mask-RCCNN fails on custom dataset TAO Toolkit tensorflow	4	1096	March 4, 2023
TLT train maskrcnn model with Mapillary Vistas Dataset failed on CUDA_ERROR_OUT_OF_MEMORY: out of memory TAO Toolkit cuda	86	3162	September 21, 2021
Mask R-CNN hangs during training using custom made tfrecords TensorRT	3	454	April 13, 2021
Mask R-CNN hangs during training using custom made tfrecords TAO Toolkit	31	3485	October 12, 2021
Maskrcnn pruning not working TAO Toolkit	5	73	July 30, 2024
Train mask-rcnn failure TAO Toolkit tao	16	1182	November 25, 2021
Train with my own tlt model TAO Toolkit	14	713	December 13, 2021
Mask-RCNN int8 Version Results in Poor Performance TAO Toolkit	37	1008	July 6, 2022

ValueError: Total size of new array must be unchanged for box_head/class-predict/kernel lh_shape: [(1024, 1)], rh_shape: [(1024, 2)]

(/job:localhost/replica:0/task:0/device:GPU:0 with 21801 MB memory) → physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:41:00.0, compute capability: 7.5)

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

Related topics

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.