MaskRCNN Input to reshape is a tensor with 3135248 values, but the requested shape has 2691200

Hello,

When running TAO training, modified from the sample MaskRCNN Jupyter notebook, I get the following error:
Input to reshape is a tensor with 3135248 values, but the requested shape has 2691200

My dataset has images of different sizes; could that be the issue?
Most images are: 1280x720
Some are: 960x720

Thank you.

• Hardware (T4/V100/Xavier/Nano/etc) AWS g4 Large
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) MaskRCNN Resnet50
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
maskrcnn_train_resnet50_lcv.txt (2.0 KB)

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

Please set a higher max_num_instances in the spec file.
Refer to MaskRCNN - NVIDIA Docs and Training doesn't converge for Mapillary Vistas Dataset training with MaskRCNN - #46 by edit_or
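A minimal sketch of where this field goes, assuming the spec follows the data_config structure shown on the MaskRCNN docs page linked above (the value is only an example; set it to at least the maximum number of annotated instances in any single training image):

data_config {
  # ... other data_config fields unchanged ...
  max_num_instances: 200
}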

Thank you for your response.

I added max_num_instances to the config and set it to 300.
The training ran 1 epoch and then gave the same error as above, but with larger values:
Input to reshape is a tensor with 45xxxxx values, but the requested shape has 40xxxxx

I then increased max_num_instances to 600. Now it simply gives the following error:

[GPU 00] Restoring pretrained weights (265 Tensors)
[MaskRCNN] INFO : Pretrained weights loaded with success…

[MaskRCNN] INFO : Saving checkpoints for 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt.
Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-04-13 09:45:22,748 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

It is running successfully.
You can ignore the following:

Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

Hi Morgan,

But the process is exiting after the above line.
The training is not running at all.

I just deleted experiment_dir_unpruned and re-ran the training.
I still got the same error; the training quits instantly.

Here is the error:

INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[GPU 00] Restoring pretrained weights (265 Tensors)
[MaskRCNN] INFO : Pretrained weights loaded with success…

[MaskRCNN] INFO : Saving checkpoints for 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt.
Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-04-13 12:56:53,268 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

To narrow this down, please open a terminal, log in to the docker container, and run it again.
Steps:
$ tao mask_rcnn run /bin/bash

Then, inside the docker container, run the command without the tao prefix:
mask_rcnn train xxx
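For example, the full command would look something like the line below; the flag names are the ones used in the sample notebook and the MaskRCNN docs linked above (verify with mask_rcnn train --help), and the spec path, results directory, and key are placeholders to replace with your own values:

# placeholder spec path, results dir, and key; adjust to your setup
mask_rcnn train -e /path/to/maskrcnn_train_resnet50_lcv.txt -d /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned -k <your_encryption_key>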

Hi, I get the same output and error inside the docker container using the above commands.

Please let me know if you want me to upload any other files.

[MaskRCNN] WARNING : Checkpoint is missing variable [conv5-mask/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/bias]
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[GPU 00] Restoring pretrained weights (265 Tensors)
[MaskRCNN] INFO : Pretrained weights loaded with success…

[MaskRCNN] INFO : Saving checkpoints for 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt.
Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

I re-ran the training, without clearing the unpruned folder, and wrote the full output to the attached log file. log.txt (17.1 KB)

I hope this helps!

Btw,
I was able to run mask_rcnn on the same VM 3 weeks ago, using a smaller dataset (about 20 images)

I am afraid it is due to running out of memory (OOM).
Can you use only part of “/workspace/tao-experiments/data/train*.tfrecord” and retry?
BTW, please use a new results folder as well.

I am cleaning out the result folder each time, so it's a fresh folder of the same name.

I currently had 256 train*.tfrecord files.
I moved 236 out of the folder, leaving only 20 train*.tfrecord files.

The process seems to have run for a few epochs and then quit again. Please see the last few lines of the output below. I feel your solution has partially helped. However,
a) The process still failed…
b) I do need to run the training on thousands of images eventually.
c) This is running on a g4dn.xlarge instance on AWS. I am assuming the specs of the machine are capable of running such training? Or do I need to do something there?

Thank you again for taking the time to help! Looking forward to your continued help on this :)

[INFO] Epoch 21/8334: loss: 3.79883 learning rate: 0.00071 Time taken: 0:00:03.604673 ETA: 8:19:25.643909

[INFO] None
[INFO] Epoch 22/8334: loss: 3.85096 learning rate: 0.00074 Time taken: 0:00:03.844755 ETA: 8:52:37.604996
[INFO] None
[INFO] Epoch 23/8334: loss: 3.89809 learning rate: 0.00077 Time taken: 0:00:03.740002 ETA: 8:38:03.155931
[INFO] Global step 70 (epoch 24/8334): total loss: 3.73200 (rpn score loss: 0.08423 rpn box loss: 0.06804 fast_rcnn class loss: 0.38038 fast_rcnn box loss: 0.53219) learning rate: 0.00078
[INFO] None
[INFO] Epoch 24/8334: loss: 3.91095 learning rate: 0.00080 Time taken: 0:00:03.629286 ETA: 8:22:39.365101
[INFO] None
[INFO] Epoch 25/8334: loss: 3.81268 learning rate: 0.00083 Time taken: 0:00:03.684081 ETA: 8:30:11.031655
Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-04-17 11:42:33,561 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Please set a lower batch size and use the resolution below.
train_batch_size: 1
eval_batch_size: 1
image_size: “(704, 1280)”
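(If I read the MaskRCNN docs correctly, the height and width have to be multiples of 2^max_level, which is 64 with the default max_level of 6: 704 = 64 × 11 and 1280 = 64 × 20, whereas the original 720 is not divisible by 64.)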

Hi,

I just ran it with these changes. It ran better, but I got another error:

[INFO] Epoch 74/4167: loss: 4.12201 learning rate: 0.00449 Time taken: 0:00:02.699245 ETA: 3:04:08.008711
[INFO] Global step 450 (epoch 76/4167): total loss: 8.75753 (rpn score loss: 0.13929 rpn box loss: 0.60771 fast_rcnn class loss: 3.62965 fast_rcnn box loss: 1.00259) learning rate: 0.00455
[INFO] None
[INFO] Epoch 75/4167: loss: 8.75753 learning rate: 0.00455 Time taken: 0:00:02.731563 ETA: 3:06:17.556170
[INFO] None
[INFO] Epoch 76/4167: loss: 6.59582 learning rate: 0.00460 Time taken: 0:00:02.937815 ETA: 3:20:18.599989
[INFO] Global step 460 (epoch 77/4167): total loss: 32.50277 (rpn score loss: 0.30818 rpn box loss: 0.38129 fast_rcnn class loss: 9.58305 fast_rcnn box loss: 19.29635) learning rate: 0.00464
[INFO] None
[INFO] Epoch 77/4167: loss: 81199.67188 learning rate: 0.00466 Time taken: 0:00:02.637393 ETA: 2:59:46.939311
ERROR:tensorflow:Model diverged with loss = NaN.
[INFO] NaN loss during training.

I guess this is another hyperparameter issue. I’ve attached the spec file again; could you guide me on which parameters to tweak? Also, I am still running this on only 20 training images. I would like to run it on all 256 images. What do you suggest?
maskrcnn_train_resnet50_lcv.txt (2.0 KB)

I think you are training on 20 tfrecords instead of 20 images, right? Normally one tfrecord file will contain multiple images. Please check further. You can simply check the total number of training images and then divide by 256 to get how many images are in each tfrecord file.
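A rough sketch of how to count them, assuming COCO-style tfrecords where each serialized record is one image; the glob pattern is a placeholder for your own data directory:

import glob
import tensorflow as tf

total = 0
for path in sorted(glob.glob("/workspace/tao-experiments/data/train*.tfrecord")):
    # Assumes one serialized record per image (COCO-style shards).
    count = sum(1 for _ in tf.compat.v1.io.tf_record_iterator(path))
    total += count
    print("{}: {} images".format(path, count))
print("Total training images: {}".format(total))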

For the NaN loss, I suggest setting a lower learning rate. For example,
init_learning_rate: 0.008

Also, for OOM, refer to MaskRCNN - NVIDIA Docs;
you can try reducing n_workers and shuffle_buffer_size.
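A minimal sketch of where those two fields would go, assuming they sit under data_config as on the docs page above (the values are only illustrative starting points, not recommendations from the docs):

data_config {
  # ... other data_config fields unchanged ...
  n_workers: 2
  shuffle_buffer_size: 64
}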

Yes. I meant 20 tfrecords! My mistake.
My dataset has only 113 images…

I dropped the init_learning_rate to 0.006.
This helped a bit, but I still got the NaN error after 130 epochs.

I then tried reducing n_workers to 4 and added back all the tfrecords to see if that helps. However, the training quit without any reason, like it had before.

I then checked all the .tfrecord file sizes.
After the 80th of the 256 train shards, the files were all empty.
Also, 5 random files were significantly larger than the rest.
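For reference, a quick sketch of how the shard sizes can be checked (the glob pattern is a placeholder for your data directory):

import glob, os

for path in sorted(glob.glob("/workspace/tao-experiments/data/train*.tfrecord")):
    size = os.path.getsize(path)
    # Zero-byte shards contain no images at all.
    print("{}: {} bytes{}".format(path, size, "  <-- EMPTY" if size == 0 else ""))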

Hence I did the following:
Removed the empty tfrecords (which should not matter)
Removed the 5 tfrecords that were significantly larger
Set init_learning_rate: 0.006
Set n_workers: 4

The training then ran successfully up until 1300 epochs, and then failed again with the error:

NaN loss during training

There seem to be 2 issues at play here:

  • Getting all the .tfrecords/ images to load and work correctly.
  • Getting the training to finish the planned 4,167 epochs without crashing with the NaN error.

Since there are only 113 images, I suggest you re-generate the tfrecord files with 18 shards instead of 256. Then try again.

Will try that, and will maintain the low learning rate and n_workers.

Thank you

Working with 18 shards caused it to quit very quickly.
init_learning_rate: 0.006
n_workers: 4
learning_rate_steps: “[10000, 15000, 20000]”
learning_rate_decay_levels: “[0.1, 0.02, 0.01]”
total_steps: 25000
train_batch_size: 1
eval_batch_size: 1

[INFO] Epoch 10/4167: loss: 4.15287 learning rate: 0.00045 Time taken: 0:00:02.600452 ETA: 3:00:10.077749
[INFO] None
[INFO] Epoch 11/4167: loss: 3.71938 learning rate: 0.00048 Time taken: 0:00:02.719708 ETA: 3:08:23.106306
[INFO] Global step 70 (epoch 12/4167): total loss: 4.26452 (rpn score loss: 0.12431 rpn box loss: 0.03694 fast_rcnn class loss: 0.46742 fast_rcnn box loss: 0.58960) learning rate: 0.00051
[INFO] None
[INFO] Epoch 12/4167: loss: 3.92960 learning rate: 0.00052 Time taken: 0:00:02.913011 ETA: 3:21:43.562003
[INFO] None
[INFO] Epoch 13/4167: loss: 3.86232 learning rate: 0.00055 Time taken: 0:00:02.749284 ETA: 3:10:20.523876
Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-04-18 08:17:47,274 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Can you change
num_examples_per_epoch: 6

to
num_examples_per_epoch: 113

You can ignore the above warning info. As mentioned above, please log in to the docker container and run the command there.
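For reference, the epoch count shown in the log should be roughly total_steps divided by the steps per epoch (num_examples_per_epoch / train_batch_size): with total_steps: 25000, train_batch_size: 1 and num_examples_per_epoch: 6, that works out to 25000 / 6 ≈ 4167 epochs, matching your log. With num_examples_per_epoch: 113, each epoch actually covers the whole dataset and the reported epoch count drops accordingly.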

I’ve logged in to the Docker container and changed num_examples_per_epoch.

[MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/bias]
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[GPU 00] Restoring pretrained weights (265 Tensors)
[MaskRCNN] INFO : Pretrained weights loaded with success…

[MaskRCNN] INFO : Saving checkpoints for 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_18shards/model.step-0.tlt.
[INFO] Global step 10 (epoch 1/298): total loss: 3.99603 (rpn score loss: 0.66408 rpn box loss: 0.05340 fast_rcnn class loss: 0.08196 fast_rcnn box loss: 0.00064) learning rate: 0.00015
[INFO] Global step 20 (epoch 1/298): total loss: 4.05946 (rpn score loss: 0.45428 rpn box loss: 0.04296 fast_rcnn class loss: 0.24224 fast_rcnn box loss: 0.35610) learning rate: 0.00021
[INFO] Global step 30 (epoch 1/298): total loss: 4.11123 (rpn score loss: 0.31507 rpn box loss: 0.05506 fast_rcnn class loss: 0.36135 fast_rcnn box loss: 0.51493) learning rate: 0.00027
[INFO] Global step 40 (epoch 1/298): total loss: 3.69822 (rpn score loss: 0.19592 rpn box loss: 0.04149 fast_rcnn class loss: 0.23611 fast_rcnn box loss: 0.39352) learning rate: 0.00033
Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
I have no name!@52d0a71ce836:/workspace$