When running TAO training via a modified copy of the sample MaskRCNN Jupyter notebook, I get the following error:
Input to reshape is a tensor with 3135248 values, but the requested shape has 2691200
My dataset has images of different sizes; could that be the issue? (A quick way to tally the sizes is sketched below the template.)
Most images are 1280x720.
Some are 960x720.
Thank you.
• Hardware (T4/V100/Xavier/Nano/etc) AWS g4dn.xlarge (T4)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) MaskRCNN Resnet50
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here) maskrcnn_train_resnet50_lcv.txt (2.0 KB)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
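For reference, a quick way to confirm the size distribution (a minimal sketch; the image directory and extension are placeholders for wherever the source images live):

import collections
import glob

from PIL import Image

# Tally (width, height) pairs across the dataset.
sizes = collections.Counter()
for path in glob.glob("/workspace/tao-experiments/data/images/*.jpg"):  # placeholder path
    with Image.open(path) as im:
        sizes[im.size] += 1

for (w, h), count in sizes.most_common():
    print(f"{w}x{h}: {count} images")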
I added max_num_instances to the config and set it to 300.
The training ran for 1 epoch and then gave the same error as above, but with larger values:
Input to reshape is a tensor with 45xxxxx values, but the requested shape has 40xxxxx
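(For what it is worth, the numbers are consistent with the ground-truth tensor being padded to max_num_instances, which apparently defaulted to 200 here: 2691200 / 200 = 13456 values per instance slot, and 3135248 / 13456 = 233, so one image seems to need about 233 slots. Likewise the new requested shape matches 300 × 13456 = 4,036,800, so some image still has more than 300 instances. This is my own arithmetic, not something printed in the log.)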
Thereafter I updated the max_num_instances to 600. Now it is simply giving the following error:
[GPU 00] Restoring pretrained weights (265 Tensors)
[MaskRCNN] INFO : Pretrained weights loaded with success…
[MaskRCNN] INFO : Saving checkpoints for 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt.
Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-04-13 09:45:22,748 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
I just deleted experiment_dir_unpruned and re-ran the training.
I still got the same error; the training quits instantly.
Here is the error:
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[GPU 00] Restoring pretrained weights (265 Tensors)
[MaskRCNN] INFO : Pretrained weights loaded with success…
[MaskRCNN] INFO : Saving checkpoints for 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt.
Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-04-13 12:56:53,268 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
Hi, I get the same output and error inside the docker container when using the above commands.
Please let me know if you want me to upload any other files.
[MaskRCNN] WARNING : Checkpoint is missing variable [conv5-mask/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/bias]
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[GPU 00] Restoring pretrained weights (265 Tensors)
[MaskRCNN] INFO : Pretrained weights loaded with success…
[MaskRCNN] INFO : Saving checkpoints for 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt.
Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
I am afraid it is due to running out of memory (OOM).
Can you use only part of “/workspace/tao-experiments/data/train*.tfrecord” and retry?
BTW, please use a new result folder as well.
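For example, something along these lines would set aside most of the shards and leave only a few for a quick test (a sketch only; the holdout location and the number of shards to keep are placeholders):

import glob
import os
import shutil

data_dir = "/workspace/tao-experiments/data"               # directory from the spec's file pattern
holdout_dir = os.path.join(data_dir, "tfrecords_holdout")  # placeholder location outside that pattern
os.makedirs(holdout_dir, exist_ok=True)

shards = sorted(glob.glob(os.path.join(data_dir, "train*.tfrecord")))
keep = 20  # number of shards to keep for the quick test
for shard in shards[keep:]:
    shutil.move(shard, holdout_dir)  # moved shards no longer match train*.tfrecord

print(f"kept {keep} shards, moved {len(shards) - keep} to {holdout_dir}")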
I am cleaning out the result folder each time, so it is a fresh folder of the same name.
I currently have 256 train*.tfrecords.
I moved 236 out of the folder, leaving only 20 train*.tfrecords.
The process seems to have run for a few epochs and then quit again. Please see the last few lines of the output below. I feel your solution has partially helped. However,
a) The process still failed…
b) I do need to run the training on thousands of images eventually.
c) This is running on a g4dn.xlarge instance on AWS. I am assuming that the specs of the machine are capable of running such training? Or do I need to do something there?
Thank you again for taking the time to help! Looking forward to your continued help on this :)
I just ran it with these changes. It ran better, but got another error:
[INFO] Epoch 74/4167: loss: 4.12201 learning rate: 0.00449 Time taken: 0:00:02.699245 ETA: 3:04:08.008711
[INFO] Global step 450 (epoch 76/4167): total loss: 8.75753 (rpn score loss: 0.13929 rpn box loss: 0.60771 fast_rcnn class loss: 3.62965 fast_rcnn box loss: 1.00259) learning rate: 0.00455
[INFO] None
[INFO] Epoch 75/4167: loss: 8.75753 learning rate: 0.00455 Time taken: 0:00:02.731563 ETA: 3:06:17.556170
[INFO] None
[INFO] Epoch 76/4167: loss: 6.59582 learning rate: 0.00460 Time taken: 0:00:02.937815 ETA: 3:20:18.599989
[INFO] Global step 460 (epoch 77/4167): total loss: 32.50277 (rpn score loss: 0.30818 rpn box loss: 0.38129 fast_rcnn class loss: 9.58305 fast_rcnn box loss: 19.29635) learning rate: 0.00464
[INFO] None
[INFO] Epoch 77/4167: loss: 81199.67188 learning rate: 0.00466 Time taken: 0:00:02.637393 ETA: 2:59:46.939311
ERROR:tensorflow:Model diverged with loss = NaN.
[INFO] NaN loss during training.
I guess this is another hyperparameter issue. I've attached the spec file again; could you guide me on which parameters to tweak? Also, I am still running this on only 20 training images. I would like to run it on all 256 images. What do you suggest? maskrcnn_train_resnet50_lcv.txt (2.0 KB)
I think you are training on 20 tfrecords rather than 20 images, right? Normally one tfrecord file contains multiple images. Please check further. You can simply take the total number of training images and divide it by 256 to find how many images each tfrecord file contains.
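For example, a minimal sketch to count the records per shard (run inside TAO's TF1-based container; this assumes one record per image, which is how the converted COCO tfrecords are normally written):

import glob
import tensorflow as tf

total = 0
for shard in sorted(glob.glob("/workspace/tao-experiments/data/train*.tfrecord")):
    # tf_record_iterator is deprecated but available in both TF1 and TF2.
    count = sum(1 for _ in tf.compat.v1.io.tf_record_iterator(shard))
    print(f"{shard}: {count} records")
    total += count
print(f"total records: {total}")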
For the NaN loss, I suggest setting a lower learning rate. For example:
init_learning_rate: 0.008
Also, for OOM, refer to MaskRCNN - NVIDIA Docs; you can try reducing n_workers and shuffle_buffer_size.
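For reference, these knobs sit in the training spec roughly as below (field names as in the TAO MaskRCNN documentation; the values are placeholders for illustration, not tuned recommendations):

init_learning_rate: 0.008     # top-level training parameter

data_config {
    ...
    max_num_instances: 600    # ground truth is padded to this many instances per image
    n_workers: 8              # fewer data-loader workers -> less host memory pressure
    shuffle_buffer_size: 2048 # smaller shuffle buffer -> less host memory pressure
}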
Yes. I meant 20 tfrecords! My mistake.
My dataset has only 113 images…
I dropped init_learning_rate to 0.006.
This helped a bit, but I still got the NaN error after 130 epochs.
Thereafter I tried reducing n_workers to 4 and added back all the tfrecords to see if that helps. However, the training quit without giving any reason, like it had before.
I then listed all the .tfrecord file sizes.
After train 80/256 the remaining files were all empty.
Also, 5 random files were significantly larger than the rest of the files.
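For reference, a quick way to list the shard sizes and flag the empty ones (a sketch; the path matches the spec's file pattern):

import glob
import os

for shard in sorted(glob.glob("/workspace/tao-experiments/data/train*.tfrecord")):
    size = os.path.getsize(shard)
    print(f"{size:>12} bytes  {shard}" + ("  <-- EMPTY" if size == 0 else ""))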
Hence I did the following:
• Removed the empty tfrecords (which should not matter)
• Removed the 5 tfrecords that were significantly larger
• Set init_learning_rate: 0.006
• Set n_workers: 4
The training then ran successfully up to about 1300 epochs, and then failed again with the error:
NaN loss during training
There seem to be 2 issues at play here:
a) Getting all the .tfrecords/images to load and work correctly.
b) Getting the training to finish the planned 4,167 epochs without crashing with the NaN error.
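(Side note on the epoch count: judging from the log, the reported epoch is just the global step divided by num_steps_per_epoch; step 450 falls in epoch 76, i.e. roughly 6 steps per epoch. So the 4,167 total comes from total_steps / num_steps_per_epoch in the spec (for example, a hypothetical 25000 / 6 ≈ 4167) rather than from the number of images.)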