Tao mask_rcnn training exits with NaN loss

• Hardware A5000
• Network Type Mask_rcnn
• TLT Version nvidia/tao/tao-toolkit-tf: v3.22.05-tf1.15.5-py3
• Training spec file
tao_maskrcnn_02_09_24_train.txt (2.4 KB)

• How to reproduce the issue ?

When I run the training, it exits with “NaN loss during training.” Refer to the training log for the complete console output.
train.log (19.9 KB)
I referred to the following thread, Mask R-CNN stops abruptly while training using custom coco dataset - #13 by Morganh, and removed the “images” and “annotations” entries where “segmentation” was found to be in json format. There weren’t any empty “segmentation” entries.
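For anyone repeating this check, here is a minimal sketch of how the annotation file could be scanned. It assumes a standard COCO-style layout with an "annotations" list, and it assumes that a usable "segmentation" is a non-empty list of polygons with at least three points each. The function name find_suspect_segmentations is hypothetical and not part of TAO.

```python
import json


def find_suspect_segmentations(coco):
    """Return ids of annotations whose "segmentation" is not a non-empty polygon list.

    Assumption: a usable entry is a list of polygons, each a flat list of
    at least 6 numbers (3 x,y points). Empty lists and dict-style
    entries are reported as suspect.
    """
    suspect = []
    for ann in coco.get("annotations", []):
        seg = ann.get("segmentation")
        is_polygon_list = (
            isinstance(seg, list)
            and len(seg) > 0
            and all(isinstance(poly, list) and len(poly) >= 6 for poly in seg)
        )
        if not is_polygon_list:
            suspect.append(ann["id"])
    return suspect


def load_coco(path):
    """Load a COCO-style annotation file from disk."""
    with open(path) as f:
        return json.load(f)
```

Usage would be along the lines of find_suspect_segmentations(load_coco("instances_val2017_car_truck_bus.json")), where the file name is only a placeholder for your own annotation file.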

Additionally, this is the snippet from the console output when I generated the tfrecords using the tao mask_rcnn dataset_convert command:

INFO:tensorflow:writing to output path: /workspace/tao-experiments/dataset/coco_val_2017/tfrecords_maskrcnn/instances_val2017_car_truck_bus
INFO:tensorflow:writing to output path: /workspace/tao-experiments/dataset/coco_val_2017/tfrecords_maskrcnn/instances_val2017_car_truck_bus
INFO:tensorflow:Building bounding box index.
INFO:tensorflow:Building bounding box index.
INFO:tensorflow:0 images are missing bboxes.
INFO:tensorflow:0 images are missing bboxes.
INFO:tensorflow:On image 0 of 691
INFO:tensorflow:On image 0 of 691
INFO:tensorflow:On image 100 of 691
INFO:tensorflow:On image 100 of 691
INFO:tensorflow:On image 200 of 691
INFO:tensorflow:On image 200 of 691
INFO:tensorflow:On image 300 of 691
INFO:tensorflow:On image 300 of 691
INFO:tensorflow:On image 400 of 691
INFO:tensorflow:On image 400 of 691
INFO:tensorflow:On image 500 of 691
INFO:tensorflow:On image 500 of 691
INFO:tensorflow:On image 600 of 691
INFO:tensorflow:On image 600 of 691
INFO:tensorflow:Finished writing, skipped 0 annotations.
INFO:tensorflow:Finished writing, skipped 0 annotations.

You can find the message “0 images are missing bboxes.”

I am also attaching the json for you to explore further:
instances_val2017_car_truck_bus.txt (3.7 MB)

Thanks

Please set a lower batch size and retry.
train_batch_size: 1
eval_batch_size: 1

These changes did not work; I am still getting the same error.

Make sure that:

  • the id under categories in the annotation file starts from 1.
  • in the annotations dict, the category_id starts from 1.
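One way to apply the two points above is sketched below. It assumes a standard COCO-style dict with "categories" and "annotations" lists, and it renumbers the ids to a contiguous range starting at 1, which goes slightly beyond what is stated above (contiguity is an assumption here). The function name remap_category_ids is hypothetical and not part of TAO.

```python
def remap_category_ids(coco):
    """Renumber category ids to 1..N in place and return the old -> new mapping.

    Assumption: ids are remapped in ascending order of the original id,
    and every annotation's category_id appears under "categories".
    """
    old_ids = sorted(cat["id"] for cat in coco["categories"])
    mapping = {old: new for new, old in enumerate(old_ids, start=1)}

    # Update the category definitions.
    for cat in coco["categories"]:
        cat["id"] = mapping[cat["id"]]

    # Update every annotation so it points at the renumbered category.
    for ann in coco["annotations"]:
        ann["category_id"] = mapping[ann["category_id"]]

    return mapping
```

After remapping, the dict can be written back with json.dump and the tfrecords regenerated from the updated file.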

After updating the json, the same error pops up. The updated json for reference:
instances_val2017_car_truck_bus.txt (3.7 MB)

Please delete num_epochs: 16 and then set total_steps: 250000 and retry.
Refer to tao_tutorials/notebooks/tao_launcher_starter_kit/mask_rcnn/specs/maskrcnn_train_resnet50.txt at cdbafd28fec9da67fbfc4db9288ec0805076ce29 · NVIDIA/tao_tutorials · GitHub

That doesn’t work; I get the same error.

Did you delete num_epochs: 16? I just modified my previous comment.

Yes, I deleted it.

Can you set a lower init_learning_rate, for example, 0.005?
Also, did you ever successfully run training on the COCO dataset mentioned in the notebook?

Training started when I updated init_learning_rate to a much lower value of 0.0005.