Low accuracy for MS COCO dataset in tao maskrcnn model training

• Hardware: A5000
• Network Type: Mask_rcnn
• TLT Version: nvidia/tao/tao-toolkit-tf: v3.22.05-tf1.15.5-py3
• Training spec file:
tao_maskrcnn_02_09_24_train_v6.txt (2.4 KB)
tao_maskrcnn_02_09_24_train_v7.txt (2.4 KB)

This is a trend I have observed while training with the MS COCO dataset. The dataset was filtered to include only the “truck”, “bus” and “person” classes, and TFRecords were then generated for training and validation with the tao command-line tool. Since this was my first time training this model, I initially planned to run training with the default config file from the documentation (MaskRCNN - NVIDIA Docs), but this failed with an error telling me the training loss had gone to NaN in the very first iteration.

The training configs I have attached therefore use much lower learning-rate values than those mentioned in the documentation. With the config ending in v6 the training loss was jumping around a lot, so I reduced the learning rates again by a factor of 10; you will find this update in the config ending in v7, and with it the loss values stopped jumping everywhere (see the sketch of the fields involved at the end of this post).

In both of the above trainings I have noticed that the loss value doesn't come down and stays around 3. When running inference on the validation dataset with the final model file, there aren't any detections or segmentations in the output for any of the validation images, and the computed AP values are near 0. Since the loss is not decreasing, I am also unable to select the right model to test, which tells me that training is not happening properly. What could be the reason for this behaviour?
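For reference, the learning-rate fields I scaled down look roughly like the sketch below; the numbers are only representative and are not the exact values from the attached v6/v7 specs:

  # representative values only, not the exact contents of the attached v6/v7 specs
  init_learning_rate: 0.001        # lowered from the documentation default, then divided by 10 again for v7
  warmup_learning_rate: 0.0001     # scaled down together with init_learning_rate
  warmup_steps: 1000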
Thanks

Please refer to the spec in Poor metric results after retraining maskrcnn using TLT notebook - #17 by Morganh.

I am not running training on 2 GPUs but on a single GPU. How should the spec be changed, and can you also explain how the training config changes between running training on a single GPU versus multiple GPUs? Also, how would I go about changing the step values mentioned in multiple parts of the config file, since the number of images in my dataset is 16,270 compared to 118,287 in the original COCO dataset?

You can refer to Poor metric results after retraining maskrcnn using TLT notebook - #6 by Morganh.

You can use the same spec file.

num_examples_per_epoch will be the number of images in my dataset, right?

I ran the training with the above changes and the loss was going down, but then there was a sudden increase in the loss value at one step (fast_box_loss jumped from 1.3 to 214.06), and at a later step the training ended with the "training loss has gone to NaN" message. See the following screenshot.

It is the total number of images in the training set divided by the number of GPUs. Please refer to MaskRCNN - NVIDIA Docs.
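For example, with the 16,270 filtered training images on a single GPU, the related step fields scale roughly as in the sketch below; the batch size and epoch count are assumptions for illustration, not values taken from the attached specs:

  num_examples_per_epoch: 16270    # 16270 training images / 1 GPU
  train_batch_size: 2              # assumed batch size for this sketch
  # roughly 16270 / 2 = 8135 steps per epoch on one GPU
  num_steps_per_eval: 8135         # evaluate about once per epoch
  total_steps: 81350               # about 10 epochs with the numbers above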

I think you are now running your own dataset instead of the original COCO dataset. I suggest you do more experiments to narrow this down.

  1. Check whether the labels are correct.
  2. Train with different portions of your current dataset, for example 1/10, 1/5, 1/3, or 1/2 of it, to check whether the NaN issue still occurs.

As mentioned in the question and in my subsequent replies, my dataset is a derivative of the MS COCO dataset, filtered to include only 3 classes. I have already verified my dataset labels, since there was a prior issue with training ending in NaN loss; here is the link to that thread: Tao mask_rcnn training exits with NaN loss - #4 by Morganh. In order to fix the NaN loss, I brought down the learning rate: Tao mask_rcnn training exits with NaN loss - #11 by adithya.ajith

The training exited while running the second epoch, so the model had already gone through the dataset once. Is my understanding correct that this means the issue is not with the dataset? If so, how can we debug this?

Training with a smaller portion of the dataset is meant to narrow down the issue.
One potential issue in your spec file is that you set num_classes: 3, but you mentioned that your dataset is a derivative of the MS COCO dataset filtered to include only 3 classes. In this case, please set num_classes: 4.
See https://docs.nvidia.com/tao/tao-toolkit-archive/tao-30-2205/text/instance_segmentation/mask_rcnn.html#data-config:

The number of classes. If there are N categories in the annotation, num_classes should be N+1 (background class)
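So with the three categories truck, bus and person, the data_config block should look something like the sketch below; the file paths are placeholders for your own setup:

  data_config {
      image_size: "(832, 1344)"
      training_file_pattern: "/workspace/tao-experiments/data/train*.tfrecord"
      validation_file_pattern: "/workspace/tao-experiments/data/val*.tfrecord"
      val_json_file: "/workspace/tao-experiments/data/annotations/instances_val_filtered.json"
      num_classes: 4    # truck, bus, person + 1 for the background class
  }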

num_classes was the issue. After the fix, the loss started reducing and the inference results look good.
