This is a trend I have observed while training with the MS COCO dataset. The dataset was filtered to include only the “truck”, “bus” and “person” classes, and TFRecords for training and validation were then generated with the tao command-line tool.

Since I was trying out training for the first time with this model, I initially planned to run training with the default config file in the documentation (MaskRCNN - NVIDIA Docs), but this failed with an error telling me the training loss had gone to NaN in the very first iteration. The training configs I have attached therefore use much lower learning-rate values than the ones mentioned in that documentation. With the config ending in v6, the training loss was still jumping around a lot, so I reduced the values again by a factor of 10; you will find this update in the config ending in v7, after which the loss stopped jumping everywhere.

In both of the above trainings I have noticed that the loss value doesn't come down and stays around 3. When I run inference on the validation dataset with the final model file, there aren't any detections or segmentations in the output for any of the validation images, and the computed AP values are near 0. Since the loss is not going down, I am also unable to select the right model to test, and that is how I know the training is not happening properly. What could be the reason for this behaviour?
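For context, the class filtering was done with a small script roughly along these lines before generating the TFRecords (a simplified sketch, not my exact code; the file paths are placeholders for my local setup):

```python
import json

KEEP = {"truck", "bus", "person"}

# Placeholder path to the original COCO annotations.
with open("annotations/instances_train2017.json") as f:
    coco = json.load(f)

# Category ids of the three classes I want to keep.
keep_ids = {c["id"] for c in coco["categories"] if c["name"] in KEEP}

# Keep only annotations of those categories, then only images that still
# have at least one annotation left.
anns = [a for a in coco["annotations"] if a["category_id"] in keep_ids]
img_ids = {a["image_id"] for a in anns}
imgs = [i for i in coco["images"] if i["id"] in img_ids]

filtered = {
    "images": imgs,
    "annotations": anns,
    "categories": [c for c in coco["categories"] if c["id"] in keep_ids],
}

with open("annotations/instances_train2017_filtered.json", "w") as f:
    json.dump(filtered, f)
```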
Thanks
I am not running training on 2 GPUs but on a single GPU. How will the specs change in that case? Can you also explain how the training config changes between running training on a single GPU vs. multiple GPUs? Also, how would I go about changing the step values mentioned in multiple parts of the config file, given that my dataset has 16,270 images compared to 118,287 in the original COCO dataset?
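To make the question concrete, here is the rough arithmetic I am assuming for scaling the step-related values; the batch size and the default total steps below are just illustrative numbers, not values I am taking from the docs, so please correct me if this reasoning is wrong:

```python
# Rough sketch of how I think the step values should scale for my dataset.
num_images = 16_270      # my filtered dataset
coco_images = 118_287    # full COCO train set the default spec is tuned for
batch_size = 2           # per-GPU batch size, assumed for illustration
num_gpus = 1             # I am training on a single GPU

# Steps needed for one pass over the data with this setup.
global_batch = batch_size * num_gpus
steps_per_epoch = num_images // global_batch   # ~8135 steps per epoch

# If the default schedule is tuned for full COCO, I assume every step-based
# value (total_steps, learning_rate_steps, num_steps_per_eval, warmup_steps)
# should scale by roughly the same ratio.
scale = num_images / coco_images               # ~0.14

def scale_steps(default_value: int) -> int:
    """Scale a step count from the default COCO spec to my smaller dataset."""
    return max(1, round(default_value * scale))

# Example with a hypothetical default of 120000 total steps -> ~16500.
print(steps_per_epoch, round(scale, 3), scale_steps(120_000))
```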
I ran the training with the above changes and the loss was going down, but at one step there was a sudden increase in the loss value (fast_box_loss jumped from 1.3 to 214.06), and at a later step the training ended with a message saying the training loss had gone to NaN. Look at the following screenshot:
The training exited while running the second epoch, so the model had already gone through the dataset once, since one epoch had already been completed. This should mean that the issue is not with the dataset; is my understanding correct? If so, how can we debug this?
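Even so, to rule out the dataset completely, one sanity check I am planning to run on my filtered annotations is something like the following (a rough sketch assuming standard COCO-format JSON; the file name is a placeholder):

```python
import json

# Placeholder path to my filtered training annotations.
with open("annotations/instances_train2017_filtered.json") as f:
    coco = json.load(f)

bad = []
for ann in coco["annotations"]:
    x, y, w, h = ann["bbox"]
    # Flag degenerate boxes (zero/negative size), zero areas and empty
    # segmentations, since these are a common cause of sudden loss spikes
    # and NaNs in detection training.
    if w <= 0 or h <= 0 or ann.get("area", 0) <= 0 or not ann.get("segmentation"):
        bad.append(ann["id"])

print(f"{len(bad)} suspicious annotations out of {len(coco['annotations'])}")
```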