This is a trend I have observed while training with the MS COCO dataset. The dataset was filtered to include only the “truck”, “bus” and “person” classes, and TFRecords for training and validation were then generated with the tao command-line tool.

Since I was trying out training for the first time with this model, I initially planned to run training with the default config file in the documentation (MaskRCNN - NVIDIA Docs), but this failed with an error telling me the training loss had gone to NaN in the very first iteration. The training configs I have attached therefore use much lower learning-rate values than the ones mentioned in that documentation. With the config ending in v6, the training loss was still jumping around a lot, so I reduced the values again by a factor of 10; you will find this update in the config ending in v7, after which the loss stopped jumping everywhere.

In both of the above trainings I have noticed that the loss value doesn't come down and stays around 3. When I run inference on the validation dataset with the final model file, there aren't any detections or segmentations in the output for any of the validation images, and the computed AP values are near 0. Since the loss is not going down, I am also unable to select the right model to test, and that is how I know the training is not happening properly. What could be the reason for this behaviour?
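For context, the class filtering was done with a small script roughly along these lines before generating the TFRecords (a simplified sketch, not my exact code; the file paths are placeholders for my local setup):

```python
import json

KEEP = {"truck", "bus", "person"}

# Placeholder path to the original COCO annotations.
with open("annotations/instances_train2017.json") as f:
    coco = json.load(f)

# Category ids of the three classes I want to keep.
keep_ids = {c["id"] for c in coco["categories"] if c["name"] in KEEP}

# Keep only annotations of those categories, then only images that still
# have at least one annotation left.
anns = [a for a in coco["annotations"] if a["category_id"] in keep_ids]
img_ids = {a["image_id"] for a in anns}
imgs = [i for i in coco["images"] if i["id"] in img_ids]

filtered = {
    "images": imgs,
    "annotations": anns,
    "categories": [c for c in coco["categories"] if c["id"] in keep_ids],
}

with open("annotations/instances_train2017_filtered.json", "w") as f:
    json.dump(filtered, f)
```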
Thanks
I am not running training on 2 GPUs but on a single GPU. How will the specs change in that case? Can you also explain how the training config changes between running training on a single GPU vs. multiple GPUs? Also, how would I go about changing the step values mentioned in multiple parts of the config file, given that my dataset has 16,270 images compared to 118,287 in the original COCO dataset?
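To make the question concrete, here is the rough arithmetic I am assuming for scaling the step-related values; the batch size and the default total steps below are just illustrative numbers, not values I am taking from the docs, so please correct me if this reasoning is wrong:

```python
# Rough sketch of how I think the step values should scale for my dataset.
num_images = 16_270      # my filtered dataset
coco_images = 118_287    # full COCO train set the default spec is tuned for
batch_size = 2           # per-GPU batch size, assumed for illustration
num_gpus = 1             # I am training on a single GPU

# Steps needed for one pass over the data with this setup.
global_batch = batch_size * num_gpus
steps_per_epoch = num_images // global_batch   # ~8135 steps per epoch

# If the default schedule is tuned for full COCO, I assume every step-based
# value (total_steps, learning_rate_steps, num_steps_per_eval, warmup_steps)
# should scale by roughly the same ratio.
scale = num_images / coco_images               # ~0.14

def scale_steps(default_value: int) -> int:
    """Scale a step count from the default COCO spec to my smaller dataset."""
    return max(1, round(default_value * scale))

# Example with a hypothetical default of 120000 total steps -> ~16500.
print(steps_per_epoch, round(scale, 3), scale_steps(120_000))
```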
I ran the training with the above changes and the loss was going down, but at one step there was a sudden increase in the loss value (fast_box_loss jumped from 1.3 to 214.06), and at a later step the training ended with a message saying the training loss had gone to NaN. Look at the following screenshot:
The training exited while running the second epoch, so the model had already gone through the dataset once, since one epoch had already been completed. This should mean that the issue is not with the dataset; is my understanding correct? If so, how can we debug this?
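Even so, to rule out the dataset completely, one sanity check I am planning to run on my filtered annotations is something like the following (a rough sketch assuming standard COCO-format JSON; the file name is a placeholder):

```python
import json

# Placeholder path to my filtered training annotations.
with open("annotations/instances_train2017_filtered.json") as f:
    coco = json.load(f)

bad = []
for ann in coco["annotations"]:
    x, y, w, h = ann["bbox"]
    # Flag degenerate boxes (zero/negative size), zero areas and empty
    # segmentations, since these are a common cause of sudden loss spikes
    # and NaNs in detection training.
    if w <= 0 or h <= 0 or ann.get("area", 0) <= 0 or not ann.get("segmentation"):
        bad.append(ann["id"])

print(f"{len(bad)} suspicious annotations out of {len(coco['annotations'])}")
```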