MaskRCNN training doesn't converge on the Mapillary Vistas Dataset

I am training MaskRCNN on the Mapillary Vistas Dataset for instance segmentation.
The resulting accuracy is very low:

AP: 0.064670101
AP50: 0.112756371
AP75: 0.061653189
APl: 0.090381607
APm: 0.039630000
APs: 0.009333231
ARl: 0.256172895
ARm: 0.078267403
ARmax1: 0.119873106
ARmax10: 0.166767538
ARmax100: 0.168501124
ARs: 0.012933115
mask_AP: 0.045311652
mask_AP50: 0.082073040
mask_AP75: 0.041109238
mask_APl: 0.065103896
mask_APm: 0.015935341
mask_APs: 0.000883913
mask_ARl: 0.156799093
mask_ARm: 0.041575737
mask_ARmax1: 0.077029452
mask_ARmax10: 0.096323162
mask_ARmax100: 0.096980833
mask_ARs: 0.002640154

The Mapillary dataset was converted to COCO format using https://github.com/Luodian/Mapillary2COCO
and then converted to TFRecords for training.

How can I improve the training?
The training log file and config file are attached.
log_2.txt (1.9 MB)
maskrcnn_train_resnet50_2.txt (2.0 KB)
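Since the data went through a Mapillary-to-COCO conversion before becoming TFRecords, one cheap check is to summarize the converted annotation JSON: how many annotations each category has, how many are marked iscrowd, and whether any annotation points at an undefined category id. Below is a minimal standard-library sketch. It assumes the standard COCO keys (images, annotations, categories, category_id, iscrowd), and the function names are hypothetical, not part of TAO or Mapillary2COCO.

```python
import json
from collections import Counter


def summarize_coco(coco: dict) -> dict:
    """Summarize a COCO-style annotation dict (images/annotations/categories)."""
    anns = coco.get("annotations", [])
    cat_names = {c["id"]: c["name"] for c in coco.get("categories", [])}
    # Count annotations per category id; Counter returns 0 for missing ids.
    per_cat = Counter(a["category_id"] for a in anns)
    # Crowd annotations are the ones skip_crowd_during_training would drop.
    num_crowd = sum(1 for a in anns if a.get("iscrowd", 0))
    return {
        "num_images": len(coco.get("images", [])),
        "num_annotations": len(anns),
        "num_crowd": num_crowd,
        # Unknown ids are reported by their numeric id as a string.
        "annotations_per_category": {
            cat_names.get(cid, str(cid)): n for cid, n in sorted(per_cat.items())
        },
        # Categories defined in the JSON but never annotated cannot be learned.
        "categories_without_annotations": [
            name for cid, name in sorted(cat_names.items()) if per_cat[cid] == 0
        ],
        # Annotations whose category_id has no entry in "categories".
        "unknown_category_ids": sorted(set(per_cat) - set(cat_names)),
    }


def summarize_coco_file(path: str) -> dict:
    """Load a COCO JSON file from disk and summarize it."""
    with open(path) as f:
        return summarize_coco(json.load(f))
```

For example, summarize_coco_file("instances_train.json") (hypothetical file name) returns the summary as a dict. Categories with zero or very few annotations are a plausible source of near-zero per-class AP.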

Can you set the following?
num_classes: 124

Refer to the thread "TLT train maskrcnn model with Mapillary Vistas Dataset failed on CUDA_ERROR_OUT_OF_MEMORY: out of memory", post #52 by Morganh.

I can’t change it to 124; when I do, I get the error shown in the attached 124Classes.txt.
In my conversion from Mapillary format to COCO there are 37 classes, plus 1 for background, so I have 38 classes. Please see the other attached file.
124Classes.txt (12.1 KB)
instance_label.txt (3.4 KB)
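The num_classes value can be derived from the converted COCO JSON instead of being counted by hand. This is a sketch under one assumption: the model uses the COCO category_id directly as the class index, with 0 reserved for background. That is how the stock COCO setting works (80 categories with ids up to 90 give num_classes: 91). Under that assumption num_classes is the largest category id plus one, which equals the category count plus one only when the ids run contiguously from 1. The helper names are hypothetical.

```python
import json


def infer_num_classes(coco: dict) -> int:
    """Return max(category_id) + 1, reserving index 0 for background.

    Assumes category ids are used directly as class indices (as in the
    stock COCO setting: ids up to 90 -> num_classes 91).
    """
    ids = [c["id"] for c in coco.get("categories", [])]
    if not ids:
        raise ValueError("no categories found")
    if min(ids) < 1:
        # Id 0 would collide with the background class.
        raise ValueError("category id 0 collides with the background class")
    return max(ids) + 1


def infer_num_classes_from_file(path: str) -> int:
    """Load a COCO JSON file and infer num_classes from its categories."""
    with open(path) as f:
        return infer_num_classes(json.load(f))
```

With contiguous ids 1 to 37 this gives 38, matching the count above. With gaps in the ids, the result is larger than the category count.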

Please try to train with my attached tfrecords and json file. See

Thank you, I’ll try. Should I set 124 classes?

Yes.

There are no TFRecords for validation. Could you also send me the validation records?

You can split the TFRecord files and use a small subset of them as the validation files.
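Splitting off a validation set can be done at the shard level. Below is a minimal sketch that picks a reproducible random subset of the TFRecord shard files and moves them into a separate directory. The directory layout and function names are hypothetical. Afterwards, the spec's training and validation file patterns would point at the two directories. In the TAO MaskRCNN spec these appear as training_file_pattern and validation_file_pattern, but check your own spec. COCO-style evaluation also reads a validation annotation JSON (val_json_file), which presumably has to describe the same images as the validation shards.

```python
import os
import random
import shutil


def split_shards(paths, val_fraction=0.1, seed=0):
    """Deterministically split TFRecord shard paths into (train, val) lists."""
    paths = sorted(paths)  # sort first so the split does not depend on glob order
    if len(paths) < 2:
        raise ValueError("need at least two shards to split")
    # At least one validation shard, and always leave at least one for training.
    n_val = min(len(paths) - 1, max(1, round(len(paths) * val_fraction)))
    rng = random.Random(seed)  # fixed seed -> same split on every run
    val = sorted(rng.sample(paths, n_val))
    val_set = set(val)
    train = [p for p in paths if p not in val_set]
    return train, val


def move_to_val_dir(val_paths, val_dir):
    """Move the selected validation shards into their own directory."""
    os.makedirs(val_dir, exist_ok=True)
    for p in val_paths:
        shutil.move(p, os.path.join(val_dir, os.path.basename(p)))
```

Typical use, with a hypothetical path: train, val = split_shards(glob.glob("/data/mapillary/tfrecords/train*.tfrecord")), followed by move_to_val_dir(val, "/data/mapillary/tfrecords_val").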

OK, I’ll move some of them to validation.

Using your data, I get the error shown in the attached file.
train.txt (12.1 KB)

Can you share the command?

print("For multi-GPU, change --gpus based on your machine.")
!tao mask_rcnn train -e $SPECS_DIR/maskrcnn_train_resnet50.txt \
                     -d $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                     -k $KEY \
                     --gpus 1

Please create (mkdir) a new results folder and retry.
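To be sure a run never resumes from an earlier checkpoint, the results directory can be created fresh for each run. This is a small sketch, and the base directory and naming scheme are hypothetical. If you use the TAO launcher, the directory presumably has to be created under the host directory that is mounted as $USER_EXPERIMENT_DIR, and the corresponding container path is what goes to -d.

```python
import os
import time


def make_fresh_results_dir(base_dir, prefix="experiment_dir_unpruned"):
    """Create a new, empty results directory and return its path.

    A timestamp suffix (plus a counter on collision) guarantees the
    directory did not exist before, so no old checkpoints can be picked up.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    path = os.path.join(base_dir, f"{prefix}_{stamp}")
    suffix = 0
    while True:
        candidate = path if suffix == 0 else f"{path}_{suffix}"
        try:
            os.makedirs(candidate)  # exist_ok defaults to False: never reuse a dir
            return candidate
        except FileExistsError:
            suffix += 1
```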

I still get the same error. I had already made sure the folder contained no previously trained models.

Can you share your latest spec file?

It is attached. Thank you.
maskrcnn_train_resnet50.txt (2.0 KB)

Please set
skip_crowd_during_training: true

Also, a correction to my previous comment: you need to set
num_classes: 125
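When switching between datasets, it is easy to leave one of these values stale. Below is a small sketch that rewrites them in the spec text with a regular expression. It assumes each key sits on its own line in the protobuf-text spec and only handles single-token values (numbers, true/false). The helper name is hypothetical.

```python
import re


def set_spec_value(spec_text: str, key: str, value: str) -> str:
    """Replace the value of every `key: value` line in a protobuf-text spec.

    Only single-token values are supported; anything after the value on the
    same line (e.g. a trailing comment) is kept.
    """
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*:[ \t]*)\S+(.*)$", re.MULTILINE
    )
    # A function replacement avoids backslash escapes being interpreted in `value`.
    new_text, n = pattern.subn(lambda m: f"{m.group(1)}{value}{m.group(2)}", spec_text)
    if n == 0:
        raise KeyError(f"{key!r} not found in spec")
    return new_text
```

For example, read the spec file into a string, apply set_spec_value(spec, "num_classes", "125") and set_spec_value(spec, "skip_crowd_during_training", "true"), then write it back.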

Training starts, but the loss keeps increasing.
Does the learning rate need to be tuned?
Everything is logged in the attached file.
train.txt (187.8 KB)
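On the learning-rate question, it can help to tabulate the rate actually applied at each step and compare it with the step where the loss starts rising. Below is a sketch of a linear-warmup plus step-decay schedule. This is the kind of schedule the TAO MaskRCNN spec configures with fields such as init_learning_rate, warmup_learning_rate, warmup_steps, learning_rate_steps and learning_rate_decay_levels. Treat the mapping to those fields as an assumption, in particular that the decay levels multiply the initial rate. The second helper applies the linear scaling rule (learning rate proportional to total batch size). This is a common heuristic when a spec tuned for more GPUs or a larger batch is run with --gpus 1.

```python
from bisect import bisect_right


def lr_at_step(step, init_lr, warmup_lr, warmup_steps, decay_steps, decay_levels):
    """Linear warmup from warmup_lr to init_lr, then step decay.

    decay_levels[i] is treated as a multiplier on init_lr from
    decay_steps[i] onward (an assumption about the spec's semantics).
    decay_steps must be sorted in increasing order.
    """
    if len(decay_steps) != len(decay_levels):
        raise ValueError("decay_steps and decay_levels must have the same length")
    if step < warmup_steps:
        # Linear interpolation during warmup.
        return warmup_lr + (init_lr - warmup_lr) * (step / warmup_steps)
    i = bisect_right(decay_steps, step)  # number of decay boundaries passed
    return init_lr if i == 0 else init_lr * decay_levels[i - 1]


def linearly_scaled_lr(ref_lr, ref_batch, batch):
    """Linear scaling rule: keep learning_rate / total_batch_size constant."""
    return ref_lr * batch / ref_batch
```

With hypothetical numbers: if a spec was tuned with init_learning_rate 0.02 for a total batch of 16 and you train with 4 images on one GPU, linearly_scaled_lr(0.02, 16, 4) suggests roughly 0.005.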

My dataset is not complete. You can retry with your own dataset.

So if I switch back to my own dataset, what do I need to modify? As for the true/false setting (skip_crowd_during_training), I had set it to true when I first trained with my own dataset, then changed it to false and forgot to change it back.
My original problem, the training loss not converging, is still not solved.