Mask R-CNN poor results

I trained Mask R-CNN on 300 images, but the results are very poor. With models other than TAO, the mAP reaches up to 40%. These are my results after 140 iterations:

DLL 2021-09-19 16:26:47.189391 - Iteration: 140 Validation Iteration: 140  AP : 0.017194924876093864
DLL 2021-09-19 16:26:47.189477 - Iteration: 140 Validation Iteration: 140  AP50 : 0.0821227878332138
DLL 2021-09-19 16:26:47.189502 - Iteration: 140 Validation Iteration: 140  AP75 : 0.0016183353727683425
DLL 2021-09-19 16:26:47.189525 - Iteration: 140 Validation Iteration: 140  APs : 0.0
DLL 2021-09-19 16:26:47.189548 - Iteration: 140 Validation Iteration: 140  APm : 1.05388689917163e-05
DLL 2021-09-19 16:26:47.189570 - Iteration: 140 Validation Iteration: 140  APl : 0.029721815139055252
DLL 2021-09-19 16:26:47.189590 - Iteration: 140 Validation Iteration: 140  ARmax1 : 0.03628117963671684
DLL 2021-09-19 16:26:47.189609 - Iteration: 140 Validation Iteration: 140  ARmax10 : 0.07324262708425522
DLL 2021-09-19 16:26:47.189628 - Iteration: 140 Validation Iteration: 140  ARmax100 : 0.11995464563369751
DLL 2021-09-19 16:26:47.189648 - Iteration: 140 Validation Iteration: 140  ARs : 0.0
DLL 2021-09-19 16:26:47.189666 - Iteration: 140 Validation Iteration: 140  ARm : 0.006603773683309555
DLL 2021-09-19 16:26:47.189685 - Iteration: 140 Validation Iteration: 140  ARl : 0.2165975123643875
DLL 2021-09-19 16:26:47.189706 - Iteration: 140 Validation Iteration: 140  mask_AP : 0.0018066676566377282
DLL 2021-09-19 16:26:47.189725 - Iteration: 140 Validation Iteration: 140  mask_AP50 : 0.009883023798465729
DLL 2021-09-19 16:26:47.189745 - Iteration: 140 Validation Iteration: 140  mask_AP75 : 0.00011187559721292928
DLL 2021-09-19 16:26:47.189764 - Iteration: 140 Validation Iteration: 140  mask_APs : 4.041220563522074e-06
DLL 2021-09-19 16:26:47.189784 - Iteration: 140 Validation Iteration: 140  mask_APm : 0.0
DLL 2021-09-19 16:26:47.189803 - Iteration: 140 Validation Iteration: 140  mask_APl : 0.003247086890041828
DLL 2021-09-19 16:26:47.189822 - Iteration: 140 Validation Iteration: 140  mask_ARmax1 : 0.009977323934435844
DLL 2021-09-19 16:26:47.189842 - Iteration: 140 Validation Iteration: 140  mask_ARmax10 : 0.021995464339852333
DLL 2021-09-19 16:26:47.189861 - Iteration: 140 Validation Iteration: 140  mask_ARmax100 : 0.02448979578912258
DLL 2021-09-19 16:26:47.189880 - Iteration: 140 Validation Iteration: 140  mask_ARs : 0.0010638297535479069
DLL 2021-09-19 16:26:47.189899 - Iteration: 140 Validation Iteration: 140  mask_ARm : 0.0
DLL 2021-09-19 16:26:47.189918 - Iteration: 140 Validation Iteration: 140  mask_ARl : 0.044398341327905655

And this is my spec file:

seed: 123
use_amp: False
warmup_steps: 1000
checkpoint: "/workspace/tlt/tlt-experiments/all_segmentation_approaches/pretrained_weights/resnet_50.hdf5"
learning_rate_steps: "[15000, 25000]"
learning_rate_decay_levels: "[0.1, 0.01]"
total_steps: 120000
train_batch_size: 1
eval_batch_size: 1
num_steps_per_eval: 5
momentum: 0.9
l2_weight_decay: 0.0001
warmup_learning_rate: 0.001
init_learning_rate: 0.0025

data_config{
    image_size: "(448, 448)"
    augment_input_data: False
    eval_samples: 135
    training_file_pattern: "/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_v2_temp_mask/tfrecords/train/*.tfrecord"
    validation_file_pattern: "/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_v2_temp_mask/tfrecords/val/*.tfrecord"
    val_json_file: "/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_v2_temp_mask/val/val_coco.json"

    # dataset specific parameters
    num_classes: 2
    skip_crowd_during_training: True
}

maskrcnn_config {
    nlayers: 50
    arch: "resnet"
    freeze_bn: False
    #freeze_blocks: "[0,1]"
    gt_mask_size: 112
        
    # Region Proposal Network
    rpn_positive_overlap: 0.7
    rpn_negative_overlap: 0.3
    rpn_batch_size_per_im: 256
    rpn_fg_fraction: 0.5
    rpn_min_size: 0.

    # Proposal layer.
    batch_size_per_im: 512
    fg_fraction: 0.25
    fg_thresh: 0.5
    bg_thresh_hi: 0.5
    bg_thresh_lo: 0.

    # Faster-RCNN heads.
    fast_rcnn_mlp_head_dim: 1024
    bbox_reg_weights: "(10., 10., 5., 5.)"

    # Mask-RCNN heads.
    include_mask: True
    mrcnn_resolution: 28

    # training
    train_rpn_pre_nms_topn: 2000
    train_rpn_post_nms_topn: 1000
    train_rpn_nms_threshold: 0.7

    # evaluation
    test_detections_per_image: 100
    test_nms: 0.5
    test_rpn_pre_nms_topn: 1000
    test_rpn_post_nms_topn: 1000
    test_rpn_nms_thresh: 0.7

    # model architecture
    min_level: 2
    max_level: 6
    num_scales: 1
    aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
    anchor_scale: 8

    # localization loss
    rpn_box_loss_weight: 1.0
    fast_rcnn_box_loss_weight: 1.0
    mrcnn_weight_loss_mask: 1.0
}


In your case, iteration 140 is the very beginning of training. Please let the training continue and check the results again as it progresses.
For more reference, see: Poor metric results after retraining maskrcnn using TLT notebook - #16 by ghazni
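
To put 140 iterations in perspective, here is a rough sketch of the arithmetic using the values from the spec file above. It rests on two assumptions that the thread does not confirm: that all 300 images are in the training split, and that warmup is a linear ramp from warmup_learning_rate to init_learning_rate followed by step decay. The function names are made up for illustration and are not part of TAO.

```python
# Values copied from the spec file in the question.
TRAIN_IMAGES = 300          # assumption: all 300 images are in the training split
TRAIN_BATCH_SIZE = 1
WARMUP_STEPS = 1000
TOTAL_STEPS = 120000
WARMUP_LR = 0.001
INIT_LR = 0.0025
LR_STEPS = [15000, 25000]
LR_DECAY_LEVELS = [0.1, 0.01]


def epochs_seen(step):
    """Passes over the training set completed after `step` iterations."""
    return step * TRAIN_BATCH_SIZE / TRAIN_IMAGES


def learning_rate(step):
    """Hypothetical schedule: linear warmup, then step decay (assumed, not TAO's code)."""
    if step < WARMUP_STEPS:
        return WARMUP_LR + (INIT_LR - WARMUP_LR) * step / WARMUP_STEPS
    rate = INIT_LR
    for boundary, level in zip(LR_STEPS, LR_DECAY_LEVELS):
        if step >= boundary:
            rate = INIT_LR * level
    return rate


if __name__ == "__main__":
    print(f"epochs after 140 steps: {epochs_seen(140):.2f}")
    print(f"fraction of total_steps: {140 / TOTAL_STEPS:.4%}")
    print(f"learning rate at step 140: {learning_rate(140):.5f}")
```

Under these assumptions, 140 steps is less than half a pass over the data, roughly 0.12% of total_steps, and still well inside the 1000-step warmup.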

Thank you.
So are all the hyperparameters correct for 1 GPU?

That reference is about training on the COCO dataset. Your case is different: you are training on your own images, and only 300 of them. Although that is a small dataset, you can still train with its spec and monitor the loss and AP.
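
One way to monitor AP over time is to parse the "DLL" lines from the training log. The sketch below assumes the line format is exactly as in the output posted in the question; the function names are made up for illustration and are not a TAO API.

```python
import re
from collections import defaultdict

# Matches lines such as:
# DLL 2021-09-19 16:26:47.189391 - Iteration: 140 Validation Iteration: 140  AP : 0.0171...
LINE_RE = re.compile(
    r"^DLL \S+ \S+ - Iteration: (\d+)\s+Validation Iteration: \d+\s+(\w+) : (\S+)$"
)


def parse_metrics(lines):
    """Return {iteration: {metric_name: value}} for every matching DLL line."""
    metrics = defaultdict(dict)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            iteration, name, value = match.groups()
            metrics[int(iteration)][name] = float(value)
    return dict(metrics)


def metric_history(metrics, name):
    """Return [(iteration, value)] for one metric, sorted by iteration."""
    return sorted((it, vals[name]) for it, vals in metrics.items() if name in vals)


if __name__ == "__main__":
    # Two lines taken from the log in the question.
    sample = [
        "DLL 2021-09-19 16:26:47.189391 - Iteration: 140 Validation Iteration: 140  AP : 0.017194924876093864",
        "DLL 2021-09-19 16:26:47.189706 - Iteration: 140 Validation Iteration: 140  mask_AP : 0.0018066676566377282",
    ]
    parsed = parse_metrics(sample)
    print(metric_history(parsed, "AP"))
    print(metric_history(parsed, "mask_AP"))
```

Feeding the full log through parse_metrics and looking at metric_history for AP and mask_AP across evaluation points shows whether the metrics are trending upward as training continues.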
