Faster RCNN on TLT 3.0 not learning the same as TLT 2.0

I am working on a custom Faster RCNN model on TLT 3.0. I previously trained this model with identical data and an identical network configuration on TLT 2.0, getting an mAP of 96.5%. Now the network either learns nothing and stabilizes at a loss of ~1.1 (the prior experiment stabilized at 0.04), or reaches a NaN loss and fails. Here is the experiment spec for the TLT 3.0 experiment:

random_seed: 42
    enc_key: 'cPPm_vUm4qGaRpd6kgQX5Dp5S-RKRgh9vp1Y_rQYX2U'
    verbose: True
    model_config {
    input_image_config {
    image_type: RGB
    image_channel_order: 'bgr'
    size_height_width {
    height: 540
    width: 960
    }
        image_channel_mean {
            key: 'b'
            value: 114.54486766972353
    }
        image_channel_mean {
            key: 'g'
            value: 118.13145483368518
    }
        image_channel_mean {
            key: 'r'
            value: 117.67608453228597
    }
    image_scaling_factor: 1
    max_objects_num_per_image: 10
    }
    arch: "resnet:34"
    anchor_box_config {
    scale: 20
    scale: 40
    scale: 90
    ratio: 1.0
    ratio: 0.5
    ratio: 2.0
    }
    freeze_bn: False
    roi_mini_batch: 256
    rpn_stride: 16
    use_bias: True
    roi_pooling_config {
    pool_size: 7
    pool_size_2x: False
    }
    all_projections: True
    use_pooling: False
    }
    dataset_config {
      data_sources: {
        tfrecords_path: "/workspace/TLT/T_23/tfrecords/tfrecord*"
        image_directory_path: "/workspace/DAB/D_7"
      }
    image_extension: 'jpg'
    target_class_mapping {
        key: 'p_1'
        value: 'p'
    }
    target_class_mapping {
        key: 'p_2'
        value: 'p'
    }
    target_class_mapping {
        key: 'p_3'
        value: 'p'
    }
    target_class_mapping {
        key: 'p_4'
        value: 'p'
    }
    target_class_mapping {
        key: 'p_5'
        value: 'p'
    }
    target_class_mapping {
        key: 'p_6'
        value: 'p'
    }
    target_class_mapping {
        key: 'p_7'
        value: 'p'
    }
    target_class_mapping {
        key: 'p_8'
        value: 'p'
    }
    target_class_mapping {
        key: 'r_1'
        value: 'r'
    }
    target_class_mapping {
        key: 'r_2'
        value: 'r'
    }
    target_class_mapping {
        key: 'r_3'
        value: 'r'
    }
    target_class_mapping {
        key: 'r_4'
        value: 'r'
    }
    target_class_mapping {
        key: 'r_5'
        value: 'r'
    }
    target_class_mapping {
        key: 'r_6'
        value: 'r'
    }
    target_class_mapping {
        key: 'r_7'
        value: 'r'
    }
    target_class_mapping {
        key: 'r_8'
        value: 'r'
    }
    validation_fold: 0
    }
    augmentation_config {
    preprocessing {
    output_image_width: 960
    output_image_height: 540
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    enable_auto_resize: True
    }
    spatial_augmentation {
    hflip_probability: 0.1
    vflip_probability: 0.1
    zoom_min: 0.9
    zoom_max: 1.1
    translate_max_x: 96
    translate_max_y: 54
    }
    color_augmentation {
    hue_rotation_max: 0
    saturation_shift_max: 0.0
    contrast_scale_max: 0
    contrast_center: 0.5
    }
    }
    training_config {
    checkpoint_interval: 1
    output_model: "/workspace/TLT/T_23/weights/faster_rcnn_resnet_34.tlt"
    enable_augmentation: True
    enable_qat: False
    batch_size_per_gpu: 16
    num_epochs: 20
    rpn_min_overlap: 0.3
    rpn_max_overlap: 0.7
    classifier_min_overlap: 0.0
    classifier_max_overlap: 0.5
    gt_as_roi: False
    std_scaling: 1.0
    classifier_regr_std {
    key: 'x'
    value: 10
    }
    classifier_regr_std {
    key: 'y'
    value: 10
    }
    classifier_regr_std {
    key: 'w'
    value: 5
    }
    classifier_regr_std {
    key: 'h'
    value: 5
    }
    
    rpn_mini_batch: 256
    rpn_pre_nms_top_N: 12000
    rpn_nms_max_boxes: 2000
    rpn_nms_overlap_threshold: 0.7
    
    regularizer {
        type: L2
        weight: 0.0001
    }
    optimizer {
        adam {
            lr: 0.00001
            beta_1: 0.9
            beta_2: 0.999
            decay: 0.0
        }
    }
    learning_rate {
        soft_start {
            start_lr: 0.0001
            base_lr: 0.0001
            soft_start: 0.0001
            annealing_points: [0.05, 0.1, 0.15]
            annealing_divider: 1.5
        }
    }
    
    lambda_rpn_regr: 1.0
    lambda_rpn_class: 1.0
    lambda_cls_regr: 1.0
    lambda_cls_class: 1.0
    }
    inference_config {
    images_dir: '/workspace/DAB/D_7/test/images'
    model: 'weights/fasterrcnn_resnet34_epoch_001.tlt'
    batch_size: 2
    detection_image_output_dir: '/workspace/TLT/T_23/infer/images'
    labels_dump_dir: '/workspace/TLT/T_23/infer/labels'
    rpn_pre_nms_top_N: 6000
    rpn_nms_max_boxes: 300
    rpn_nms_overlap_threshold: 0.7
    object_confidence_thres: 0.0001
    bbox_visualize_threshold: 0.6
    classifier_nms_max_boxes: 100
    classifier_nms_overlap_threshold: 0.3
    }
    evaluation_config {
    model: 'weights/fasterrcnn_resnet34_epoch_001.tlt'
    batch_size: 16
    validation_period_during_training: 1
    rpn_pre_nms_top_N: 6000
    rpn_nms_max_boxes: 300
    rpn_nms_overlap_threshold: 0.7
    classifier_nms_max_boxes: 100
    classifier_nms_overlap_threshold: 0.3
    object_confidence_thres: 0.0001
    use_voc07_11point_metric: False
    gt_matching_iou_threshold: 0.5
    }

And here is the corresponding TLT 2.0 spec:

random_seed: 42
enc_key: "tlt"
verbose: True
network_config {
        input_image_config {
                image_type: RGB
                image_channel_order: "bgr"
                size_height_width {
                        height: 540
                        width: 960
                }
                image_channel_mean {
                        key: 'b'
                        value: 114.54486766972353
                }
                image_channel_mean {
                        key: 'g'
                        value: 118.13145483368518
                }
                image_channel_mean {
                        key: 'r'
                        value: 117.67608453228597
                }
                image_scaling_factor: 1.0
                max_objects_num_per_image: 10
        }
        feature_extractor: "resnet:34"
        anchor_box_config {
                scale: 20
                scale: 40
                scale: 90
                ratio: 1
                ratio: 0.5
                ratio: 2
        }
        freeze_bn: False
        roi_mini_batch: 256
        rpn_stride: 16
        conv_bn_share_bias: True
        roi_pooling_config: {
                pool_size: 7
                pool_size_2x: False
        }
        all_projections: True
        use_pooling: False
}
training_config {
        kitti_data_config {
                data_sources: {
                        tfrecords_path: "/ze/data/Experiments/TLT/T_113/tfrecords/tfrecords*"
                        image_directory_path: "/ze/data/Experiments/DAB/D_30"
                }
                image_extension: 'jpg'
                target_class_mapping {
                        key: 'r_2'
                        value: 'R'
                }
                target_class_mapping {
                        key: 'p_8'
                        value: 'P'
                }
                target_class_mapping {
                        key: 'r_8'
                        value: 'R'
                }
                target_class_mapping {
                        key: 'r_4'
                        value: 'R'
                }
                target_class_mapping {
                        key: 'r_3'
                        value: 'R'
                }
                target_class_mapping {
                        key: 'p_2'
                        value: 'P'
                }
                target_class_mapping {
                        key: 'p_7'
                        value: 'P'
                }
                target_class_mapping {
                        key: 'p_3'
                        value: 'P'
                }
                target_class_mapping {
                        key: 'r_6'
                        value: 'R'
                }
                target_class_mapping {
                        key: 'p_6'
                        value: 'P'
                }
                target_class_mapping {
                        key: 'r_7'
                        value: 'R'
                }
                target_class_mapping {
                        key: 'p_4'
                        value: 'P'
                }
                target_class_mapping {
                        key: 'p_1'
                        value: 'P'
                }
                target_class_mapping {
                        key: 'p_5'
                        value: 'P'
                }
                target_class_mapping {
                        key: 'r_1'
                        value: 'R'
                }
                target_class_mapping {
                        key: 'r_5'
                        value: 'R'
                }
                validation_fold: 0
        }
        data_augmentation {
                preprocessing {
                        output_image_width: 960
                        output_image_height: 540
                        output_image_channel: 3
                        min_bbox_width: 0.0
                        min_bbox_height: 0.0
                }
                spatial_augmentation {
                        hflip_probability: 0.1
                        vflip_probability: 0.1
                        zoom_min: 0.9
                        zoom_max: 1.1
                        translate_max_x: 96
                        translate_max_y: 54
                        rotate_rad_max: 0.261799
                }
                color_augmentation {
                        hue_rotation_max: 0.0
                        saturation_shift_max: 0.0
                        contrast_scale_max: 0.0
                        contrast_center: 0.0
                }
        }
        enable_augmentation: True
        batch_size_per_gpu: 2
        num_epochs: 20
        pretrained_weights: "/ze/data/pretrained_models/resnet_34.hdf5"
        output_model: "/ze/data/Experiments/TLT/T_113/models/model.tlt"
        rpn_min_overlap: 0.3
        rpn_max_overlap: 0.7
        classifier_min_overlap: 0
        classifier_max_overlap: 0.5
        gt_as_roi: False
        std_scaling: 1
        classifier_regr_std {
                key: 'x'
                value: 10
        }
        classifier_regr_std {
                key: 'y'
                value: 10
        }
        classifier_regr_std {
                key: 'w'
                value: 5
        }
        classifier_regr_std {
                key: 'h'
                value: 5
        }
        rpn_mini_batch: 256
        rpn_pre_nms_top_N: 3000
        rpn_nms_max_boxes: 500
        rpn_nms_overlap_threshold: 0.6
        reg_config {
                type: L2
                weight: 0.0001
        }
        optimizer {
                adam {
                        lr: 0.0001
                        beta_1: 0.9
                        beta_2: 0.999
                        decay: 0.0
                }
        }
        lr_scheduler {
                soft_start {
                        base_lr: 0.0001
                        start_lr: 0.0001
                        soft_start: 0.0001
                        annealing_points: 0.05
                        annealing_points: 0.1
                        annealing_points: 0.15
                        annealing_points: 0.2
                }
        }
        lambda_rpn_regr: 1.0
        lambda_rpn_class: 1.0
        lambda_cls_regr: 1.0
        lambda_cls_class: 1.0
        inference_config {
                images_dir: '/ze/data/Experiments/DAB/D_30/test/images'
                model: '/ze/data/Experiments/TLT/T_113/models/model.epoch17.tlt'
                detection_image_output_dir: '/ze/data/Experiments/DAB/D_30/infer/images'
                labels_dump_dir: '/ze/data/Experiments/DAB/D_30/infer/labels'
                rpn_pre_nms_top_N: 6000
                rpn_nms_max_boxes: 300
                rpn_nms_overlap_threshold: 0.7
                bbox_visualize_threshold: 0.6
                classifier_nms_max_boxes: 300
                classifier_nms_overlap_threshold: 0.3
        }
        evaluation_config {
                model: '/ze/data/Experiments/TLT/T_113/models/model.epoch17.tlt'
                labels_dump_dir: '/ze/data/Experiments/TLT/T_113/eval/labels'
                rpn_pre_nms_top_N: 3000
                rpn_nms_max_boxes: 500
                rpn_nms_overlap_threshold: 0.6
                classifier_nms_max_boxes: 300
                classifier_nms_overlap_threshold: 0.3
                object_confidence_thres: 0.0001
                use_voc07_11point_metric: False
        }
}

Update: This only seems to happen when AMP is enabled.
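
For context, here is a minimal sketch of how I run training with AMP enabled, assuming the standard TLT 3.0 launcher; the spec path and key below are placeholders, and dropping the --use_amp flag gives the non-AMP run:

tlt faster_rcnn train -e /workspace/TLT/T_23/specs/frcnn_spec.txt -k $KEY --gpus 1 --use_amp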

So, do you mean:

  • when AMP is enabled, the network either learns nothing and stabilizes at a loss of ~1.1?
  • when AMP is disabled, the network works well?

Could you share the full log when AMP is enabled?

I believe my issue was not having the ‘pretrained_weights’ configuration set. However, when I do set it to the .hdf5 files from NGC, I get this error:

ValueError: Layer #1 (named "conv1") expects 2 weight(s), but the saved weights have 1 element(s).

Please make sure you download the correct pretrained model according to NVIDIA TAO Documentation when you run TLT 3.0.

I believe it is the correct model: the ResNet-18 and ResNet-34 weights were downloaded through the NGC CLI using:

ngc registry model download-version nvidia/tlt_pretrained_object_detection:resnet18

These pretrained model files were used successfully for YOLOv3/v4 and SSD on TLT 3.0.

Please double check your training spec file. If possible, please download the spec files from https://docs.nvidia.com/tlt/tlt-user-guide/text/tlt_quick_start_guide.html#use-the-examples

wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/tlt_cv_samples/versions/v1.1.0/zip -O tlt_cv_samples_v1.1.0.zip
unzip -u tlt_cv_samples_v1.1.0.zip -d ./tlt_cv_samples_v1.1.0 && rm -rf tlt_cv_samples_v1.1.0.zip && cd ./tlt_cv_samples_v1.1.0

Confirmed that the error I was receiving was due to having freeze_bn set to False. I am repeating the experiment with AMP to confirm that learning now works correctly.
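
For reference, a minimal sketch of the two spec fields involved in the fix (the pretrained_weights path is a placeholder for wherever the NGC .hdf5 file was downloaded):

# in model_config
freeze_bn: True
# in training_config
pretrained_weights: "/workspace/pretrained_models/resnet_34.hdf5"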


Thanks for the info. Great job!

Still seeing poor performance when using AMP. After the first epoch, these are my metrics without AMP:

==========================================================================================
Class               AP                  precision           recall              RPN_recall          
------------------------------------------------------------------------------------------
P                   0.7289              0.0211              0.8630              0.9595              
------------------------------------------------------------------------------------------
R                   0.8701              0.0361              0.9317              0.9656              
------------------------------------------------------------------------------------------

And the identical experiment with AMP:

==========================================================================================
Class               AP                  precision           recall              RPN_recall          
------------------------------------------------------------------------------------------
P                   0.0000              0.0000              0.0000              0.9062              
------------------------------------------------------------------------------------------
R                   0.0000              0.0000              0.0000              0.9299              
------------------------------------------------------------------------------------------

Though the RPN recall is higher, all other metrics are at 0. Also, even for the experiment without AMP, where the AP is high, the precision scores are still incredibly low. Why is this?

For the training without AMP, please focus on mAP only.
For the training with AMP, which dGPU are you using?

I am using an RTX 3090.

For the training with AMP, can you reproduce the mAP of 0 when training on the public KITTI dataset?

I was able to successfully run the KITTI example with AMP enabled and working properly. However, training still fails when using custom data. I am going to match up the KITTI config and the custom-data config to try to find an experiment configuration error that may be causing this issue.
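
As a first step, I plan to simply diff the two spec files to spot the differences (filenames here are placeholders):

diff -u frcnn_kitti_spec.txt frcnn_custom_spec.txt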

Thanks for the info.