Tensorboard not displaying all info in TF2 Object Detection API Training

Description

I had a training setup using the Object Detection API that worked really well; however, I have had to upgrade from TF1.15 to TF2, so instead of using model_main.py I am now using model_main_tf2.py, with the MobileNet SSD 320x320 pipeline, to transfer-train a new model.

When training my model in TF1.15, TensorBoard would display a whole heap of scalars as well as detection-box image samples. It was fantastic.

In TF2 training I get no such data, just the loss scalars and 3 input images, and yet the event files are huge (gigabytes), whereas they were only hundreds of megs using TF1.15.

The thing is, there is nowhere to specify what data is presented. I have not changed anything other than which model_main .py file I use to run the training. I added num_visualizations: to the pipeline config file, but no visualizations of detection boxes appear.

Can someone please explain to me what is going on? I need to be able to see what's happening throughout training!

Thank You

I am training on a PC in a virtual environment before performing TRT optimization on Linux, but I think that is irrelevant here really.

Environment

GPU Type: P220
Operating System + Version: Win10 Pro
Python Version (if applicable): 3.6
TensorFlow Version (if applicable): 2

Relevant Files

TF1.15 vs TF2 screenshots:

Steps To Reproduce

The repo I am working with: https://github.com/tensorflow/models/tree/master/research/object_detection

pipeline config:

# SSD with Mobilenet v2
# Trained on COCO17, initialized from Imagenet classification checkpoint
# Train on TPU-8
#
# Achieves 22.2 mAP on COCO17 Val

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    num_classes: 2
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        class_prediction_bias_init: -4.6
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            random_normal_initializer {
              stddev: 0.01
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.97,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v2_keras'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.97,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.75,
          gamma: 2.0
        }
      }
      localization_loss {
        weighted_smooth_l1 {
          delta: 1.0
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  fine_tune_checkpoint_version: V2
  fine_tune_checkpoint: "legacy/ssd_mobilenet_v2_320x320_coco17/checkpoint/ckpt-0"
  fine_tune_checkpoint_type: "detection"
  batch_size: 12
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  num_steps: 70000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: .8
          total_steps: 70000
          warmup_learning_rate: 0.13333
          warmup_steps: 2000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
}

train_input_reader: {
  label_map_path: "legacy/training/object-detection.pbtxt"
  tf_record_input_reader {
    input_path: "legacy/data/train.record"
  }
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  retain_original_images: false
  use_moving_averages: false
  num_visualizations: 45
  min_score_threshold: 0.35
  max_evals: 10
}

eval_input_reader: {
  label_map_path: "legacy/training/object-detection.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "legacy/data/test.record"
  }
}

UPDATE: I have investigated further and discovered that the TensorBoard summary settings are configured in https://github.com/tensorflow/models/blob/master/research/object_detection/model_lib.py for TF1.15 and in https://github.com/tensorflow/models/blob/master/research/object_detection/model_lib_v2.py for TF2.

So if someone who knows more about this than I do could work out what the difference is, and what I need to do to get the same result in TensorBoard with the v2 file as I do with the first one, that would be amazing and save me an enormous headache. It seems that, even though it is documented as being for TF2, it may not actually be following TF2 syntax, but I could be wrong.

Hi,
Please check the link below, as it might answer your concerns.

Thanks!

Thanks, but that has absolutely nothing to do with my question.

This is the Jetson Nano forum, but the issue is TensorFlow on a P220, which is out of our support scope.
Please try to check if you can get support from Newest 'tensorflow' Questions - Stack Overflow.

Hi,

The summary API (TensorBoard) has some updates in TensorFlow v2.
Which branch are you using? Is it master?

Could you first verify whether this change (adding summaries to model_main_tf2.py) is included in your source?

https://github.com/tensorflow/models/commit/ee3bfa1ecf5c0d2296bb5d32e7b34d9e0b4b0205
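
If you are working from a git clone of tensorflow/models, one way to check is to ask git whether that commit is part of your checkout (adjust the path to wherever you cloned the repo), for example:

git -C models branch --contains ee3bfa1ecf5c0d2296bb5d32e7b34d9e0b4b0205

If git lists one or more branches, the change is included; if it prints nothing or reports an unknown commit, your source does not contain it.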

Thanks.

Thank You,

I do have the updated version, yes. I am wondering if this is to do with the fact that TF2 uses eager mode? The only place I can find where it writes an image summary of 3 samples is in def eager_train_step( at line 275 of model_lib_v2.py. I read somewhere that training in eager mode doesn't draw scalars and detection images in TensorBoard but does still store them? I don't know if this is true or what to do to remedy it.
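
From what I can tell, the pattern in the TF2 loop looks roughly like the self-contained sketch below. The data, the loss and the tag names are dummies of my own, not the actual model_lib_v2.py code, but it shows why only 3 images appear: tf.summary.image keeps at most max_outputs images per call, and the default is 3.

import tensorflow as tf

# My own simplified stand-in for a TF2 custom training loop's summary writing;
# the dataset, loss and tag names are dummies, not the real model_lib_v2.py code.
writer = tf.summary.create_file_writer('training/train')

dataset = tf.data.Dataset.from_tensor_slices(
    tf.random.uniform([8, 300, 300, 3])).batch(4)

with writer.as_default():
    for step, images in enumerate(dataset):
        loss = tf.reduce_mean(images)  # stand-in for the real detection loss
        # during training only scalar losses like this get written...
        tf.summary.scalar('loss/total_loss', loss, step=step)
        # ...plus a handful of input images; max_outputs defaults to 3,
        # which matches the 3 input images showing up in TensorBoard
        tf.summary.image('train_input_images', images, step=step, max_outputs=3)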

The event files (.v2) are very large, about 1 gig per 10k steps. So it must be storing something…
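
If anyone wants to check what those files actually contain, a small script along these lines will list the summary tags stored in an event file (the path is just an example; point it at one of the events.out.tfevents.* files in the training folder):

import tensorflow as tf
from tensorflow.core.util import event_pb2

# dump the unique summary tags stored in a TensorBoard event file
path = 'training/train/events.out.tfevents.example.v2'  # example path, adjust to yours
tags = set()
for raw in tf.data.TFRecordDataset(path):
    event = event_pb2.Event.FromString(raw.numpy())
    for value in event.summary.value:
        tags.add(value.tag)
print(sorted(tags))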

The good news is that the training works and the model performs well at inference. I just need the visuals in TensorBoard to optimize it.

Thanks for your help!

Hi,

Sorry for the late reply.
Based on the source, the summaries are indeed added in eager training mode:

Have you checked this issue with the TensorFlow team?
They may know more about the differences in the summary API.

Thanks.

Hi,

You can run another command, with the --checkpoint_dir parameter added, whilst the training is running; this will add the visuals too:

py model_main_tf2.py --model_dir=models\ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8 --pipeline_config_path=models\ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8\pipeline.config --checkpoint_dir=models\ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8\
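
As far as I can tell from the repo, the reason this works is that model_main_tf2.py runs a separate evaluation job when --checkpoint_dir is supplied, and only that job writes the COCO metrics and detection-box image summaries. Roughly (argument lists trimmed):

# simplified sketch of the dispatch inside model_main_tf2.py
if FLAGS.checkpoint_dir:
    # evaluation job: polls for new checkpoints, writes metrics and detection images
    model_lib_v2.eval_continuously(
        pipeline_config_path=FLAGS.pipeline_config_path,
        model_dir=FLAGS.model_dir,
        checkpoint_dir=FLAGS.checkpoint_dir)
else:
    # training job: writes only the loss scalars and a few input images
    model_lib_v2.train_loop(
        pipeline_config_path=FLAGS.pipeline_config_path,
        model_dir=FLAGS.model_dir)

Then point TensorBoard at the model directory, e.g. tensorboard --logdir=models\ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8, so it picks up both the train and the eval event folders (assuming the eval job writes its events under the same model_dir).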

Thanks
