Tensorboard not displaying all info in TF2 Object Detection API Training

Description

I had a training setup using the Object Detection API that worked really well; however, I have had to upgrade from TF1.15 to TF2, so instead of using model_main.py I am now using model_main_tf2.py, with the MobileNet SSD 320x320 pipeline, to transfer-train a new model.

When training my model in TF1.15, TensorBoard would display a whole heap of scalars as well as detection-box image samples. It was fantastic.

In TF2 training I get no such data, just the loss scalars and 3 input images, and yet the event files are huge (gigabytes), whereas they were only hundreds of megs using TF1.15.

The thing is, there is nowhere to specify what data is presented. I have not changed anything other than which model_main .py file I use to run the training. I added num_visualizations: to the pipeline config file, but no visualizations of detection boxes appear.

Can someone please explain to me what is going on? I need to be able to see what's happening throughout training!

Thank You

I am training on a PC in a virtual environment before performing TRT optimization on Linux, but I think that is irrelevant here really.

Environment

GPU Type: P220
Operating System + Version: Win10 Pro
Python Version (if applicable): 3.6
TensorFlow Version (if applicable): 2

Relevant Files

TF1.15 vs TF2 screenshots:

Steps To Reproduce

The repo I am working with: https://github.com/tensorflow/models/tree/master/research/object_detection

pipeline config:

# SSD with Mobilenet v2
# Trained on COCO17, initialized from Imagenet classification checkpoint
# Train on TPU-8
#
# Achieves 22.2 mAP on COCO17 Val

model {
  ssd {
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    num_classes: 2
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    encode_background_as_zeros: true
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        class_prediction_bias_init: -4.6
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            random_normal_initializer {
              stddev: 0.01
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.97,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v2_keras'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.97,
          epsilon: 0.001,
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    loss {
      classification_loss {
        weighted_sigmoid_focal {
          alpha: 0.75,
          gamma: 2.0
        }
      }
      localization_loss {
        weighted_smooth_l1 {
          delta: 1.0
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    normalize_loc_loss_by_codesize: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  fine_tune_checkpoint_version: V2
  fine_tune_checkpoint: "legacy/ssd_mobilenet_v2_320x320_coco17/checkpoint/ckpt-0"
  fine_tune_checkpoint_type: "detection"
  batch_size: 12
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  num_steps: 70000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: .8
          total_steps: 70000
          warmup_learning_rate: 0.13333
          warmup_steps: 2000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
}

train_input_reader: {
  label_map_path: "legacy/training/object-detection.pbtxt"
  tf_record_input_reader {
    input_path: "legacy/data/train.record"
  }
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  retain_original_images: false
  use_moving_averages: false
  num_visualizations: 45
  min_score_threshold: 0.35
  max_evals: 10
}

eval_input_reader: {
  label_map_path: "legacy/training/object-detection.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "legacy/data/test.record"
  }
}

UPDATE: I have investigated further and discovered that the TensorBoard summary settings are configured in https://github.com/tensorflow/models/blob/master/research/object_detection/model_lib.py for TF1.15 and in https://github.com/tensorflow/models/blob/master/research/object_detection/model_lib_v2.py for TF2.

So if someone who knows more about this than I do could work out what the difference is, and what I need to do to get the same result in TensorBoard with the v2 file as I do with the first one, that would be amazing and save me an enormous headache. It seems that, even though it is documented as being for TF2, it may not actually be following TF2 syntax, but I could be wrong.

Hi,
Please check the link below, as it might answer your concerns.

Thanks!

Thanks, but that has absolutely nothing to do with my question.

This is the Jetson Nano forum, but the issue is TensorFlow on a P220, which is out of our support scope.
Please try to check if you can get support from Newest 'tensorflow' Questions - Stack Overflow.

Hi,

The summary API (TensorBoard) has some updates in TensorFlow v2.
Which branch are you using? Is it master?

Could you first verify whether this change (adding summaries to model_main_tf2.py) is included in your source?

https://github.com/tensorflow/models/commit/ee3bfa1ecf5c0d2296bb5d32e7b34d9e0b4b0205
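
If you are working from a git clone of tensorflow/models, one way to check is to ask git whether that commit is part of your checkout (adjust the path to wherever you cloned the repo), for example:

git -C models branch --contains ee3bfa1ecf5c0d2296bb5d32e7b34d9e0b4b0205

If git lists one or more branches, the change is included; if it prints nothing or reports an unknown commit, your source does not contain it.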

Thanks.

Thank You,

I do have the updated version, yes. I am wondering if this is to do with the fact that TF2 uses eager mode? The only place I can find where it writes an image summary of 3 samples is in def eager_train_step( at line 275 of model_lib_v2.py. I read somewhere that training in eager mode doesn't draw scalars and detection images in TensorBoard but does still store them? I don't know if this is true or what to do to remedy it.
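
From what I can tell, the pattern in the TF2 loop looks roughly like the self-contained sketch below. The data, the loss and the tag names are dummies of my own, not the actual model_lib_v2.py code, but it shows why only 3 images appear: tf.summary.image keeps at most max_outputs images per call, and the default is 3.

import tensorflow as tf

# My own simplified stand-in for a TF2 custom training loop's summary writing;
# the dataset, loss and tag names are dummies, not the real model_lib_v2.py code.
writer = tf.summary.create_file_writer('training/train')

dataset = tf.data.Dataset.from_tensor_slices(
    tf.random.uniform([8, 300, 300, 3])).batch(4)

with writer.as_default():
    for step, images in enumerate(dataset):
        loss = tf.reduce_mean(images)  # stand-in for the real detection loss
        # during training only scalar losses like this get written...
        tf.summary.scalar('loss/total_loss', loss, step=step)
        # ...plus a handful of input images; max_outputs defaults to 3,
        # which matches the 3 input images showing up in TensorBoard
        tf.summary.image('train_input_images', images, step=step, max_outputs=3)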

The event files (.v2) are very large, about 1 gig per 10k steps. So it must be storing something…
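
If anyone wants to check what those files actually contain, a small script along these lines will list the summary tags stored in an event file (the path is just an example; point it at one of the events.out.tfevents.* files in the training folder):

import tensorflow as tf
from tensorflow.core.util import event_pb2

# dump the unique summary tags stored in a TensorBoard event file
path = 'training/train/events.out.tfevents.example.v2'  # example path, adjust to yours
tags = set()
for raw in tf.data.TFRecordDataset(path):
    event = event_pb2.Event.FromString(raw.numpy())
    for value in event.summary.value:
        tags.add(value.tag)
print(sorted(tags))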

The good news is that the training works and the model performs well at inference. I just need the visuals in TensorBoard to optimize it.

Thanks for your help!

Hi,

Sorry for the late reply.
Based on the source, the summaries are indeed added in eager training mode:

Have you checked this issue with the TensorFlow team?
They may know more about the differences in the summary API.

Thanks.

Hi,

You can run another command, with the --checkpoint_dir parameter added, whilst the training is running; this will add the visuals too:

py model_main_tf2.py --model_dir=models\ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8 --pipeline_config_path=models\ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8\pipeline.config --checkpoint_dir=models\ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8\
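
As far as I can tell from the repo, the reason this works is that model_main_tf2.py runs a separate evaluation job when --checkpoint_dir is supplied, and only that job writes the COCO metrics and detection-box image summaries. Roughly (argument lists trimmed):

# simplified sketch of the dispatch inside model_main_tf2.py
if FLAGS.checkpoint_dir:
    # evaluation job: polls for new checkpoints, writes metrics and detection images
    model_lib_v2.eval_continuously(
        pipeline_config_path=FLAGS.pipeline_config_path,
        model_dir=FLAGS.model_dir,
        checkpoint_dir=FLAGS.checkpoint_dir)
else:
    # training job: writes only the loss scalars and a few input images
    model_lib_v2.train_loop(
        pipeline_config_path=FLAGS.pipeline_config_path,
        model_dir=FLAGS.model_dir)

Then point TensorBoard at the model directory, e.g. tensorboard --logdir=models\ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8, so it picks up both the train and the eval event folders (assuming the eval job writes its events under the same model_dir).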

Thanks
