Unable to retrain pre-trained SSD Mobilenet v1

I am trying to retrain the tlt_pretrained_object_detection:mobilenet_v1 model with my own KITTI-formatted dataset (1 class, “person”), as per the instructions in the Getting Started Guide and this blog post. I am using the tlt-streamanalytics:v2.0_dp_py2 Docker image for this.

First, I convert the KITTI dataset into TFRecords with the following command:
tlt-dataset-convert -d convert.spec -o ./tfrecords/converted.tfrecord
and this convert.spec file:

kitti_config {
  root_directory_path: "[REPLACE_WITH_DATASET_DIR]"
  image_dir_name: "images"
  label_dir_name: "labels"
  image_extension: ".png"
  partition_mode: "random"
  num_partitions: 2
  val_split: 20
  num_shards: 10
}
image_directory_path: "[REPLACE_WITH_DATASET_DIR]"

In my actual spec file, [REPLACE_WITH_DATASET_DIR] is replaced with the real path to my dataset directory.
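For context, my dataset is laid out the way the converter expects, roughly like this:

[REPLACE_WITH_DATASET_DIR]/
  images/
    000001.png
    ...
  labels/
    000001.txt
    ...

and each label file contains standard 15-field KITTI lines, for example (the values here are purely illustrative, not taken from my data):

person 0.00 0 0.00 412.00 157.00 465.00 295.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00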
The output of that command is:

Using TensorFlow backend.
2020-08-05 20:19:08,221 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2020-08-05 20:19:08,243 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 7955     Val: 1988
2020-08-05 20:19:08,243 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2020-08-05 20:19:08,245 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
/usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:266: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2020-08-05 20:19:08,402 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 1
2020-08-05 20:19:08,552 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 2
2020-08-05 20:19:08,702 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 3
2020-08-05 20:19:08,853 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 4
2020-08-05 20:19:09,003 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 5
2020-08-05 20:19:09,153 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 6
2020-08-05 20:19:09,303 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 7
2020-08-05 20:19:09,454 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 8
2020-08-05 20:19:09,604 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 9
2020-08-05 20:19:09,760 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
person: 1988

2020-08-05 20:19:09,760 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 0
2020-08-05 20:19:10,363 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 1
2020-08-05 20:19:10,965 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 2
2020-08-05 20:19:11,567 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 3
2020-08-05 20:19:12,170 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 4
2020-08-05 20:19:12,772 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 5
2020-08-05 20:19:13,375 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 6
2020-08-05 20:19:13,978 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 7
2020-08-05 20:19:14,581 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 8
2020-08-05 20:19:15,184 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 9
2020-08-05 20:19:15,791 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
person: 7955

2020-08-05 20:19:15,792 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Cumulative object statistics
2020-08-05 20:19:15,792 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
person: 9943

2020-08-05 20:19:15,792 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map.
Label in GT: Label in tfrecords file
person: person
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2020-08-05 20:19:15,792 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Tfrecords generation complete.

After that, I download the model with this command:
ngc registry model download-version nvidia/tlt_pretrained_object_detection:mobilenet_v1 -d ./pretrained_model
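This downloads the model into a versioned subdirectory, so the layout afterwards is roughly:

./pretrained_model/
  tlt_pretrained_object_detection_vmobilenet_v1/
    (pretrained weights file)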

Then, I try to train with this command:
tlt-train ssd -e train.spec -r ./pretrained_model --gpus 1 -k $NGC_API_KEY
and this train.spec file:

training_config {
  batch_size_per_gpu: 32
  num_epochs: 120
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-6
      max_learning_rate: 5e-4
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-9
  }
}
eval_config {
  validation_period_during_training: 10
  average_precision_mode: SAMPLE
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.01
  top_k: 200
}
augmentation_config {
  preprocessing {
    output_image_width: 224
    output_image_height: 224
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    color_shift_stddev: 0.0
    hue_rotation_max: 25.0
    saturation_shift_max: 0.2
    contrast_scale_max: 0.1
    contrast_center: 0.5
  }
}
dataset_config {
  data_sources: {
    tfrecords_path: "[REPLACE_WITH_DATASET_DIR]/tfrecords/*"
    image_directory_path: "[REPLACE_WITH_DATASET_DIR]/images"
  }
  image_extension: "png"
  target_class_mapping {
      key: "person"
      value: "person"
  }
  validation_fold: 0
}
ssd_config {
  aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 0.33]"
  aspect_ratios: "[[1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0, 2.0, 0.5, 3.0, 0.33]]"
  two_boxes_for_ar1: true
  clip_boxes: false
  scales: "[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]"
  loss_loc_weight: 1.0
  focal_loss_alpha: 0.25
  focal_loss_gamma: 2.0
  variances: "[0.1, 0.1, 0.2, 0.2]"
  arch: "mobilenet_v1"
  freeze_bn: false
}

As before, [REPLACE_WITH_DATASET_DIR] is replaced with the real path to my dataset directory in my actual spec file.
The output of this command is:

Using TensorFlow backend.
--------------------------------------------------------------------------
[[14866,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: 2b629d1e986b

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
2020-08-05 20:27:49,098 [INFO] /usr/local/lib/python2.7/dist-packages/iva/ssd/utils/spec_loader.pyc: Merging specification from train.spec
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
(the above warning is repeated 32 times)
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "./ssd/scripts/train.py", line 245, in main
  File "./ssd/scripts/train.py", line 96, in run_experiment
  File "./ssd/builders/inputs_builder.py", line 51, in __init__
  File "./detectnet_v2/dataloader/default_dataloader.py", line 206, in get_dataset_tensors
  File "./detectnet_v2/dataloader/default_dataloader.py", line 232, in _generate_images_and_ground_truth_labels
  File "./modulus/processors/processors.py", line 227, in __call__
  File "./detectnet_v2/dataloader/utilities.py", line 60, in call
  File "./modulus/processors/tfrecords_iterator.py", line 143, in process_records
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1508, in split
    axis=axis, num_split=num_or_size_splits, value=value, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 8883, in split
    "Split", split_dim=axis, value=value, num_split=num_split, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 709, in _apply_op_helper
    (key, op_type_name, attr_value.i, attr_def.minimum))
ValueError: Attr 'num_split' of 'Split' Op passed 0 less than minimum 1.

I can’t tell what’s going on behind the scenes in the TLT code, so I don’t know what this error means or how to debug it. How can I fix this so that I can retrain the model on my own dataset?

Thank you in advance for your help.

Your command seems to be missing the pretrained model. Could you double-check?

tlt-train ssd -e train.spec -r ./pretrained_model --gpus 1 -k $NGC_API_KEY

Also, please try adding the line below in eval_config:

batch_size: 16
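so that the eval_config block in train.spec becomes, for example:

eval_config {
  validation_period_during_training: 10
  average_precision_mode: SAMPLE
  batch_size: 16
  matching_iou_threshold: 0.5
}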


Perfect, that fixed it! I added batch_size: 16 to eval_config, and that error disappeared.

I also did not realise I had to pass -r ./pretrained_model/tlt_pretrained_object_detection_vmobilenet_v1 instead of -r ./pretrained_model.

With both of those changes, training now runs successfully. Thank you very much for your help!
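For reference, in case anyone else runs into this: with batch_size: 16 added to eval_config, the command that ended up working for me was essentially

tlt-train ssd -e train.spec -r ./pretrained_model/tlt_pretrained_object_detection_vmobilenet_v1 --gpus 1 -k $NGC_API_KEY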