No detections after training PeopleNet using custom labeled data

I am trying to improve the performance of the PeopleNet model using around 1300 labeled 1920x1080 PNG images.

I used the following command:
tlt-train detectnet_v2 -k tlt_encode -r /workspace/tlt-experiments/ -e train.txt

My train.txt file is:

random_seed: 42
model_config {
  num_layers: 18
  pretrained_model_file: "/workspace/tlt-experiments/resnet34_peoplenet.tlt"
  use_batch_norm: true
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
  all_projections: true
}
# Sample rasterizer configs to instantiate a 3 class bbox rasterizer
bbox_rasterizer_config {
  target_class_config {
    key: "person"
    value: {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.4
      cov_radius_y: 0.4
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.67
}
postprocessing_config {
  target_class_config {
    key: "person"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 20
      }
    }
  }
}
cost_function_config {
  target_classes {
    name: "person"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: True
  max_objective_weight: 0.9999
  min_objective_weight: 0.0001
}
training_config {
  batch_size_per_gpu: 8
  num_epochs: 80
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-6
      max_learning_rate: 5e-4
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-9
  }
  optimizer {
    adam {
      epsilon: 1e-08
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    enabled: False
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
}
# Sample augmentation config
augmentation_config {
  preprocessing {
    output_image_width: 960
    output_image_height: 544
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {

    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    color_shift_stddev: 0.0
    hue_rotation_max: 25.0
    saturation_shift_max: 0.2
    contrast_scale_max: 0.1
    contrast_center: 0.5
  }
}
evaluation_config {
  average_precision_mode: INTEGRATE
  validation_period_during_training: 10
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "person"
    value: 0.5
  }
  evaluation_box_config {
    key: "person"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
}
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tlt-experiments/tf_records/*"
    image_directory_path: "/workspace/tlt-experiments/"
  }
  image_extension: "png"
  target_class_mapping {
      key: "person"
      value: "person"
  }
  validation_fold: 0
}

The results of training at 80 epochs are:

Epoch 80/80
=========================

Validation cost: 0.000043
Mean average_precision (in %): 98.6850

class name      average precision (in %)
------------  --------------------------
person                            98.685

Median Inference Time: 0.013576

I understand that for deployment I would prune the model, but I just wanted to check accuracy on the site camera, so I used the command below to export the model for DeepStream 5.0 DP:

tlt-export detectnet_v2 -m /workspace/tlt-experiments/weights/model.tlt -o /workspace/tlt-experiments/weights/peoplenet_detector_unpruned.etlt -k tlt_encode

I get the following:

Using TensorFlow backend.
NOTE: UFF has been tested with TensorFlow 1.14.0.
WARNING: The version of TensorFlow installed on this system is not guaranteed to work with UFF.
DEBUG [/usr/lib/python2.7/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['output_cov/Sigmoid', 'output_bbox/BiasAdd'] as outputs
[TensorRT] INFO: Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[TensorRT] INFO: Detected 1 inputs and 2 output network tensors.

Then I use the same DeepStream code that was running the standard PeopleNet model, with the changes shown below:

[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
tlt-model-key=tlt_encode
tlt-encoded-model=/home/ddi/Social%20Distancing/CampsieRSL/deepstream/dev/local-testing/peoplenet_detector_unpruned.etlt
#tlt-encoded-model=/opt/nvidia/deepstream/deepstream-5.0/samples/models/tlt_pretrained_models/peoplenet/resnet34_peoplenet_pruned.etlt
labelfile-path=labels_peoplenet.txt
#model-engine-file=/opt/nvidia/deepstream/deepstream-5.0/samples/models/tlt_pretrained_models/peoplenet/resnet34_peoplenet_pruned.etlt_b1_gpu0_fp16.engine
input-dims=3;544;960;0
uff-input-blob-name=input_1
batch-size=1
process-mode=1
model-color-format=0
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
num-detected-classes=1
cluster-mode=1
interval=0
gie-unique-id=1
output-blob-names=output_bbox/BiasAdd;output_cov/Sigmoid

[class-attrs-all]
pre-cluster-threshold=0.2
## Set eps=0.7 and minBoxes for cluster-mode=1(DBSCAN)
eps=0.7
minBoxes=1

I have changed the number of classes to just 1 (person), and the label file now contains only person. When I run the DeepStream app, it does not detect a single person unless I set pre-cluster-threshold below 0.1, which gives mostly false positives.

Have I missed something? Does it matter that I am only using 1 class? Do the training images need to be 960x544 rather than 1920x1080?

Can you first run tlt-infer against the test dataset and check its output folder?

In addition:

  1. You need to resize your images and labels to 960x544 offline. Alternatively, you can keep your images and labels as they are, but then you need to set the size to 1920x1088 in the spec. The width and height must be multiples of 16.
  2. You need to change

num_layers: 18

to

num_layers: 34
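
To illustrate point 1, here is a minimal sketch of the two options: rounding a dimension up to a multiple of 16, and rescaling label boxes from 1920x1080 to 960x544. It assumes the labels are KITTI-format text lines with the box stored as left, top, right, bottom in columns 4 to 7; the helper names `round_up_to_multiple` and `scale_kitti_line` are hypothetical and not part of TLT. The images themselves would need to be resized with the same scale factors using an image tool of your choice.

```python
def round_up_to_multiple(value, base=16):
    """Return the smallest multiple of `base` that is >= value."""
    return ((value + base - 1) // base) * base


def scale_kitti_line(line, src_size, dst_size):
    """Rescale the bbox columns (4-7) of one KITTI-format label line.

    src_size and dst_size are (width, height) tuples. Assumption: the line
    follows the KITTI layout, so only the four bbox fields need to change.
    """
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    scale_x = dst_w / src_w
    scale_y = dst_h / src_h
    fields = line.split()
    left, top, right, bottom = (float(v) for v in fields[4:8])
    fields[4:8] = [
        f"{left * scale_x:.2f}",
        f"{top * scale_y:.2f}",
        f"{right * scale_x:.2f}",
        f"{bottom * scale_y:.2f}",
    ]
    return " ".join(fields)


# Option A: keep 1920-wide images and round the height up for the spec.
print(round_up_to_multiple(1080))  # 1088

# Option B: rescale a (made-up) label line to the 960x544 training size.
example = "person 0.00 0 0.00 100.00 200.00 300.00 400.00 0 0 0 0 0 0 0"
print(scale_kitti_line(example, (1920, 1080), (960, 544)))
```

Note that 1920x1080 to 960x544 is not an exact 2:1 reduction vertically (544 / 1080 is slightly more than 0.5), which is why x and y use separate scale factors.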

Yeah, the results of tlt-infer were very poor.

I will try keeping the labels and images as they are, changing the spec to 1920x1088 and the layers to 34, then retrain and see.

If that doesn't work, I will resize to 960x544.

So I ran multiple training tests using 1920x1088 and 960x544, with the layers changed to 34. Both performed worse than the normal PeopleNet model, even when using the unpruned model. Is there anything else I need to change in my train.txt to improve performance, or is it a case of needing more data and a greater mix of data? Also, is it recommended to include images with no people/labels to improve accuracy?

I see that your training reaches a high mAP result (98.6850). This result is generated by tlt-evaluate.
So please run a quick test: use tlt-infer to run inference against the same validation dataset (it should be part 0 of your /workspace/tlt-experiments/tf_records/*, because you set validation_fold: 0 in the spec) to see whether tlt-infer achieves the same mAP result as tlt-evaluate.
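
If it helps with that comparison, below is a rough sketch of a per-image spot check: it matches predicted boxes against ground-truth boxes using the 0.5 overlap threshold from minimum_detection_ground_truth_overlap in the spec. It is not how tlt-evaluate computes mAP, just a quick sanity check; the function names are hypothetical, and it assumes you have already parsed the boxes from the tlt-infer output labels and your own labels into (left, top, right, bottom) tuples at the same resolution.

```python
def iou(box_a, box_b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def count_matches(predicted, ground_truth, min_overlap=0.5):
    """Greedily match predictions to ground truth boxes.

    Returns (true_positives, false_positives, false_negatives).
    Each ground-truth box can be matched at most once.
    """
    unmatched = list(ground_truth)
    true_positives = 0
    for pred in predicted:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= min_overlap:
            unmatched.remove(best)
            true_positives += 1
    false_positives = len(predicted) - true_positives
    return true_positives, false_positives, len(unmatched)


# Made-up boxes for illustration only.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(0, 0, 10, 9), (50, 50, 60, 60)]
print(count_matches(pred_boxes, gt_boxes))  # (1, 1, 1)
```

Many false negatives on the validation fold would point to a mismatch between the inference and evaluation settings rather than to the trained weights.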