PeopleNet v1.0 unpruned model shows very poor results on the COCO dataset

Hi, I downloaded the NVIDIA PeopleNet unpruned model from https://ngc.nvidia.com/catalog/models/nvidia:tlt_peoplenet, installed the NVIDIA docker container v2.0_dp_py, and used DetectNet V2 to run person-detection inference on the COCO dataset. On 1800 COCO images (resized to 960*544), with the aggregate_cov confidence threshold set to 0.9 and dbscan_eps = 0.3, I only get 0.7 precision and 0.486 recall. Looking into the bad cases, I find that some horses/elephants/vegetables/… are predicted as persons, and quite a lot of people in the images are missed. Are these results reasonable, given that PeopleNet is trained on more than 5 million person objects? Are they due to a mismatch with the training dataset (the millions of training images are quite different from COCO, e.g. no animals shown)?
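For anyone reproducing this setup: DetectNet V2 reads KITTI-format labels, so the COCO person boxes have to be rescaled together with the images before any comparison. Here is a minimal sketch of that conversion. It assumes the images were stretched to 960*544 without preserving the aspect ratio and that only the class name and the four box coordinates matter (all other KITTI fields are zero-filled). The function name `coco_box_to_kitti_line` is illustrative and not part of TLT.

```python
def coco_box_to_kitti_line(bbox, src_size, dst_size=(960, 544), class_name="person"):
    """Convert one COCO box [x, y, width, height] to a KITTI label line.

    src_size and dst_size are (width, height) of the original and resized image.
    Assumption: the image was stretched to dst_size, so x and y scale independently.
    """
    x, y, w, h = bbox
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    sx, sy = dst_w / src_w, dst_h / src_h

    # COCO stores the top-left corner plus size; KITTI stores two corners.
    x1, y1 = x * sx, y * sy
    x2, y2 = (x + w) * sx, (y + h) * sy

    # KITTI fields: type truncated occluded alpha x1 y1 x2 y2 h w l x y z rotation_y
    return (
        f"{class_name} 0.00 0 0.00 "
        f"{x1:.2f} {y1:.2f} {x2:.2f} {y2:.2f} "
        "0.00 0.00 0.00 0.00 0.00 0.00 0.00"
    )


if __name__ == "__main__":
    # A 128x96 box at (64, 48) in a 640x480 COCO image.
    print(coco_box_to_kitti_line([64, 48, 128, 96], (640, 480)))
```

If the images were letterboxed or cropped instead of stretched, the scale factors and offsets would differ and the ground-truth boxes would no longer line up with the detections.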

Could you please share your inference spec file?

Regarding the PeopleNet training data: yes, it contains no animals.

Thanks for the reply, here is my spec file. I only care about the person class, so that is the only one I configured in detail.

inferencer_config{
# Defining target class names for the experiment.
# Note: This must be mentioned in order of the network's classes.
target_classes: "person"
target_classes: "bag"
target_classes: "face"
# Inference dimensions.
image_width: 960
image_height: 544
# Must match what the model was trained for.
image_channels: 3
batch_size: 16
gpu_index: 0
# model handler config
tlt_config{
model: "/workspace/data/resnet34_peoplenet.tlt"
}
}
bbox_handler_config{
kitti_dump: true
disable_overlay: false
overlay_linewidth: 2
classwise_bbox_handler_config{
key:"person"
value: {
confidence_model: "aggregate_cov"
output_map: "person"
confidence_threshold: 0.9
bbox_color{
R: 0
G: 255
B: 0
}
clustering_config{
coverage_threshold: 0.005
dbscan_eps: 0.3
dbscan_min_samples: 0.05
minimum_bounding_box_height: 4
}
}
}
classwise_bbox_handler_config{
key:"default"
value: {
confidence_model: "mean_cov"
output_map: "default"
confidence_threshold: 0.5
bbox_color{
R: 255
G: 0
B: 0
}
clustering_config{
coverage_threshold: 0.00
dbscan_eps: 0.4
dbscan_min_samples: 0.05
minimum_bounding_box_height: 4
}
}
}
}

Hi xzhka1229,
This result comes from the dataset mismatch: PeopleNet was not trained on the COCO dataset. For COCO, you can use the unpruned PeopleNet tlt model as a pretrained model and then run training.
That is also why TLT provides the unpruned PeopleNet model: users can do transfer learning on their own data.

Thanks for your time,

Could you please share your test data so I can try it? If not, can you recommend a similar dataset (for example, are the KITTI left color images closer to your dataset)? I want to reproduce the good performance of PeopleNet.

Thanks a lot.

I ran tlt-infer against part of NVIDIA's internal data only, and it works fine.
Some of the images contain people carrying bags or luggage in an open parking lot.
You can find similar images via Google. Sorry for the inconvenience.

For example, you can download the image from https://ngc.nvidia.com/catalog/models/nvidia:tlt_peoplenet

$ wget https://developer.nvidia.com/sites/default/files/akamai/NGC_Images/models/peoplenet/input_11ft45deg_000070.jpg

Then run tlt-infer against it. You should get the same result as shown in https://developer.nvidia.com/sites/default/files/akamai/NGC_Images/models/peoplenet/output_11ft45deg_000070.jpg

Thanks, I downloaded the image and tested it, and I get the same result as the demo. But a single image is not enough to judge accuracy convincingly. I am still learning people detection, so I want to compare different models on different datasets and use PeopleNet as a benchmark.

Sorry to bother you again: can tlt-evaluate automatically evaluate the KITTI dataset? tlt-infer gives me labels but not accuracy/precision/recall… If yes, which spec file or command should I use?
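In the meantime, precision and recall can be approximated by hand from the boxes that tlt-infer writes. The following is a minimal sketch for one image and one class. It assumes boxes are (x1, y1, x2, y2) tuples and uses greedy one-to-one matching at an IoU threshold of 0.5. The function names are illustrative, and this is not claimed to match how tlt-evaluate computes its metric.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, gts, iou_thr=0.5):
    """Greedy matching: each ground-truth box can be claimed by one prediction."""
    unmatched = list(gts)
    true_positives = 0
    for pred in preds:
        # Pick the still-unmatched ground-truth box that overlaps this prediction most.
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= iou_thr:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(gts) if gts else 0.0
    return precision, recall


if __name__ == "__main__":
    predictions = [(0, 0, 10, 10), (50, 50, 60, 60)]
    ground_truth = [(0, 0, 10, 10), (100, 100, 110, 110), (200, 200, 210, 210)]
    print(precision_recall(predictions, ground_truth))
```

For a whole dataset, the true-positive, prediction, and ground-truth counts would be summed over all images before dividing. The numbers also depend strongly on the IoU threshold, so it is worth stating which one was used when comparing models.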

Thanks

To evaluate the KITTI dataset with tlt-evaluate, please refer to the Jupyter notebook inside the docker.
To evaluate PeopleNet with tlt-evaluate, refer to People Net -
In addition, here is one spec for reference.

random_seed: 42
dataset_config {
data_sources {
tfrecords_path: "your tfrecord"
image_directory_path: "your own image"
}
image_extension: "jpg"
target_class_mapping {
key: "person"
value: "Person"
}
target_class_mapping {
key: "Person"
value: "Person"
}
target_class_mapping {
key: "rider"
value: "Person"
}
target_class_mapping {
key: "Rider"
value: "Person"
}
target_class_mapping {
key: "personal_bag"
value: "Bag"
}
target_class_mapping {
key: "rolling_bag"
value: "Bag"
}
target_class_mapping {
key: "face"
value: "Face"
}

validation_fold: 0
}
augmentation_config {
preprocessing {
output_image_width: 960
output_image_height: 544
crop_right: 960
crop_bottom: 544
min_bbox_width: 1.0
min_bbox_height: 1.0
output_image_channel: 3
}
spatial_augmentation {
hflip_probability: 0.5
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 8.0
translate_max_y: 8.0
}
color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.20000000298
contrast_scale_max: 0.10000000149
contrast_center: 0.5
}
}
postprocessing_config {
target_class_config {
key: "Person"
value {
clustering_config {
coverage_threshold: 0.00499999988824
dbscan_eps: 0.20000000298
dbscan_min_samples: 0.0500000007451
minimum_bounding_box_height: 4
}
}
}
target_class_config {
key: "Bag"
value {
clustering_config {
coverage_threshold: 0.00499999988824
dbscan_eps: 0.15000000596
dbscan_min_samples: 0.0500000007451
minimum_bounding_box_height: 4
}
}
}
target_class_config {
key: "Face"
value {
clustering_config {
coverage_threshold: 0.00499999988824
dbscan_eps: 0.15000000596
dbscan_min_samples: 0.0500000007451
minimum_bounding_box_height: 4
}
}
}
}
model_config {
pretrained_model_file: "/workspace/its/peoplenet/resnet34_peoplenet.tlt"
num_layers: 34
load_graph: True
use_batch_norm: False
activation {
activation_type: "relu"
}
objective_set {
bbox {
scale: 35.0
offset: 0.5
}
cov {
}
}
training_precision {
backend_floatx: FLOAT32
}
arch: "resnet"
}
evaluation_config {
validation_period_during_training: 1
first_validation_epoch: 1
minimum_detection_ground_truth_overlap {
key: "Person"
value: 0.699999988079
}
minimum_detection_ground_truth_overlap {
key: "Bag"
value: 0.5
}
minimum_detection_ground_truth_overlap {
key: "Face"
value: 0.5
}
evaluation_box_config {
key: "Person"
value {
minimum_height: 20
maximum_height: 9999
minimum_width: 10
maximum_width: 9999
}
}
evaluation_box_config {
key: "Bag"
value {
minimum_height: 20
maximum_height: 9999
minimum_width: 10
maximum_width: 9999
}
}
evaluation_box_config {
key: "Face"
value {
minimum_height: 20
maximum_height: 9999
minimum_width: 10
maximum_width: 9999
}
}
average_precision_mode: INTEGRATE
}
cost_function_config {
target_classes {
name: "Person"
class_weight: 1.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 10.0
}
}
target_classes {
name: "Bag"
class_weight: 8.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 1.0
}
}
target_classes {
name: "Face"
class_weight: 4.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 10.0
}
}
enable_autoweighting: true
max_objective_weight: 0.999899983406
min_objective_weight: 9.99999974738e-05
}
training_config {
batch_size_per_gpu: 16
num_epochs: 10
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 10e-10
max_learning_rate: 10e-10
soft_start: 0.0
annealing: 0.3
}
}
regularizer {
type: L1
weight: 3.00000002618e-09
}
optimizer {
adam {
epsilon: 9.99999993923e-09
beta1: 0.899999976158
beta2: 0.999000012875
}
}
cost_scaling {
initial_exponent: 20.0
increment: 0.005
decrement: 1.0
}
checkpoint_interval: 10
}
bbox_rasterizer_config {
target_class_config {
key: "Person"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.40000000596
cov_radius_y: 0.40000000596
bbox_min_radius: 1.0
}
}
target_class_config {
key: "Bag"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 1.0
cov_radius_y: 1.0
bbox_min_radius: 1.0
}
}
target_class_config {
key: "Face"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 1.0
cov_radius_y: 1.0
bbox_min_radius: 1.0
}
}
deadzone_radius: 0.400000154972
}

Another thing: when I use tlt-infer, the output labels (kitti dump) contain no scores. Could you please explain how to get the score of each box?
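For context, a KITTI label line has 15 whitespace-separated fields, and KITTI result files add a confidence score as a 16th field. A small reader that accepts both layouts makes it easy to check whether a dump carries scores. This is only a sketch: `KittiBox` and `parse_kitti_line` are illustrative names and not part of TLT.

```python
from typing import NamedTuple, Optional


class KittiBox(NamedTuple):
    label: str
    x1: float
    y1: float
    x2: float
    y2: float
    score: Optional[float]  # None when the line has no 16th field


def parse_kitti_line(line: str) -> KittiBox:
    """Parse one KITTI label line, with or without a trailing score."""
    parts = line.split()
    if len(parts) not in (15, 16):
        raise ValueError(f"expected 15 or 16 fields, got {len(parts)}")
    # Fields 4..7 hold the 2D box: left, top, right, bottom.
    x1, y1, x2, y2 = (float(v) for v in parts[4:8])
    score = float(parts[15]) if len(parts) == 16 else None
    return KittiBox(parts[0], x1, y1, x2, y2, score)


if __name__ == "__main__":
    without_score = "person 0.00 0 0.00 96.00 54.40 288.00 163.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00"
    print(parse_kitti_line(without_score))
    print(parse_kitti_line(without_score + " 0.93"))
```

When `score` is None for every box, detections cannot be ranked, so only a single precision/recall point can be computed rather than a full precision-recall curve.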

By the way, I used tlt-evaluate to evaluate the unpruned PeopleNet on the KITTI training dataset and got 0%. If I run tlt-infer first and evaluate the results myself, I get 53% precision.

Here is my spec file:
random_seed: 42
dataset_config {
data_sources {
tfrecords_path: "/workspace/data/kitti/tfrecords/kitti_trainval/*"
image_directory_path: "/workspace/data/kitti/training"
}
image_extension: "png"
target_class_mapping {
key: "car"
value: "car"
}
target_class_mapping {
key: "cyclist"
value: "cyclist"
}
target_class_mapping {
key: "pedestrian"
value: "person"
}
target_class_mapping {
key: "person_sitting"
value: "person"
}
target_class_mapping {
key: "van"
value: "car"
}
validation_fold: 0
}
augmentation_config {
preprocessing {
output_image_width: 960
output_image_height: 544
min_bbox_width: 1.0
min_bbox_height: 1.0
output_image_channel: 3
}
spatial_augmentation {
hflip_probability: 0.5
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 8.0
translate_max_y: 8.0
}
color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.20000000298
contrast_scale_max: 0.10000000149
contrast_center: 0.5
}
}
postprocessing_config {
target_class_config {
key: "car"
value {
clustering_config {
coverage_threshold: 0.00499999988824
dbscan_eps: 0.20000000298
dbscan_min_samples: 0.0500000007451
minimum_bounding_box_height: 20
}
}
}
target_class_config {
key: "cyclist"
value {
clustering_config {
coverage_threshold: 0.00499999988824
dbscan_eps: 0.15000000596
dbscan_min_samples: 0.0500000007451
minimum_bounding_box_height: 20
}
}
}
target_class_config {
key: "person"
value {
clustering_config {
coverage_threshold: 0.00749999983236
dbscan_eps: 0.230000004172
dbscan_min_samples: 0.0500000007451
minimum_bounding_box_height: 20
}
}
}
}
model_config {
#pretrained_model_file: "/workspace/tlt-experiments/detectnet_v2/experiment_dir_pruned/resnet18_nopool_bn_detectnet_v2_pruned.tlt"
pretrained_model_file: "/workspace/data/resnet34_peoplenet.tlt"
num_layers: 34
use_batch_norm: false
load_graph: true
objective_set {
bbox {
scale: 35.0
offset: 0.5
}
cov {
}
}
training_precision {
backend_floatx: FLOAT32
}
arch: "resnet"
}
evaluation_config {
validation_period_during_training: 10
first_validation_epoch: 30
minimum_detection_ground_truth_overlap {
key: "car"
value: 0.699999988079
}
minimum_detection_ground_truth_overlap {
key: "cyclist"
value: 0.5
}
minimum_detection_ground_truth_overlap {
key: "person"
value: 0.5
}
evaluation_box_config {
key: "car"
value {
minimum_height: 20
maximum_height: 9999
minimum_width: 10
maximum_width: 9999
}
}
evaluation_box_config {
key: "cyclist"
value {
minimum_height: 20
maximum_height: 9999
minimum_width: 10
maximum_width: 9999
}
}
evaluation_box_config {
key: "person"
value {
minimum_height: 20
maximum_height: 9999
minimum_width: 10
maximum_width: 9999
}
}
average_precision_mode: INTEGRATE
}
cost_function_config {
target_classes {
name: "car"
class_weight: 1.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 10.0
}
}
target_classes {
name: "cyclist"
class_weight: 8.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 1.0
}
}
target_classes {
name: "person"
class_weight: 4.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 10.0
}
}
enable_autoweighting: true
max_objective_weight: 0.999899983406
min_objective_weight: 9.99999974738e-05
}

  1. For detectnet_v2, tlt-infer does not yet support showing the score of each bbox.
  2. If you want to run tlt-evaluate against the pruned peoplenet model, please refer to my spec and the command in People Net -, and prepare some images whose labels are Person, Bag, or Face.
    The KITTI dataset does not contain the same classes, so you get mAP=0.
  3. Please paste your command and log for “If i tlt-infer first and evaluate myself, I got 53% precision.”
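One quick way to check the class mismatch from point 2 before running tlt-evaluate is to list which class names actually appear in the label files and compare them with the classes the model expects. This is a minimal sketch. It assumes KITTI-format label lines (class name as the first field) and a case-insensitive comparison, and `class_overlap` is an illustrative name rather than a TLT utility.

```python
from collections import Counter


def class_overlap(label_lines, expected_classes):
    """Compare class names found in KITTI label lines with the expected ones."""
    found = Counter(
        line.split()[0].lower() for line in label_lines if line.strip()
    )
    expected = {name.lower() for name in expected_classes}
    return {
        "matched": sorted(set(found) & expected),    # usable for evaluation
        "unmapped": sorted(set(found) - expected),   # in the data, unknown to the model
        "missing": sorted(expected - set(found)),    # expected, but absent from the data
    }


if __name__ == "__main__":
    kitti_lines = [
        "Car 0.00 0 0.00 10 10 50 50 0 0 0 0 0 0 0",
        "Pedestrian 0.00 0 0.00 60 20 80 90 0 0 0 0 0 0 0",
    ]
    print(class_overlap(kitti_lines, ["Person", "Bag", "Face"]))
```

An empty "matched" list means every label needs a target_class_mapping entry pointing at one of the model's classes before the evaluation can report anything meaningful.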