Reduced Accuracy when importing SSD model from TLT into DeepStream

Hi,

I trained an SSD model in TLT and got an mAP of 70.0 when evaluating on the test set. When the same model was converted to a TensorRT engine and loaded into DeepStream, and the output was evaluated again with the COCO API at AP@0.5 (the same metric used in TLT), the mAP was only 41.0.

Can you help me figure out what the problem is?

Here are the TLT specifications for SSD training:

ssd_config {
  aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 1.0/3.0]"
  scales: "[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]"
  two_boxes_for_ar1: true
  clip_boxes: false
  loss_loc_weight: 0.8
  focal_loss_alpha: 0.25
  focal_loss_gamma: 2.0
  variances: "[0.1, 0.1, 0.2, 0.2]"
  arch: "resnet18"
  freeze_bn: false
  freeze_blocks: 0
}
training_config {
  batch_size_per_gpu: 16
  num_epochs: 220
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-5
      max_learning_rate: 2e-2
      soft_start: 0.1
      annealing: 0.3
    }
  }
  optimizer {
    adam {
      epsilon: 1e-08
      beta1: 0.9
      beta2: 0.999
    }
  }
  regularizer {
    type: L1
    weight: 3.00000002618e-09
  }
}
eval_config {
  validation_period_during_training: 10
  average_precision_mode: SAMPLE
  batch_size: 32
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.01
  clustering_iou_threshold: 0.6
  top_k: 200
} 
augmentation_config {
  preprocessing {
    output_image_width: 960
    output_image_height: 540
    output_image_channel: 3
    crop_right: 960
    crop_bottom: 540
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 0.7
    zoom_max: 1.8
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}
dataset_config {
  data_sources: {
    tfrecords_path: "/dfs/SSD_Benchmarking/tlt-experiments/ssd_tdcafe/tfrecords/tdcafe_trainval*"
    image_directory_path: "/dfs/SSD_Benchmarking/tlt-experiments/data"
  }
  image_extension: "jpg"
  target_class_mapping {
      key: "person"
      value: "person"
  }
validation_fold: 0
}
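For reference, the anchor boxes implied by `aspect_ratios_global`, `scales`, and `two_boxes_for_ar1` above can be sketched with the standard SSD anchor formula (a sketch only, assuming TLT follows the usual SSD-paper anchor generation; the 960x540 image size comes from the preprocessing section):

```python
import math

def ssd_anchor_sizes(scale, next_scale, aspect_ratios, img_w, img_h,
                     two_boxes_for_ar1=True):
    """Return (width, height) of each anchor box for one feature-map level,
    using the standard SSD formula: w = s*sqrt(ar)*img_w, h = s/sqrt(ar)*img_h."""
    sizes = []
    for ar in aspect_ratios:
        sizes.append((scale * math.sqrt(ar) * img_w,
                      scale / math.sqrt(ar) * img_h))
        if ar == 1.0 and two_boxes_for_ar1:
            # extra box with geometric-mean scale, as in the SSD paper
            s = math.sqrt(scale * next_scale)
            sizes.append((s * img_w, s * img_h))
    return sizes

# First feature-map level from the spec above: scale 0.05, next scale 0.1
anchors = ssd_anchor_sizes(0.05, 0.1, [1.0, 2.0, 0.5, 3.0, 1.0 / 3.0], 960, 540)
```

With five aspect ratios and `two_boxes_for_ar1: true` this gives six anchors per location at each level.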

And the DeepStream configuration:

[application]
enable-perf-measurement=1
perf-measurement-interval-sec=1
gie-kitti-output-dir=output

[tiled-display]
enable=0
rows=1
columns=1
width=960
height=540
gpu-id=2
nvbuf-memory-type=0


[source0]
enable=1
type=2
uri=file:/dfs/fahad/coco_annotator/coco-annotator/datasets/MRCNN_TestSet_h264.mp4
gpu-id=2
cudadec-memtype=0


[streammux]
gpu-id=2
batch-size=1
batched-push-timeout=-1
width=1920
height=1080
nvbuf-memory-type=0


[sink0]
enable=1
type=1
output-file=file:/software/out_tuned_mrcnn.mp4
container=1
codec=3
sync=0
gpu-id=2
source-id=0


[osd]
enable=0
gpu-id=2
border-width=3
text-size=15
text-color=1;1;1;1;
text-bg-color=0.3;0.3;0.3;1
font=Serif
show-clock=0
clock-x-offset=800
clock-y-offset=820
clock-text-size=12
clock-color=1;0;0;0
nvbuf-memory-type=0

[primary-gie]
enable=1
gpu-id=2
batch-size=1
interval=0
model-engine-file=/software/tlt-experiments/ssd_tdcafe/engines_pruned_0.4_retrained_x4/ssd_resnet18_epoch_170_FP32.etlt_b1_fp32.engine
labelfile-path=/root/deepstream_sdk_v4.0.2_x86_64/sources/objectDetector_SSD/labels.txt
config-file=/root/deepstream_sdk_v4.0.2_x86_64/sources/objectDetector_SSD/pgie_ssd_uff_config_test.txt
nvbuf-memory-type=0

[tracker]
enable=0
tracker-width=320
tracker-height=180
ll-config-file=/root/deepstream_sdk_v4.0.2_x86_64/samples/configs/deepstream-app/iou_config.txt
ll-lib-file=/opt/nvidia/deepstream/deepstream-4.0/lib/libnvds_mot_iou.so
gpu-id=2
enable-batch-process=1


[tests]
file-loop=0

And the primary GIE configuration:

[property]
net-scale-factor=1.0
offsets=103.939;116.779;123.68
model-color-format=1
labelfile-path=/root/deepstream_sdk_v4.0.2_x86_64/sources/objectDetector_SSD/labels.txt
tlt-encoded-model=/software/tlt-experiments/ssd_tdcafe/engines_pruned_0.4_retrained_x4/ssd_resnet18_epoch_170_FP32.etlt
tlt-model-key=OHBuNWlvczJzMG41bmVuN2dscnZkdWk1NnQ6YjkyZGI1ZjQtM2EwYi00OWQxLTg1MzItODRkNGU0ZGU5ODU3
uff-input-dims=3;540;960;0
uff-input-blob-name=Input

network-mode=0
num-detected-classes=1
gie-unique-id=1
is-classifier=0
output-blob-names=NMS
parse-bbox-func-name=NvDsInferParseCustomSSDUff
custom-lib-path=/root/deepstream_sdk_v4.0.2_x86_64/sources/objectDetector_SSD/nvdsinfer_customparser_ssd_uff/libnvds_infercustomparser_ssd_uff.so

[class-attrs-all]
threshold=0.3
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0
detected-max-w=0
detected-max-h=0
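For sanity-checking the preprocessing match between TLT and DeepStream: nvinfer transforms each input pixel as `y = net-scale-factor * (x - offset)` per channel, so with the values above it is plain per-channel mean subtraction. A minimal sketch of that formula (the channel means 103.939/116.779/123.68 come straight from the config above; the function name is mine):

```python
def nvinfer_preprocess(pixel, net_scale_factor=1.0,
                       offsets=(103.939, 116.779, 123.68)):
    """DeepStream nvinfer per-pixel preprocessing:
    y = net-scale-factor * (x - offset), applied per channel."""
    return tuple(net_scale_factor * (x - o)
                 for x, o in zip(pixel, offsets))
```

If these values or the channel order (`model-color-format`) differ from what TLT used during training, accuracy drops of this kind are expected.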

What do you mean by “using the COCO API”?

Also, please use the same data when you run “tlt-infer” and when you run DS inference.
That is, run “tlt-infer” to do inference against your test set.
Then use the same test set to generate an h264 file.
For example,

  • copy one image 100 times.
  • generate a 20 fps yuv420p h264 file:
    $ ffmpeg -framerate 20 -i "%d.png" -vcodec h264 -b:v 10485760 -pix_fmt yuv420p yuv420p_20fps.h264

And run DS inference.

By the COCO API, I mean the COCO Python API for evaluating detections according to the COCO evaluation metrics - Link to the API

Also, I did exactly that. The images used in tlt-evaluate were converted into an h264 video using ffmpeg and given as the input video to DeepStream. The output metadata was then evaluated using the API mentioned above, and it showed the reduced accuracy.
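For concreteness, converting the KITTI-format files DeepStream writes to `gie-kitti-output-dir` into the detection-results list the COCO API expects could look like this (a sketch: it assumes each KITTI line carries the bbox in fields 5-8 and the confidence in the last field, and the `image_id`/`category_id` mappings are supplied by the caller):

```python
def kitti_to_coco_dets(kitti_lines, image_id, category_ids):
    """Parse DeepStream KITTI-format detection lines into COCO-style
    detection dicts. COCO bboxes are [x, y, width, height]."""
    dets = []
    for line in kitti_lines:
        f = line.split()
        if not f:
            continue
        cls = f[0]
        x1, y1, x2, y2 = map(float, f[4:8])   # left, top, right, bottom
        score = float(f[-1])                  # assumption: last field is confidence
        dets.append({
            "image_id": image_id,
            "category_id": category_ids[cls],
            "bbox": [x1, y1, x2 - x1, y2 - y1],
            "score": score,
        })
    return dets
```

The resulting list can be dumped to JSON and handed to pycocotools (`COCO.loadRes`, then `COCOeval` with `iouType='bbox'`).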

Please run the experiments below, leaving the COCO API out of the loop:

  1. Run tlt-infer against your test set.
  2. Generate an h264 file from your test set and run DS inference.

Then check whether item 1 gives the same inference results as item 2.
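Comparing item 1 and item 2 per image comes down to matching boxes by IoU; a minimal sketch (the 0.5 threshold and the greedy matching are my assumptions, not part of either tool):

```python
def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def count_matches(boxes_tlt, boxes_ds, thr=0.5):
    """Greedily match tlt-infer boxes to DeepStream boxes at IoU >= thr
    and report how many line up."""
    unused = list(boxes_ds)
    matched = 0
    for a in boxes_tlt:
        best = max(unused, key=lambda b: iou(a, b), default=None)
        if best is not None and iou(a, best) >= thr:
            matched += 1
            unused.remove(best)
    return matched
```

A large gap between the two sets on identical frames would point at the export/preprocessing path rather than the model itself.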