FP16 does not improve latency of ssd_mobilenet_v2_coco

Hi, I’m not seeing an improvement in latency after FP16 conversion of ssd_mobilenet_v2_coco. I get a latency of 66.16 ms (per batch of 8 samples) for the FP16 model, and 66.67 ms for the original model. Is this expected?
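
To put the two numbers in perspective, here is a quick calculation on the latencies quoted above. It is plain Python with nothing TF-TRT-specific, and the variable names are only illustrative:

```python
# Latencies reported above, in milliseconds per batch of 8 images.
batch_size = 8
latency_fp16_ms = 66.16
latency_orig_ms = 66.67

# Relative speedup of the FP16 model over the original model.
speedup = latency_orig_ms / latency_fp16_ms

# Latency reduction as a percentage of the original latency.
improvement_pct = (latency_orig_ms - latency_fp16_ms) / latency_orig_ms * 100

# Per-image latency and throughput for the FP16 model.
per_image_fp16_ms = latency_fp16_ms / batch_size
throughput_fp16 = batch_size / (latency_fp16_ms / 1000.0)

print(f"speedup: {speedup:.3f}x")
print(f"latency reduction: {improvement_pct:.2f}%")
print(f"FP16 per-image latency: {per_image_fp16_ms:.2f} ms")
print(f"FP16 throughput: {throughput_fp16:.1f} images/s")
```

The difference comes out at well under 1%.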

I get these latencies by running TF-TRT’s benchmarking function (https://github.com/tensorflow/tensorrt/tree/master/tftrt/examples/object_detection)

python -m tftrt.examples.object_detection.test my_test.json

in NVIDIA’s docker container nvcr.io/nvidia/tensorflow:19.07-py3. Here are the logs (for FP16) showing that TensorRT nodes were created:

2019-08-09 01:23:28.068787: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:739]
2019-08-09 01:29:11.032378: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:739] Optimization results for grappler item: tf_graph
2019-08-09 01:29:11.032408: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741]   constant folding: Graph size after: 5602 nodes (-1319), 9241 edges (-1463), time = 478.471ms.
2019-08-09 01:29:11.032412: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741]   layout: Graph size after: 5629 nodes (27), 9279 edges (38), time = 140.961ms.
2019-08-09 01:29:11.032415: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741]   constant folding: Graph size after: 5629 nodes (0), 9279 edges (0), time = 175.792ms.
2019-08-09 01:29:11.032418: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741]   TensorRTOptimizer: Graph size after: 4520 nodes (-1109), 8048 edges (-1231), time = 8011.08496ms.
graph_size(MB)(native_tf): 66.2
graph_size(MB)(trt): 131.1
num_nodes(native_tf): 6921
num_nodes(tftrt_total): 4520
num_nodes(trt_only): 278
time(s) (trt_conversion): 10.8548
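
The summary figures at the end of the log can be turned into a few ratios. This is a rough sketch in plain Python using only the values printed above. The variable names are illustrative, and the ratios are not something the benchmark script reports itself:

```python
# Values copied from the conversion summary in the log above.
graph_size_mb_native = 66.2
graph_size_mb_trt = 131.1
num_nodes_native = 6921
num_nodes_tftrt_total = 4520
num_nodes_trt_only = 278

# How much larger the serialized graph became after conversion.
size_ratio = graph_size_mb_trt / graph_size_mb_native

# Percentage of nodes removed from the graph by the conversion.
node_reduction_pct = (
    (num_nodes_native - num_nodes_tftrt_total) / num_nodes_native * 100
)

# Share of the converted graph's nodes that are counted as TensorRT nodes.
trt_share_pct = num_nodes_trt_only / num_nodes_tftrt_total * 100

print(f"graph size ratio (trt / native): {size_ratio:.2f}")
print(f"node count reduction: {node_reduction_pct:.1f}%")
print(f"trt_only share of converted graph: {trt_share_pct:.1f}%")
```

Node counts alone do not show how much of the runtime is spent inside the TensorRT nodes, so these ratios are only a rough indication of how much of the graph was converted.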

And here is the json file I’m using:

{
  "model_config": {
    "model_name": "ssd_mobilenet_v2_coco",
    "input_dir": "./",
    "batch_size": null,
    "override_nms_score_threshold": 0.3,
    "output_path": "./ssd_mobilenet_v2_coco_2018_03_29/original_model.pb"
  },
  "optimization_config": {
    "use_trt": true,
    "precision_mode": "FP16",
    "calib_images_dir": "../coco/val2017",
    "num_calib_images": 8,
    "calib_batch_size": 8,
    "calib_image_shape": [640, 640],
    "max_workspace_size_bytes": 4000000000,
    "output_path": "./ssd_mobilenet_v2_coco_2018_03_29/optimized_model.pb"
  },
  "benchmark_config": {
    "images_dir": "../coco/val2017",
    "annotation_path": "../coco/annotations/instances_val2017.json",
    "batch_size": 8,
    "image_shape": [640, 640],
    "num_images": 1024,
    "output_path": "stats/ssd_mobilenet_v2_coco_optimized.json"
  }
}

Docker image: nvcr.io/nvidia/tensorflow:19.07-py3
GPU: 2080 Ti
NVIDIA driver version: 418.67
CUDA version: 10.1
CUDNN version: 7.6.1