Hi, I’m not seeing a latency improvement after converting ssd_mobilenet_v2_coco to FP16 with TF-TRT. I get a latency of 66.16 ms (per batch of 8 samples) for the FP16 model, versus 66.67 ms for the original model. Is this expected?
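To put the two numbers in perspective, here is a quick sketch of the arithmetic. It only uses the latencies and batch size quoted above; the variable and function names are my own, not part of the TF-TRT benchmark script.

```python
# Latencies reported above, in milliseconds per batch of 8 images.
BATCH_SIZE = 8
latency_original_ms = 66.67
latency_fp16_ms = 66.16


def summarize(latency_ms, batch_size):
    """Return (per-image latency in ms, throughput in images per second)."""
    per_image_ms = latency_ms / batch_size
    images_per_s = batch_size / (latency_ms / 1000.0)
    return per_image_ms, images_per_s


# Speedup factor and relative latency reduction of FP16 over the original.
speedup = latency_original_ms / latency_fp16_ms
reduction_pct = 100.0 * (latency_original_ms - latency_fp16_ms) / latency_original_ms

for name, latency in [("original", latency_original_ms), ("FP16", latency_fp16_ms)]:
    per_image_ms, images_per_s = summarize(latency, BATCH_SIZE)
    print(f"{name}: {per_image_ms:.2f} ms/image, {images_per_s:.1f} images/s")

print(f"speedup: {speedup:.4f}x, latency reduction: {reduction_pct:.2f}%")
```

So the gain is well under 1%, which is why I suspect the FP16 engines are not doing much of the work.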
I get these latencies by running TF-TRT’s benchmarking function (https://github.com/tensorflow/tensorrt/tree/master/tftrt/examples/object_detection)
python -m tftrt.examples.object_detection.test my_test.json
in NVIDIA’s docker container nvcr.io/nvidia/tensorflow:19.07-py3. Here are the logs (for FP16) showing that TensorRT nodes were created:
2019-08-09 01:23:28.068787: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:739]
2019-08-09 01:29:11.032378: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:739] Optimization results for grappler item: tf_graph
2019-08-09 01:29:11.032408: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741] constant folding: Graph size after: 5602 nodes (-1319), 9241 edges (-1463), time = 478.471ms.
2019-08-09 01:29:11.032412: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741] layout: Graph size after: 5629 nodes (27), 9279 edges (38), time = 140.961ms.
2019-08-09 01:29:11.032415: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741] constant folding: Graph size after: 5629 nodes (0), 9279 edges (0), time = 175.792ms.
2019-08-09 01:29:11.032418: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741] TensorRTOptimizer: Graph size after: 4520 nodes (-1109), 8048 edges (-1231), time = 8011.08496ms.
graph_size(MB)(native_tf): 66.2
graph_size(MB)(trt): 131.1
num_nodes(native_tf): 6921
num_nodes(tftrt_total): 4520
num_nodes(trt_only): 278
time(s) (trt_conversion): 10.8548
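For anyone who wants to compare these stats quickly, here is a small sketch that parses the summary lines above and derives a few ratios. The parsing helper is hypothetical (it is not part of the benchmark script), and I am assuming that num_nodes(trt_only) counts the TensorRT engine nodes in the converted graph.

```python
# Summary lines copied verbatim from the conversion output above.
stats_text = """\
graph_size(MB)(native_tf): 66.2
graph_size(MB)(trt): 131.1
num_nodes(native_tf): 6921
num_nodes(tftrt_total): 4520
num_nodes(trt_only): 278
time(s) (trt_conversion): 10.8548
"""


def parse_stats(text):
    """Parse 'key: value' lines into a dict of floats (hypothetical helper)."""
    stats = {}
    for line in text.splitlines():
        # Split on the last colon so parentheses in the key are preserved.
        key, sep, value = line.rpartition(":")
        if sep:
            stats[key.strip()] = float(value)
    return stats


stats = parse_stats(stats_text)

# Nodes removed from the graph by the conversion.
nodes_removed = stats["num_nodes(native_tf)"] - stats["num_nodes(tftrt_total)"]
# Growth of the serialized graph after conversion.
size_ratio = stats["graph_size(MB)(trt)"] / stats["graph_size(MB)(native_tf)"]
# Share of remaining nodes that are TensorRT nodes (assumed meaning of trt_only).
trt_share_pct = 100.0 * stats["num_nodes(trt_only)"] / stats["num_nodes(tftrt_total)"]

print(f"nodes removed: {nodes_removed:.0f}")
print(f"graph size ratio (trt / native): {size_ratio:.2f}")
print(f"TensorRT nodes as share of converted graph: {trt_share_pct:.2f}%")
```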
And here is the json file I’m using:
{
  "model_config": {
    "model_name": "ssd_mobilenet_v2_coco",
    "input_dir": "./",
    "batch_size": null,
    "override_nms_score_threshold": 0.3,
    "output_path": "./ssd_mobilenet_v2_coco_2018_03_29/original_model.pb"
  },
  "optimization_config": {
    "use_trt": true,
    "precision_mode": "FP16",
    "calib_images_dir": "../coco/val2017",
    "num_calib_images": 8,
    "calib_batch_size": 8,
    "calib_image_shape": [640, 640],
    "max_workspace_size_bytes": 4000000000,
    "output_path": "./ssd_mobilenet_v2_coco_2018_03_29/optimized_model.pb"
  },
  "benchmark_config": {
    "images_dir": "../coco/val2017",
    "annotation_path": "../coco/annotations/instances_val2017.json",
    "batch_size": 8,
    "image_shape": [640, 640],
    "num_images": 1024,
    "output_path": "stats/ssd_mobilenet_v2_coco_optimized.json"
  }
}
Docker image: nvcr.io/nvidia/tensorflow:19.07-py3
GPU: 2080 Ti
NVIDIA driver version: 418.67
CUDA version: 10.1
cuDNN version: 7.6.1