No improvements from TensorRT on NVIDIA-AI-IOT/tf_trt_models

I can’t get any improvement from TensorRT on a Drive PX 2 AutoChauffeur (P2379, the one without a dGPU). I simply cloned your Jetson example from https://github.com/NVIDIA-AI-IOT/tf_trt_models and created a benchmark.py script, which is little more than a copy-paste of https://github.com/NVIDIA-AI-IOT/tf_trt_models/blob/master/examples/detection/detection.ipynb. Since the Jetson TX2 has specs similar to one node of my Drive PX 2, I expected values and improvements similar to those shown in the table at https://github.com/NVIDIA-AI-IOT/tf_trt_models#models-1
Unfortunately, in my case I see no difference in inference speed between the original models and the TensorRT ones (if anything, the TRT runs are slightly slower). Here’s what I see (average runtime per image over 50 runs; full logs below):

ssd_mobilenet_v1_coco: Original - 0.051792s, TRT - 0.053618s
ssd_mobilenet_v2_coco: Original - 0.084560s, TRT - 0.093455s
ssd_inception_v2_coco: Original - 0.100977s, TRT - 0.106853s
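
To put the timings above in relative terms, here is a minimal sketch that computes the percentage change of the TRT run against the original run. The numbers are copied from the list above; the `relative_change` helper is a name made up for this illustration.

```python
# Average per-image runtimes in seconds, copied from the list above:
# (original, TRT) for each model.
timings = {
    'ssd_mobilenet_v1_coco': (0.051792, 0.053618),
    'ssd_mobilenet_v2_coco': (0.084560, 0.093455),
    'ssd_inception_v2_coco': (0.100977, 0.106853),
}


def relative_change(original, trt):
    """Return the change of the TRT runtime relative to the original, in percent."""
    return (trt - original) / original * 100.0


# A positive value means the TRT run was slower than the original one.
changes = {name: relative_change(o, t) for name, (o, t) in timings.items()}
for name, pct in sorted(changes.items()):
    print('{}: {:+.1f}%'.format(name, pct))
```

All three values come out positive (roughly 3.5%, 10.5% and 5.8%), which is why I say the TRT runs look slightly slower rather than merely equal.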

Taking a closer look, it seems that the conversion slims the graph down by roughly 1,050 to 1,260 nodes but fails to place anything into a TRTEngineOp:

ssd_mobilenet_v1_coco: Original - 7571 nodes, TRT - 6518 nodes out of which 0 are TRTEngineOp
ssd_mobilenet_v2_coco: Original - 8062 nodes, TRT - 6865 nodes out of which 0 are TRTEngineOp
ssd_inception_v2_coco: Original - 8278 nodes, TRT - 7015 nodes out of which 0 are TRTEngineOp
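
A per-op breakdown can make counts like these easier to compare than a single total. The following is a minimal sketch, assuming only that the graph object exposes a `node` list whose items have an `op` attribute (as a TensorFlow `GraphDef` does). The `op_histogram` helper and the stand-in `fake_graph` are hypothetical and exist only so the example runs without TensorFlow.

```python
from collections import Counter
from types import SimpleNamespace


def op_histogram(graph_def, top=10):
    """Return the `top` most frequent op types as (op, count) pairs."""
    counts = Counter(str(node.op) for node in graph_def.node)
    return counts.most_common(top)


# Hypothetical stand-in graph; in benchmark.py this would be
# frozen_graph or trt_graph instead.
fake_graph = SimpleNamespace(node=[
    SimpleNamespace(op='Conv2D'),
    SimpleNamespace(op='FusedBatchNorm'),
    SimpleNamespace(op='Conv2D'),
    SimpleNamespace(op='Relu6'),
])

histogram = op_histogram(fake_graph)
print(histogram)
```

Running it on both the original and the converted graph should show which op types disappeared and confirm whether any TRTEngineOp entries exist at all.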

I also see errors like the following in the logs:

Engine my_trt_op_0 creation for segment 0, composed of 434 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/FusedBatchNorm. Skipping...
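
Since the full logs are long, a small parser can help pull out just these engine failures. This is a sketch only: the regular expression is written against the single message format quoted above and may not match other TF-TRT versions, and `parse_engine_failures` is a name invented for this example.

```python
import re

# Pattern modelled on the one warning quoted above; other TF-TRT
# versions may word this message differently.
ENGINE_FAILURE = re.compile(
    r"Engine (?P<engine>\S+) creation for segment (?P<segment>\d+), "
    r"composed of (?P<nodes>\d+) nodes failed: "
    r"(?P<reason>.*?), at: (?P<node>\S+?)\. Skipping"
)


def parse_engine_failures(lines):
    """Return one dict per log line that reports a failed engine build."""
    failures = []
    for line in lines:
        match = ENGINE_FAILURE.search(line)
        if match:
            info = match.groupdict()
            info['segment'] = int(info['segment'])
            info['nodes'] = int(info['nodes'])
            failures.append(info)
    return failures


sample = (
    "Engine my_trt_op_0 creation for segment 0, composed of 434 nodes failed: "
    "Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: "
    "FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/FusedBatchNorm. "
    "Skipping..."
)
failures = parse_engine_failures([sample])
for f in failures:
    print('{engine}: segment {segment}, {nodes} nodes, failed at {node}'.format(**f))
```

Feeding it the lines of a saved log file (for example via `open(path)`) would list every segment that was skipped and the node where conversion stopped.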

What’s wrong? How can I make TensorRT work?

My configuration

TensorFlow 1.12.0 built from source with TRT support.
Protobuf updated according to https://devtalk.nvidia.com/default/topic/1046492/tensorrt/extremely-long-time-to-load-trt-optimized-frozen-tf-graphs/post/5315675/#5315675

$ protoc --version
libprotoc 3.6.1

TensorRT config:

$ dpkg -l | grep nvinfer
ii  libnvinfer-dev                             4.1.1-1+cuda9.2                               arm64        TensorRT development libraries and headers
ii  libnvinfer-samples                         4.1.1-1+cuda9.2                               arm64        TensorRT samples and documentation
ii  libnvinfer4                                4.1.1-1+cuda9.2                               arm64        TensorRT runtime libraries

GPU data:

$ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X2"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    6.2
  Total amount of global memory:                 6402 MBytes (6712545280 bytes)
  ( 2) Multiprocessors, (128) CUDA Cores/MP:     256 CUDA Cores
  GPU Max Clock rate:                            1275 MHz (1.27 GHz)
  Memory Clock rate:                             1600 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 1
Result = PASS

benchmark.py:

import argparse
from PIL import Image
import sys
import os
import urllib
import tensorflow.contrib.tensorrt as trt
#import matplotlib
#matplotlib.use('Agg')
#import matplotlib.pyplot as plt
#import matplotlib.patches as patches
import tensorflow as tf
import numpy as np
import time
from tf_trt_models.detection import download_detection_model, build_detection_graph

MODEL = 'ssd_inception_v2_coco'
DATA_DIR = './data/'
IMAGE_PATH = './examples/detection/data/huskies.jpg'

def parse_args():
    """Parse input arguments."""
    desc = ('TRT benchmark')
    parser = argparse.ArgumentParser(description=desc)
    parser.add_argument('--model', dest='model',
                        help='name of the object detection model [{}]'.format(MODEL),
                        default=MODEL, type=str)
    parser.add_argument('--trt', dest='use_trt',
                        help='build and test TensorRT model',
                        action='store_true')

    args = parser.parse_args()
    return args

def main():
    args = parse_args()
    print('Called with args: {}'.format(args))
    CONFIG_FILE = args.model + '.config'   # ./data/ssd_inception_v2_coco.config 
    CHECKPOINT_FILE = 'model.ckpt'    # ./data/ssd_inception_v2_coco/model.ckpt

    config_path, checkpoint_path = download_detection_model(args.model, 'data')

    frozen_graph, input_names, output_names = build_detection_graph(
        config=config_path,
        checkpoint=checkpoint_path,
        score_threshold=0.3,
        batch_size=1
    )

    print('Model: {}'.format(args.model))
    print(output_names)
    print('Total nodes in the original graph: {}'.format(len(frozen_graph.node)))

    if args.use_trt:
        trt_graph = trt.create_inference_graph(
            input_graph_def=frozen_graph,
            outputs=output_names,
            max_batch_size=1,
            max_workspace_size_bytes=1 << 25,
            precision_mode='FP16',
            minimum_segment_size=50
        )

        all_nodes = len(trt_graph.node)
        trt_engine_nodes = sum(1 for n in trt_graph.node if str(n.op) == 'TRTEngineOp')
        print('Total nodes in the optimized graph: {} out of which {} are TRTEngineOp'.format(all_nodes, trt_engine_nodes))

    print('Creating the session')

    tf_config = tf.ConfigProto()
    tf_config.gpu_options.allow_growth = True

    tf_sess = tf.Session(config=tf_config)

    if args.use_trt:
        print('Running with TRT model')
        tf.import_graph_def(trt_graph, name='')
    else:
        print('Running with ORIGINAL model')
        tf.import_graph_def(frozen_graph, name='')

    tf_input = tf_sess.graph.get_tensor_by_name(input_names[0] + ':0')
    tf_scores = tf_sess.graph.get_tensor_by_name('detection_scores:0')
    tf_boxes = tf_sess.graph.get_tensor_by_name('detection_boxes:0')
    tf_classes = tf_sess.graph.get_tensor_by_name('detection_classes:0')
    tf_num_detections = tf_sess.graph.get_tensor_by_name('num_detections:0')

    image = Image.open(IMAGE_PATH)
    image_resized = np.array(image.resize((300, 300)))
    image = np.array(image)

    print('Running the inference on a single image to warm up the net')
    t0 = time.time()
    scores, boxes, classes, num_detections = tf_sess.run([tf_scores, tf_boxes, tf_classes, tf_num_detections], feed_dict={
        tf_input: image_resized[None, ...]
    })
    t1 = time.time()
    print('Runtime: {:.2f} seconds'.format(t1 - t0))

    boxes = boxes[0] # index by 0 to remove batch dimension
    scores = scores[0]
    classes = classes[0]
    num_detections = num_detections[0]

    print('Running the benchmark')

    num_samples = 50

    t0 = time.time()
    for i in range(num_samples):
        scores, boxes, classes, num_detections = tf_sess.run([tf_scores, tf_boxes, tf_classes, tf_num_detections], feed_dict={
            tf_input: image_resized[None, ...]
        })
    t1 = time.time()
    print('Average runtime: %f seconds' % (float(t1 - t0) / num_samples))

    tf_sess.close()

if __name__ == '__main__':
    main()
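
The script above reports only a single average over 50 runs. With differences as small as the ones I measured, per-call latencies and a median may be more telling. Below is a minimal sketch of such a timing helper. `time_calls` and `summarize` are hypothetical names, and the lambda in the demo is a placeholder for the `tf_sess.run(...)` call used in benchmark.py.

```python
import statistics
import time


def time_calls(fn, num_samples=50, warmup=1):
    """Call fn() `warmup` times untimed, then return `num_samples` latencies in seconds."""
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(num_samples):
        t0 = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - t0)
    return latencies


def summarize(latencies):
    """Return mean, median, min and max of a list of latencies."""
    return {
        'mean': statistics.mean(latencies),
        'median': statistics.median(latencies),
        'min': min(latencies),
        'max': max(latencies),
    }


# Placeholder workload; in benchmark.py this would wrap the tf_sess.run call.
latencies = time_calls(lambda: sum(range(1000)), num_samples=5)
print(summarize(latencies))
```

Comparing the medians of the original and TRT runs would show whether the gap is consistent or just the effect of a few slow iterations.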

FULL LOGS

original ssd_mobilenet_v1_coco:

$ python3 benchmark.py --model ssd_mobilenet_v1_coco
Called with args: Namespace(model='ssd_mobilenet_v1_coco', use_trt=False)
--2019-02-18 04:37:35--  http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_coco_2018_01_28.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.217.20.48, 2a00:1450:4005:80a::2010
Connecting to download.tensorflow.org (download.tensorflow.org)|172.217.20.48|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76541073 (73M) [application/x-tar]
Saving to: ‘data/ssd_mobilenet_v1_coco_2018_01_28.tar.gz’

data/ssd_mobilenet_v1_coco_2018_01_28.tar.gz                           100%[============================================================================================================================================================================>]  73.00M  10.9MB/s    in 6.8s

2019-02-18 04:37:42 (10.7 MB/s) - ‘data/ssd_mobilenet_v1_coco_2018_01_28.tar.gz’ saved [76541073/76541073]

2019-02-18 04:37:44.136212: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:37:44.136411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.42GiB
2019-02-18 04:37:44.136531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:37:46.326887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:37:46.327035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:37:46.327095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:37:46.328349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:38:20.291676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:38:20.291834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:38:20.291934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:38:20.291984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:38:20.292106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:38:30.064116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:38:30.064307: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:38:30.064367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:38:30.064408: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:38:30.064522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:38:33.110725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:38:33.110866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:38:33.110909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:38:33.110947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:38:33.111053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v1_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 7571
Creating the session
2019-02-18 04:38:40.770776: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:38:40.770951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:38:40.771024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:38:40.771075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:38:40.771218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with ORIGINAL model
Running the inference on a single image to warm up the net
2019-02-18 04:38:59.904909: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:39:00.059551: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:39:00.245903: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:39:00.595751: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.26GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 15.87 seconds
Running the benchmark
Average runtime: 0.051792 seconds

ssd_mobilenet_v1_coco with TRT:

$ python3 benchmark.py --model ssd_mobilenet_v1_coco --trt
Called with args: Namespace(model='ssd_mobilenet_v1_coco', use_trt=True)
2019-02-18 04:45:21.845684: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:45:21.845872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.14GiB
2019-02-18 04:45:21.845993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:45:23.202972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:45:23.203113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:45:23.203161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:45:23.203376: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:45:57.904372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:45:57.904526: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:45:57.904569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:45:57.904611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:45:57.904770: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:46:07.725793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:07.725949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:07.725994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:46:07.726033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:46:07.726171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:46:10.766851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:10.766994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:10.767038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:46:10.767077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:46:10.767207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v1_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 7571
2019-02-18 04:46:26.024094: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-18 04:46:26.030507: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-02-18 04:46:26.037184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:26.037433: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:26.037485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:46:26.037524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:46:26.037659: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:46:32.355050: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:46:32.355353: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:46:32.494402: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:46:32.494563: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 434 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:46:36.095038: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:46:36.095299: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:46:36.370087: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:891] Failed to register segment graphdef as a function 0: Invalid argument: Cannot add function 'my_trt_op_0_native_segment' because a different function with the same name already exists.
2019-02-18 04:46:37.200207: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:46:37.444711: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:46:37.532859: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-02-18 04:46:37.533062: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 6503 nodes (-1068), 8572 edges (-1676), time = 2288.70093ms.
2019-02-18 04:46:37.533107: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 6518 nodes (15), 8598 edges (26), time = 761.058ms.
2019-02-18 04:46:37.533146: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 6518 nodes (0), 8598 edges (0), time = 3022.36499ms.
2019-02-18 04:46:37.533186: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 6518 nodes (0), 8598 edges (0), time = 802.444ms.
2019-02-18 04:46:37.533224: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 6518 nodes (0), 8598 edges (0), time = 3086.44897ms.
2019-02-18 04:46:37.533343: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2019-02-18 04:46:37.533435: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 435 nodes (0), 503 edges (0), time = 267.055ms.
2019-02-18 04:46:37.533475: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 435 nodes (0), 503 edges (0), time = 155.322ms.
2019-02-18 04:46:37.533512: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 435 nodes (0), 503 edges (0), time = 26.109ms.
2019-02-18 04:46:37.533549: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 435 nodes (0), 503 edges (0), time = 217.434ms.
2019-02-18 04:46:37.533584: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 435 nodes (0), 503 edges (0), time = 26.192ms.
Total nodes in the optimized graph: 6518 out of which 0 are TRTEngineOp
Creating the session
2019-02-18 04:46:38.540298: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:38.540442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:38.540527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:46:38.540569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:46:38.540681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with TRT model
Running the inference on a single image to warm up the net
2019-02-18 04:46:55.954166: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:46:56.109469: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:46:56.295034: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:46:56.642952: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.26GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 13.62 seconds
Running the benchmark
Average runtime: 0.053618 seconds

original ssd_mobilenet_v2_coco:

$ python3 benchmark.py --model ssd_mobilenet_v2_coco
Called with args: Namespace(model='ssd_mobilenet_v2_coco', use_trt=False)
--2019-02-18 04:47:46--  http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_coco_2018_03_29.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.217.20.48, 2a00:1450:400f:806::2010
Connecting to download.tensorflow.org (download.tensorflow.org)|172.217.20.48|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 187925923 (179M) [application/x-tar]
Saving to: ‘data/ssd_mobilenet_v2_coco_2018_03_29.tar.gz’

data/ssd_mobilenet_v2_coco_2018_03_29.tar.gz                           100%[============================================================================================================================================================================>] 179.22M  10.3MB/s    in 18s

2019-02-18 04:48:04 (10.1 MB/s) - ‘data/ssd_mobilenet_v2_coco_2018_03_29.tar.gz’ saved [187925923/187925923]

2019-02-18 04:48:08.325486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:48:08.325631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 3.74GiB
2019-02-18 04:48:08.325694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:48:09.640993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:48:09.641157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:48:09.641197: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:48:09.641522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:48:48.706244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:48:48.706398: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:48:48.706442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:48:48.706478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:48:48.706590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:49:00.113150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:49:00.113377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:49:00.113423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:49:00.113471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:49:00.113610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:49:04.441489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:49:04.441627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:49:04.441673: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:49:04.441713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:49:04.441841: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8062
Creating the session
2019-02-18 04:49:13.197948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:49:13.198120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:49:13.198175: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:49:13.198224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:49:13.198413: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with ORIGINAL model
Running the inference on a single image to warm up the net
2019-02-18 04:49:38.590984: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.53GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:49:38.607152: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.84GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 20.24 seconds
Running the benchmark
Average runtime: 0.084560 seconds

ssd_mobilenet_v2_coco with TRT:

$ python3 benchmark.py --model ssd_mobilenet_v2_coco --trt
Called with args: Namespace(model='ssd_mobilenet_v2_coco', use_trt=True)
2019-02-18 04:50:30.934503: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:50:30.934779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.41GiB
2019-02-18 04:50:30.934939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:50:32.242912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:50:32.243053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:50:32.243093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:50:32.243487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:51:11.865247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:11.865473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:11.865536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:51:11.865573: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:51:11.865702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:51:23.567995: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:23.568248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:23.568312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:51:23.568349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:51:23.568464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:51:27.799623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:27.799797: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:27.799840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:51:27.799877: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:51:27.799983: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8062
2019-02-18 04:51:45.670572: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-18 04:51:45.676597: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-02-18 04:51:45.680622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:45.680794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:45.680839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:51:45.680884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:51:45.681008: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:51:53.469662: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:51:53.470062: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:51:53.727136: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:51:53.727286: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 780 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV2/Conv/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:51:58.248570: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:51:58.249029: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:51:59.108786: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:891] Failed to register segment graphdef as a function 0: Invalid argument: Cannot add function 'my_trt_op_0_native_segment' because a different function with the same name already exists.
2019-02-18 04:52:00.692498: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:52:01.438146: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:52:01.613665: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-02-18 04:52:01.613843: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 6850 nodes (-1212), 8953 edges (-1820), time = 2828.16ms.
2019-02-18 04:52:01.613888: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 6865 nodes (15), 8979 edges (26), time = 835.079ms.
2019-02-18 04:52:01.613927: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 6865 nodes (0), 8979 edges (0), time = 3860.146ms.
2019-02-18 04:52:01.613970: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 6865 nodes (0), 8979 edges (0), time = 994.512ms.
2019-02-18 04:52:01.614012: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 6865 nodes (0), 8979 edges (0), time = 4454.34082ms.
2019-02-18 04:52:01.614078: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2019-02-18 04:52:01.614118: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 781 nodes (0), 883 edges (0), time = 624.612ms.
2019-02-18 04:52:01.614207: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 781 nodes (0), 883 edges (0), time = 396.25ms.
2019-02-18 04:52:01.614248: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 781 nodes (0), 883 edges (0), time = 49.671ms.
2019-02-18 04:52:01.614286: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 781 nodes (0), 883 edges (0), time = 695.233ms.
2019-02-18 04:52:01.614322: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 781 nodes (0), 883 edges (0), time = 53.073ms.
Total nodes in the optimized graph: 6865 out of which 0 are TRTEngineOp
Creating the session
2019-02-18 04:52:03.094518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:52:03.094648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:52:03.094692: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:52:03.094730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:52:03.094885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with TRT model
Running the inference on a single image to warm up the net
2019-02-18 04:52:38.719800: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.84GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 31.43 seconds
Running the benchmark
Average runtime: 0.093455 seconds

original ssd_inception_v2_coco:

$ python3 benchmark.py --model ssd_inception_v2_coco
Called with args: Namespace(model='ssd_inception_v2_coco', use_trt=False)
2019-02-18 04:10:13.974149: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:10:13.974348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.29GiB
2019-02-18 04:10:13.974412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:10:15.398904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:10:15.399058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:10:15.399103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:10:15.399360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:10:58.050991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:10:58.051141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:10:58.051184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:10:58.051231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:10:58.051349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:11:10.699534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:11:10.699689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:11:10.699743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:11:10.699792: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:11:10.699910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:11:15.841888: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:11:15.842028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:11:15.842070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:11:15.842106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:11:15.842217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_inception_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8278
Creating the session
2019-02-18 04:11:26.078042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:11:26.078225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:11:26.078275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:11:26.078319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:11:26.078547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with ORIGINAL model
Running the inference on a single image to warm up the net
2019-02-18 04:11:57.791649: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 26.46 seconds
Running the benchmark
Average runtime: 0.100977 seconds

ssd_inception_v2_coco with TRT:

nvidia@dpx2tegraa-lund:~/dariusz/projects/nvidia/tf_trt_models$ python3 benchmark.py --model ssd_inception_v2_coco --trt
Called with args: Namespace(model='ssd_inception_v2_coco', use_trt=True)
2019-02-18 04:18:05.555255: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:18:05.555595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.54GiB
2019-02-18 04:18:05.555680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:18:07.756886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:18:07.757042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:18:07.757086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:18:07.757460: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:18:50.782544: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:18:50.782700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:18:50.782752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:18:50.782788: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:18:50.782916: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:19:03.670306: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:19:03.670446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:19:03.670493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:19:03.670534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:19:03.670669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:19:08.911273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:19:08.911411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:19:08.911473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:19:08.911513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:19:08.911773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_inception_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8278
2019-02-18 04:19:29.284995: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-18 04:19:29.290733: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-02-18 04:19:29.295736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:19:29.295909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:19:29.295956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:19:29.295996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:19:29.296253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:19:38.501962: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:19:38.502486: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:19:38.903763: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:19:38.903994: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 931 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:19:45.094916: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:19:45.095409: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:19:46.228436: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:19:46.228742: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 931 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:19:48.180974: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:19:48.992334: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:19:49.170470: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-02-18 04:19:49.170642: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 7000 nodes (-1278), 9181 edges (-1886), time = 3424.68ms.
2019-02-18 04:19:49.170691: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 7025 nodes (25), 9207 edges (26), time = 979.995ms.
2019-02-18 04:19:49.170731: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 7025 nodes (0), 9207 edges (0), time = 4509.50488ms.
2019-02-18 04:19:49.170900: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 7015 nodes (-10), 9207 edges (0), time = 1884.54199ms.
2019-02-18 04:19:49.170941: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 7015 nodes (0), 9207 edges (0), time = 5634.18701ms.
2019-02-18 04:19:49.170979: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2019-02-18 04:19:49.171017: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 932 nodes (0), 1112 edges (0), time = 784.723ms.
2019-02-18 04:19:49.171070: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Invalid argument: The graph is already optimized by layout optimizer.
2019-02-18 04:19:49.171108: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 932 nodes (0), 1112 edges (0), time = 53.572ms.
2019-02-18 04:19:49.171146: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 932 nodes (0), 1112 edges (0), time = 757.245ms.
2019-02-18 04:19:49.171181: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 932 nodes (0), 1112 edges (0), time = 54.12ms.
Total nodes in the optimized graph: 7015 out of which 0 are TRTEngineOp
Creating the session
2019-02-18 04:19:51.273352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:19:51.273486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:19:51.273532: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:19:51.273572: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:19:51.273691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with TRT model
Running the inference on a single image to warm up the net
2019-02-18 04:20:35.683734: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 38.11 seconds
Running the benchmark
Average runtime: 0.106853 seconds
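
For reference, the "Total nodes in the optimized graph: ... out of which 0 are TRTEngineOp" lines above come from a count over the graph's nodes. The following is only a sketch of how such a count might be done, not the actual code in benchmark.py. `count_ops` and `FakeNode` are hypothetical names; `FakeNode` stands in for the entries of a GraphDef's `node` field so the snippet runs without TensorFlow.

```python
from collections import Counter, namedtuple


def count_ops(nodes):
    """Return a Counter mapping op type -> number of nodes of that type.

    `nodes` is any iterable of objects with an `op` attribute, for example
    the `node` field of a TensorFlow GraphDef.
    """
    return Counter(node.op for node in nodes)


# Hypothetical stand-in for graph_def.node (assumption, not from benchmark.py).
FakeNode = namedtuple("FakeNode", ["name", "op"])
nodes = [
    FakeNode("image_tensor", "Placeholder"),
    FakeNode("FeatureExtractor/Conv", "Conv2D"),
    FakeNode("my_trt_op_0", "TRTEngineOp"),
]

ops = count_ops(nodes)
print("Total nodes: {} out of which {} are TRTEngineOp".format(
    len(nodes), ops["TRTEngineOp"]))
```

With a real graph, `count_ops(graph_def.node)` would be called on the converted GraphDef; a count of 0 for `TRTEngineOp` means no segment was replaced by a TensorRT engine.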

Additional information: when I enable Python logging by adding the following code at the very beginning of main() in benchmark.py:

    # Needs `import logging` and `import sys` at the top of benchmark.py.
    logging.basicConfig(level=logging.DEBUG,
                        format='%(asctime)s.%(msecs)03d %(levelname)-8s %(threadName)-10s %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S',
                        handlers=[
                            logging.FileHandler('benchmark.log', 'w'),  # mode 'w' for overwrite, 'a' for append
                            logging.StreamHandler(sys.stdout)
                        ])
    logger = logging.getLogger(__name__)
    # Ask the tensorflow logger not to propagate logs to its parent (which
    # causes duplicated logging)
    logging.getLogger('tensorflow').propagate = False

I see that TensorFlow claims it runs against TensorRT version 4.0.0, even though I have version 4.1.1 installed (see above for environment details). TensorFlow was built on the very same machine with no changes to TensorRT whatsoever.

Total nodes in the original graph: 8062
INFO:tensorflow:Running against TensorRT version 4.0.0
2019-02-18 06:38:11.224120: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-18 06:38:11.233967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 06:38:11.234015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 06:38:11.234055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 06:38:11.234180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3849 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)

As you can see, it also claims that the number of eligible GPUs is zero, but it still creates a TensorFlow device.
I tried the same with

export TF_MIN_GPU_MULTIPROCESSOR_COUNT=2

but there was no difference.

Do the TensorRT version mismatch and the lack of an eligible GPU matter in this case?
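
A small check like the following could make the version mismatch explicit. It is only a sketch: `reported_trt_version` and `version_tuple` are hypothetical helper names, the log line is the one quoted above, and 4.1.1 is the installed version as stated in this post. It detects the mismatch but does not explain its cause.

```python
import re

# Matches the INFO line TensorFlow printed in the log excerpt above.
LOG_RE = re.compile(r"Running against TensorRT version (\d+)\.(\d+)\.(\d+)")


def reported_trt_version(log_line):
    """Return the (major, minor, patch) version named in the log line, or None."""
    match = LOG_RE.search(log_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def version_tuple(text):
    """Turn a plain 'X.Y.Z' string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


reported = reported_trt_version(
    "INFO:tensorflow:Running against TensorRT version 4.0.0")
installed = version_tuple("4.1.1")  # installed version as stated in the post

if reported != installed:
    print("Mismatch: TensorFlow reports {}, installed is {}".format(
        reported, installed))
```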

Another question: looking closely at the logs, one can see errors that cause the creation of my_trt_op_0 to be skipped:

2019-02-18 04:51:53.727136: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:51:53.727286: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 780 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV2/Conv/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:51:58.248570: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:51:58.249029: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:51:59.108786: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:891] Failed to register segment graphdef as a function 0: Invalid argument: Cannot add function 'my_trt_op_0_native_segment' because a different function with the same name already exists.

Why does this happen, and how can I avoid it?
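
To pull these failures out of a long log, a scan along the following lines might help. This is a sketch only: the regular expression is written against the exact message format quoted above and may not match other TensorFlow versions, and `find_failed_engines` is a hypothetical helper name.

```python
import re

# Pattern for the "Engine ... creation ... failed" warning quoted above.
FAILED_ENGINE = re.compile(
    r"Engine (?P<engine>\S+) creation for segment (?P<segment>\d+), "
    r"composed of (?P<nodes>\d+) nodes failed: (?P<reason>.*?), "
    r"at: (?P<layer>\S+?)\. Skipping"
)


def find_failed_engines(log_text):
    """Return one dict per failed engine found in the log text."""
    failures = []
    for line in log_text.splitlines():
        match = FAILED_ENGINE.search(line)
        if match:
            info = match.groupdict()
            info["segment"] = int(info["segment"])
            info["nodes"] = int(info["nodes"])
            failures.append(info)
    return failures


sample = """\
2019-02-18 04:51:53.727136: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:51:53.727286: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 780 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV2/Conv/BatchNorm/FusedBatchNorm. Skipping...
"""

failures = find_failed_engines(sample)
for failure in failures:
    print("{engine}: {nodes} nodes skipped, failing layer: {layer}".format(**failure))
```

Run over the full benchmark output, this lists which engine was skipped, how many nodes the segment contained, and the layer at which conversion stopped.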

Hello,

Using benchmark.py on a TX2 with JetPack 3.3, I’m seeing performance improvements with TRT vs. TF (average runtime in seconds):

Model                    TF          TRT
ssd_mobilenet_v1_coco    0.049736    0.036951
ssd_mobilenet_v2_coco    0.102131    0.042651
ssd_inception_v2_coco    0.1101      0.040059
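
Expressed as speedup factors, these timings work out to roughly 1.3x to 2.7x. The following sketch shows the arithmetic; the `timings` dictionary and the `speedup` helper are illustrative names, and the numbers are copied from the table above.

```python
# (TF seconds, TRT seconds) per model, copied from the table above.
timings = {
    "ssd_mobilenet_v1_coco": (0.049736, 0.036951),
    "ssd_mobilenet_v2_coco": (0.102131, 0.042651),
    "ssd_inception_v2_coco": (0.1101, 0.040059),
}


def speedup(baseline_seconds, optimized_seconds):
    """Return how many times faster the optimized run is (>1 means faster)."""
    return baseline_seconds / optimized_seconds


speedups = {
    model: speedup(tf_s, trt_s) for model, (tf_s, trt_s) in timings.items()
}
for model, factor in sorted(speedups.items()):
    print("{:<24s}{:.2f}x".format(model, factor))
```

Applied to the Drive PX 2 runs earlier in the thread (0.084560 s original vs. 0.093455 s with TRT for ssd_mobilenet_v2_coco), the same ratio comes out below 1, i.e. a slight slowdown.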

My GPU

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X2"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    6.2
  Total amount of global memory:                 7846 MBytes (8227401728 bytes)
  ( 2) Multiprocessors, (128) CUDA Cores/MP:     256 CUDA Cores
  GPU Max Clock rate:                            1301 MHz (1.30 GHz)
  Memory Clock rate:                             1600 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS

TRT config

nvidia@tegra-ubuntu:/usr/local/cuda/samples/1_Utilities/deviceQuery$ dpkg -l | grep nvinfer
ii  libnvinfer-dev                              4.1.3-1+cuda9.0                               arm64        TensorRT development libraries and headers
ii  libnvinfer-samples                          4.1.3-1+cuda9.0                               arm64        TensorRT samples and documentation
ii  libnvinfer4                                 4.1.3-1+cuda9.0                               arm64        TensorRT runtime libraries

TensorFlow config

root@tegra-ubuntu:/home/scratch.zhenyih_sw/jetson/tf_trt_models# python
Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
>>> tensorflow.__version__
'1.11.0'

The "Number of eligible GPUs (core count >= 8): 0" message is expected. I’m not seeing the op-skipping messages you are seeing.

root@tegra-ubuntu:/home/scratch.zhenyih_sw/jetson/tf_trt_models# python benchmark.py --model ssd_inception_v2_coco --trt
Called with args: Namespace(model='ssd_inception_v2_coco', use_trt=True)
2019-02-21 21:28:16.221037: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-21 21:28:16.221172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.66GiB freeMemory: 2.57GiB
2019-02-21 21:28:16.221230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:28:17.428181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:28:17.428280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2019-02-21 21:28:17.428309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2019-02-21 21:28:17.428595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python2.7/site-packages/object_detection-0.1-py2.7.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-21 21:29:11.710866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:29:11.711111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:29:11.711154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2019-02-21 21:29:11.711184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2019-02-21 21:29:11.711289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-21 21:30:03.883130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:30:03.883377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:30:03.883438: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2019-02-21 21:30:03.883470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2019-02-21 21:30:03.883586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-21 21:30:18.984016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:30:18.984116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:30:18.984149: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2019-02-21 21:30:18.984172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2019-02-21 21:30:18.984279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_inception_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8278
2019-02-21 21:31:22.059748: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-21 21:31:22.060275: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-02-21 21:31:22.060798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:31:22.060938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:31:22.060997: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2019-02-21 21:31:22.061025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2019-02-21 21:31:22.061138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-21 21:31:34.270468: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2952] Segment @scope '', converted to graph
2019-02-21 21:31:34.270783: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-21 21:33:22.447185: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:952] Engine my_trt_op_0 creation for segment 0, composed of 931 nodes succeeded.
2019-02-21 21:33:27.242045: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-21 21:33:27.841326: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-21 21:33:28.027736: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:404] Optimization results for grappler item: tf_graph
2019-02-21 21:33:28.027989: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   constant folding: Graph size after: 7000 nodes (-1278), 9181 edges (-1886), time = 2841.95898ms.
2019-02-21 21:33:28.028026: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   layout: Graph size after: 7025 nodes (25), 9207 edges (26), time = 754.749ms.
2019-02-21 21:33:28.028052: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   TensorRTOptimizer: Graph size after: 6095 nodes (-930), 8096 edges (-1111), time = 111755.695ms.
2019-02-21 21:33:28.028081: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   constant folding: Graph size after: 6085 nodes (-10), 8096 edges (0), time = 1223.552ms.
2019-02-21 21:33:28.028106: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   TensorRTOptimizer: Graph size after: 6085 nodes (0), 8096 edges (0), time = 2281.38696ms.
2019-02-21 21:33:28.028129: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:404] Optimization results for grappler item: my_trt_op_0_native_segment
2019-02-21 21:33:28.028154: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   constant folding: Graph size after: 932 nodes (0), 1112 edges (0), time = 566.961ms.
2019-02-21 21:33:28.028183: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   layout: Invalid argument: The graph is already optimized by layout optimizer.
2019-02-21 21:33:28.028215: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   TensorRTOptimizer: Graph size after: 932 nodes (0), 1112 edges (0), time = 57.022ms.
2019-02-21 21:33:28.028240: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   constant folding: Graph size after: 932 nodes (0), 1112 edges (0), time = 541.715ms.
2019-02-21 21:33:28.028264: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   TensorRTOptimizer: Graph size after: 932 nodes (0), 1112 edges (0), time = 55.732ms.
Total nodes in the optimized graph: 6085 out of which 1 are TRTEngineOp
Creating the session
2019-02-21 21:35:02.447477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:35:02.447599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:35:02.447636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2019-02-21 21:35:02.447668: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2019-02-21 21:35:02.447782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with TRT model
Running the inference on a single image to warm up the net
Runtime: 11.35 seconds
Running the benchmark
Average runtime: 0.040059 seconds
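
To compare an original run against a TRT run without reading the whole log, the few lines that matter can be pulled out with a small parser. This is only a sketch: the function name summarize_benchmark_log is hypothetical, and the patterns assume the log wording stays exactly as in the output pasted in this thread.

```python
import re


def summarize_benchmark_log(log_text):
    """Extract node counts, TRTEngineOp count, engine results and average runtime."""
    summary = {"engines_ok": 0, "engines_failed": 0}

    m = re.search(r"Total nodes in the original graph: (\d+)", log_text)
    if m:
        summary["original_nodes"] = int(m.group(1))

    m = re.search(
        r"Total nodes in the optimized graph: (\d+) out of which (\d+) are TRTEngineOp",
        log_text,
    )
    if m:
        summary["optimized_nodes"] = int(m.group(1))
        summary["trt_engine_ops"] = int(m.group(2))

    m = re.search(r"Average runtime: ([0-9.]+) seconds", log_text)
    if m:
        summary["avg_runtime_s"] = float(m.group(1))

    # Count TF-TRT engine creation attempts by outcome.
    pattern = r"Engine \S+ creation for segment \d+, composed of \d+ nodes (succeeded|failed)"
    for m in re.finditer(pattern, log_text):
        key = "engines_ok" if m.group(1) == "succeeded" else "engines_failed"
        summary[key] += 1

    return summary
```

A result with trt_engine_ops equal to 0, or with engines_failed above 0, points to the conversion problem described in the first post rather than to a benchmarking issue.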

I’d recommend reflashing your Xavier to resolve any mixed TensorRT version issues, then following the instructions at https://github.com/NVIDIA-AI-IOT/tf_trt_models.
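
Before reflashing, a quick way to look for mixed TensorRT packages is to compare the versions reported by dpkg -l | grep nvinfer. The sketch below assumes the usual dpkg -l column layout (status, name, version, architecture, description); the helper names nvinfer_versions and has_mixed_versions are hypothetical. It only inspects the installed packages, so it will not catch a TensorFlow build that was linked against a different TensorRT.

```python
def nvinfer_versions(dpkg_output):
    """Map each installed nvinfer package name to its version string."""
    versions = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Installed packages are listed as: ii <name> <version> <arch> <description...>
        if len(fields) >= 3 and fields[0] == "ii" and "nvinfer" in fields[1]:
            versions[fields[1]] = fields[2]
    return versions


def has_mixed_versions(versions):
    """Return True if the packages report more than one distinct version."""
    return len(set(versions.values())) > 1
```

Applied to the three libnvinfer lines pasted earlier in this reply, every package reports 4.1.3-1+cuda9.0, so has_mixed_versions returns False for that setup.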