No improvements from TensorRT on NVIDIA-AI-IOT/tf_trt_models

dariusz.filipski · February 18, 2019, 1:25pm

I can’t get any improvement from TensorRT on a Drive PX 2 AutoChauffeur (P2379, the one without a dGPU). I simply cloned your Jetson example from https://github.com/NVIDIA-AI-IOT/tf_trt_models and created a benchmark.py script, which is little more than a copy-paste of https://github.com/NVIDIA-AI-IOT/tf_trt_models/blob/master/examples/detection/detection.ipynb. Since the Jetson TX2 has specs similar to one node of my Drive PX 2, I expected values and improvements similar to those shown in the table at https://github.com/NVIDIA-AI-IOT/tf_trt_models#models-1
Unfortunately, in my case I see no difference in inference speed between the original models and TensorRT ones (I could even argue there’s a slight drop in performance). Here’s what I see (full logs below):

ssd_mobilenet_v1_coco: Original - 0.051792s, TRT - 0.053618s
ssd_mobilenet_v2_coco: Original - 0.084560s, TRT - 0.093455s
ssd_inception_v2_coco: Original - 0.100977s, TRT - 0.106853s
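
To put a number on the "slight drop", the relative change can be computed from the averages above. This is only a sketch: the helper name `relative_change` is made up for illustration, and the inputs are the three pairs of timings listed here, nothing else.

```python
# Average per-image runtimes in seconds, copied from the figures above:
# (original model, TensorRT-optimized model).
timings = {
    'ssd_mobilenet_v1_coco': (0.051792, 0.053618),
    'ssd_mobilenet_v2_coco': (0.084560, 0.093455),
    'ssd_inception_v2_coco': (0.100977, 0.106853),
}


def relative_change(original, trt):
    """Return the runtime change of the TRT run relative to the original, in percent.

    A positive value means the TRT run was slower.
    """
    return (trt - original) / original * 100.0


changes = {model: relative_change(o, t) for model, (o, t) in timings.items()}
for model, pct in changes.items():
    print('{}: {:+.1f}% runtime with TRT'.format(model, pct))
```

All three values come out positive (roughly 3.5%, 10.5% and 5.8%), i.e. the TRT runs were slower in each case, although a single 50-sample average per configuration does not show how much of that is noise.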

Taking a closer look, it seems that TRT slims the graph down by ~1000 nodes but fails to convert anything into a TRTEngineOp:

ssd_mobilenet_v1_coco: Original - 7571 nodes, TRT - 6518 nodes out of which 0 are TRTEngineOp
ssd_mobilenet_v2_coco: Original - 8062 nodes, TRT - 6865 nodes out of which 0 are TRTEngineOp
ssd_inception_v2_coco: Original - 8278 nodes, TRT - 7015 nodes out of which 0 are TRTEngineOp
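
The node counts above come from iterating over the graph's nodes and checking each node's `op` field. A slightly more general sketch is an op-type histogram, which shows at a glance whether any `TRTEngineOp` nodes exist and which op types dominate. The helper name `op_histogram` is hypothetical, and the stand-in graph built with `SimpleNamespace` exists only so the snippet runs without TensorFlow; with a real `GraphDef` you would pass `trt_graph` or `frozen_graph` instead.

```python
from collections import Counter
from types import SimpleNamespace


def op_histogram(graph_def):
    """Count nodes per op type.

    Works on any object exposing a `.node` iterable whose items have an
    `.op` attribute, which is the shape of a TensorFlow GraphDef.
    """
    return Counter(node.op for node in graph_def.node)


# Stand-in for a GraphDef so the example is runnable on its own (assumption:
# a real graph would come from trt.create_inference_graph or similar).
fake_graph = SimpleNamespace(node=[
    SimpleNamespace(op='Conv2D'),
    SimpleNamespace(op='Conv2D'),
    SimpleNamespace(op='FusedBatchNorm'),
    SimpleNamespace(op='Relu6'),
])

hist = op_histogram(fake_graph)
total = sum(hist.values())
# Counter returns 0 for missing keys, so this is safe when no engine was built.
print('Total nodes: {}, TRTEngineOp nodes: {}'.format(total, hist['TRTEngineOp']))
for op, count in hist.most_common(3):
    print('  {}: {}'.format(op, count))
```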

I see errors like the following one in the logs as well:

Engine my_trt_op_0 creation for segment 0, composed of 434 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/FusedBatchNorm. Skipping...

What’s wrong? How do I make TensorRT work?

My configuration

TensorFlow 1.12.0 built from source with TRT support.
Protobuf updated according to https://devtalk.nvidia.com/default/topic/1046492/tensorrt/extremely-long-time-to-load-trt-optimized-frozen-tf-graphs/post/5315675/#5315675

$ protoc --version
libprotoc 3.6.1

TensorRT config:

$ dpkg -l | grep nvinfer
ii  libnvinfer-dev                             4.1.1-1+cuda9.2                               arm64        TensorRT development libraries and headers
ii  libnvinfer-samples                         4.1.1-1+cuda9.2                               arm64        TensorRT samples and documentation
ii  libnvinfer4                                4.1.1-1+cuda9.2                               arm64        TensorRT runtime libraries

GPU data:

$ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X2"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    6.2
  Total amount of global memory:                 6402 MBytes (6712545280 bytes)
  ( 2) Multiprocessors, (128) CUDA Cores/MP:     256 CUDA Cores
  GPU Max Clock rate:                            1275 MHz (1.27 GHz)
  Memory Clock rate:                             1600 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 1
Result = PASS

benchmark.py:

import argparse
import time

import numpy as np
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt
from PIL import Image
from tf_trt_models.detection import download_detection_model, build_detection_graph

MODEL = 'ssd_inception_v2_coco'
DATA_DIR = './data/'
IMAGE_PATH = './examples/detection/data/huskies.jpg'

def parse_args():
    """Parse input arguments."""
    desc = ('TRT benchmark')
    parser = argparse.ArgumentParser(description=desc)
    parser.add_argument('--model', dest='model',
                        help='name of the object detection model [{}]'.format(MODEL),
                        default=MODEL, type=str)
    parser.add_argument('--trt', dest='use_trt',
                        help='build and test TensorRT model',
                        action='store_true')

    args = parser.parse_args()
    return args

def main():
    args = parse_args()
    print('Called with args: {}'.format(args))
    CONFIG_FILE = args.model + '.config'   # ./data/ssd_inception_v2_coco.config 
    CHECKPOINT_FILE = 'model.ckpt'    # ./data/ssd_inception_v2_coco/model.ckpt

    config_path, checkpoint_path = download_detection_model(args.model, 'data')

    frozen_graph, input_names, output_names = build_detection_graph(
        config=config_path,
        checkpoint=checkpoint_path,
        score_threshold=0.3,
        batch_size=1
    )

    print('Model: {}'.format(args.model))
    print(output_names)
    print('Total nodes in the original graph: {}'.format(len(frozen_graph.node)))

    if args.use_trt:
        trt_graph = trt.create_inference_graph(
            input_graph_def=frozen_graph,
            outputs=output_names,
            max_batch_size=1,
            max_workspace_size_bytes=1 << 25,
            precision_mode='FP16',
            minimum_segment_size=50
        )

        all_nodes = len(trt_graph.node)
        trt_engine_nodes = sum(1 for n in trt_graph.node if n.op == 'TRTEngineOp')
        print('Total nodes in the optimized graph: {} out of which {} are TRTEngineOp'.format(all_nodes, trt_engine_nodes))

    print('Creating the session')

    tf_config = tf.ConfigProto()
    tf_config.gpu_options.allow_growth = True

    tf_sess = tf.Session(config=tf_config)

    if args.use_trt:
        print('Running with TRT model')
        tf.import_graph_def(trt_graph, name='')
    else:
        print('Running with ORIGINAL model')
        tf.import_graph_def(frozen_graph, name='')

    tf_input = tf_sess.graph.get_tensor_by_name(input_names[0] + ':0')
    tf_scores = tf_sess.graph.get_tensor_by_name('detection_scores:0')
    tf_boxes = tf_sess.graph.get_tensor_by_name('detection_boxes:0')
    tf_classes = tf_sess.graph.get_tensor_by_name('detection_classes:0')
    tf_num_detections = tf_sess.graph.get_tensor_by_name('num_detections:0')

    image = Image.open(IMAGE_PATH)
    image_resized = np.array(image.resize((300, 300)))
    image = np.array(image)

    print('Running the inference on a single image to warm up the net')
    t0 = time.time()
    scores, boxes, classes, num_detections = tf_sess.run([tf_scores, tf_boxes, tf_classes, tf_num_detections], feed_dict={
        tf_input: image_resized[None, ...]
    })
    t1 = time.time()
    print('Runtime: {:.2f} seconds'.format(t1 - t0))

    boxes = boxes[0] # index by 0 to remove batch dimension
    scores = scores[0]
    classes = classes[0]
    num_detections = num_detections[0]

    print('Running the benchmark')

    num_samples = 50

    t0 = time.time()
    for i in range(num_samples):
        scores, boxes, classes, num_detections = tf_sess.run([tf_scores, tf_boxes, tf_classes, tf_num_detections], feed_dict={
            tf_input: image_resized[None, ...]
        })
    t1 = time.time()
    print('Average runtime: %f seconds' % (float(t1 - t0) / num_samples))

    tf_sess.close()

if __name__ == '__main__':
    main()
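
The script above reports a single average over 50 runs, which hides run-to-run variation. A possible refinement, shown here only as a sketch, is to record each iteration separately and report the spread, so that a difference of a few milliseconds can be judged against the noise. The names `time_runs` and `summarize_latencies` are hypothetical, and the lambda is a stand-in workload rather than a real inference call.

```python
import statistics
import time


def time_runs(run_once, num_samples=50, warmup=1):
    """Call run_once() repeatedly and return per-call latencies in seconds.

    The first `warmup` calls are executed but not timed.
    """
    for _ in range(warmup):
        run_once()
    latencies = []
    for _ in range(num_samples):
        start = time.perf_counter()
        run_once()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize_latencies(latencies):
    """Return basic summary statistics for a list of latencies."""
    return {
        'mean': statistics.mean(latencies),
        'median': statistics.median(latencies),
        'stdev': statistics.stdev(latencies) if len(latencies) > 1 else 0.0,
        'min': min(latencies),
        'max': max(latencies),
    }


# Stand-in workload so the example runs without TensorFlow.
stats = summarize_latencies(time_runs(lambda: sum(range(1000)), num_samples=20))
print('mean {mean:.6f}s, median {median:.6f}s, stdev {stdev:.6f}s'.format(**stats))
```

In benchmark.py, `run_once` could be a small function wrapping the existing `tf_sess.run(...)` call with the same `feed_dict`.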

FULL LOGS

original ssd_mobilenet_v1_coco:

$ python3 benchmark.py --model ssd_mobilenet_v1_coco
Called with args: Namespace(model='ssd_mobilenet_v1_coco', use_trt=False)
--2019-02-18 04:37:35--  http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_coco_2018_01_28.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.217.20.48, 2a00:1450:4005:80a::2010
Connecting to download.tensorflow.org (download.tensorflow.org)|172.217.20.48|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76541073 (73M) [application/x-tar]
Saving to: ‘data/ssd_mobilenet_v1_coco_2018_01_28.tar.gz’

data/ssd_mobilenet_v1_coco_2018_01_28.tar.gz                           100%[============================================================================================================================================================================>]  73.00M  10.9MB/s    in 6.8s

2019-02-18 04:37:42 (10.7 MB/s) - ‘data/ssd_mobilenet_v1_coco_2018_01_28.tar.gz’ saved [76541073/76541073]

2019-02-18 04:37:44.136212: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:37:44.136411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.42GiB
2019-02-18 04:37:44.136531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:37:46.326887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:37:46.327035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:37:46.327095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:37:46.328349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:38:20.291676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:38:20.291834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:38:20.291934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:38:20.291984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:38:20.292106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:38:30.064116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:38:30.064307: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:38:30.064367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:38:30.064408: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:38:30.064522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:38:33.110725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:38:33.110866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:38:33.110909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:38:33.110947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:38:33.111053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v1_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 7571
Creating the session
2019-02-18 04:38:40.770776: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:38:40.770951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:38:40.771024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:38:40.771075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:38:40.771218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with ORIGINAL model
Running the inference on a single image to warm up the net
2019-02-18 04:38:59.904909: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:39:00.059551: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:39:00.245903: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:39:00.595751: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.26GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 15.87 seconds
Running the benchmark
Average runtime: 0.051792 seconds

ssd_mobilenet_v1_coco with TRT:

$ python3 benchmark.py --model ssd_mobilenet_v1_coco --trt
Called with args: Namespace(model='ssd_mobilenet_v1_coco', use_trt=True)
2019-02-18 04:45:21.845684: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:45:21.845872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.14GiB
2019-02-18 04:45:21.845993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:45:23.202972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:45:23.203113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:45:23.203161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:45:23.203376: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:45:57.904372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:45:57.904526: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:45:57.904569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:45:57.904611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:45:57.904770: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:46:07.725793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:07.725949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:07.725994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:46:07.726033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:46:07.726171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:46:10.766851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:10.766994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:10.767038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:46:10.767077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:46:10.767207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v1_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 7571
2019-02-18 04:46:26.024094: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-18 04:46:26.030507: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-02-18 04:46:26.037184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:26.037433: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:26.037485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:46:26.037524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:46:26.037659: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:46:32.355050: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:46:32.355353: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:46:32.494402: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:46:32.494563: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 434 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:46:36.095038: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:46:36.095299: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:46:36.370087: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:891] Failed to register segment graphdef as a function 0: Invalid argument: Cannot add function 'my_trt_op_0_native_segment' because a different function with the same name already exists.
2019-02-18 04:46:37.200207: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:46:37.444711: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:46:37.532859: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-02-18 04:46:37.533062: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 6503 nodes (-1068), 8572 edges (-1676), time = 2288.70093ms.
2019-02-18 04:46:37.533107: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 6518 nodes (15), 8598 edges (26), time = 761.058ms.
2019-02-18 04:46:37.533146: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 6518 nodes (0), 8598 edges (0), time = 3022.36499ms.
2019-02-18 04:46:37.533186: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 6518 nodes (0), 8598 edges (0), time = 802.444ms.
2019-02-18 04:46:37.533224: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 6518 nodes (0), 8598 edges (0), time = 3086.44897ms.
2019-02-18 04:46:37.533343: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2019-02-18 04:46:37.533435: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 435 nodes (0), 503 edges (0), time = 267.055ms.
2019-02-18 04:46:37.533475: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 435 nodes (0), 503 edges (0), time = 155.322ms.
2019-02-18 04:46:37.533512: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 435 nodes (0), 503 edges (0), time = 26.109ms.
2019-02-18 04:46:37.533549: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 435 nodes (0), 503 edges (0), time = 217.434ms.
2019-02-18 04:46:37.533584: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 435 nodes (0), 503 edges (0), time = 26.192ms.
Total nodes in the optimized graph: 6518 out of which 0 are TRTEngineOp
Creating the session
2019-02-18 04:46:38.540298: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:38.540442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:38.540527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:46:38.540569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:46:38.540681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with TRT model
Running the inference on a single image to warm up the net
2019-02-18 04:46:55.954166: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:46:56.109469: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:46:56.295034: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:46:56.642952: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.26GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 13.62 seconds
Running the benchmark
Average runtime: 0.053618 seconds

original ssd_mobilenet_v2_coco:

$ python3 benchmark.py --model ssd_mobilenet_v2_coco
Called with args: Namespace(model='ssd_mobilenet_v2_coco', use_trt=False)
--2019-02-18 04:47:46--  http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_coco_2018_03_29.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.217.20.48, 2a00:1450:400f:806::2010
Connecting to download.tensorflow.org (download.tensorflow.org)|172.217.20.48|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 187925923 (179M) [application/x-tar]
Saving to: ‘data/ssd_mobilenet_v2_coco_2018_03_29.tar.gz’

data/ssd_mobilenet_v2_coco_2018_03_29.tar.gz                           100%[============================================================================================================================================================================>] 179.22M  10.3MB/s    in 18s

2019-02-18 04:48:04 (10.1 MB/s) - ‘data/ssd_mobilenet_v2_coco_2018_03_29.tar.gz’ saved [187925923/187925923]

2019-02-18 04:48:08.325486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:48:08.325631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 3.74GiB
2019-02-18 04:48:08.325694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:48:09.640993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:48:09.641157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:48:09.641197: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:48:09.641522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:48:48.706244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:48:48.706398: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:48:48.706442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:48:48.706478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:48:48.706590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:49:00.113150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:49:00.113377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:49:00.113423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:49:00.113471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:49:00.113610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:49:04.441489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:49:04.441627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:49:04.441673: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:49:04.441713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:49:04.441841: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8062
Creating the session
2019-02-18 04:49:13.197948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:49:13.198120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:49:13.198175: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:49:13.198224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:49:13.198413: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with ORIGINAL model
Running the inference on a single image to warm up the net
2019-02-18 04:49:38.590984: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.53GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:49:38.607152: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.84GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 20.24 seconds
Running the benchmark
Average runtime: 0.084560 seconds

ssd_mobilenet_v2_coco with TRT:

$ python3 benchmark.py --model ssd_mobilenet_v2_coco --trt
Called with args: Namespace(model='ssd_mobilenet_v2_coco', use_trt=True)
2019-02-18 04:50:30.934503: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:50:30.934779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.41GiB
2019-02-18 04:50:30.934939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:50:32.242912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:50:32.243053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:50:32.243093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:50:32.243487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:51:11.865247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:11.865473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:11.865536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:51:11.865573: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:51:11.865702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:51:23.567995: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:23.568248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:23.568312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:51:23.568349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:51:23.568464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:51:27.799623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:27.799797: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:27.799840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:51:27.799877: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:51:27.799983: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8062
2019-02-18 04:51:45.670572: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-18 04:51:45.676597: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-02-18 04:51:45.680622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:45.680794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:45.680839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:51:45.680884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:51:45.681008: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:51:53.469662: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:51:53.470062: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:51:53.727136: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:51:53.727286: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 780 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV2/Conv/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:51:58.248570: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:51:58.249029: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:51:59.108786: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:891] Failed to register segment graphdef as a function 0: Invalid argument: Cannot add function 'my_trt_op_0_native_segment' because a different function with the same name already exists.
2019-02-18 04:52:00.692498: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:52:01.438146: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:52:01.613665: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-02-18 04:52:01.613843: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 6850 nodes (-1212), 8953 edges (-1820), time = 2828.16ms.
2019-02-18 04:52:01.613888: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 6865 nodes (15), 8979 edges (26), time = 835.079ms.
2019-02-18 04:52:01.613927: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 6865 nodes (0), 8979 edges (0), time = 3860.146ms.
2019-02-18 04:52:01.613970: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 6865 nodes (0), 8979 edges (0), time = 994.512ms.
2019-02-18 04:52:01.614012: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 6865 nodes (0), 8979 edges (0), time = 4454.34082ms.
2019-02-18 04:52:01.614078: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2019-02-18 04:52:01.614118: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 781 nodes (0), 883 edges (0), time = 624.612ms.
2019-02-18 04:52:01.614207: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 781 nodes (0), 883 edges (0), time = 396.25ms.
2019-02-18 04:52:01.614248: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 781 nodes (0), 883 edges (0), time = 49.671ms.
2019-02-18 04:52:01.614286: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 781 nodes (0), 883 edges (0), time = 695.233ms.
2019-02-18 04:52:01.614322: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 781 nodes (0), 883 edges (0), time = 53.073ms.
Total nodes in the optimized graph: 6865 out of which 0 are TRTEngineOp
Creating the session
2019-02-18 04:52:03.094518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:52:03.094648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:52:03.094692: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:52:03.094730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:52:03.094885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with TRT model
Running the inference on a single image to warm up the net
2019-02-18 04:52:38.719800: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.84GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 31.43 seconds
Running the benchmark
Average runtime: 0.093455 seconds

original ssd_inception_v2_coco:

$ python3 benchmark.py --model ssd_inception_v2_coco
Called with args: Namespace(model='ssd_inception_v2_coco', use_trt=False)
2019-02-18 04:10:13.974149: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:10:13.974348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.29GiB
2019-02-18 04:10:13.974412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:10:15.398904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:10:15.399058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:10:15.399103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:10:15.399360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:10:58.050991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:10:58.051141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:10:58.051184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:10:58.051231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:10:58.051349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:11:10.699534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:11:10.699689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:11:10.699743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:11:10.699792: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:11:10.699910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:11:15.841888: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:11:15.842028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:11:15.842070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:11:15.842106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:11:15.842217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_inception_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8278
Creating the session
2019-02-18 04:11:26.078042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:11:26.078225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:11:26.078275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:11:26.078319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:11:26.078547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with ORIGINAL model
Running the inference on a single image to warm up the net
2019-02-18 04:11:57.791649: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 26.46 seconds
Running the benchmark
Average runtime: 0.100977 seconds

ssd_inception_v2_coco with TRT:

nvidia@dpx2tegraa-lund:~/dariusz/projects/nvidia/tf_trt_models$ python3 benchmark.py --model ssd_inception_v2_coco --trt
Called with args: Namespace(model='ssd_inception_v2_coco', use_trt=True)
2019-02-18 04:18:05.555255: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:18:05.555595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.54GiB
2019-02-18 04:18:05.555680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:18:07.756886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:18:07.757042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:18:07.757086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:18:07.757460: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:18:50.782544: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:18:50.782700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:18:50.782752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:18:50.782788: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:18:50.782916: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:19:03.670306: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:19:03.670446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:19:03.670493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:19:03.670534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:19:03.670669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:19:08.911273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:19:08.911411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:19:08.911473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:19:08.911513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:19:08.911773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_inception_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8278
2019-02-18 04:19:29.284995: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-18 04:19:29.290733: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-02-18 04:19:29.295736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:19:29.295909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:19:29.295956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:19:29.295996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:19:29.296253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:19:38.501962: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:19:38.502486: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:19:38.903763: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:19:38.903994: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 931 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:19:45.094916: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:19:45.095409: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:19:46.228436: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:19:46.228742: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 931 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:19:48.180974: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:19:48.992334: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:19:49.170470: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-02-18 04:19:49.170642: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 7000 nodes (-1278), 9181 edges (-1886), time = 3424.68ms.
2019-02-18 04:19:49.170691: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 7025 nodes (25), 9207 edges (26), time = 979.995ms.
2019-02-18 04:19:49.170731: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 7025 nodes (0), 9207 edges (0), time = 4509.50488ms.
2019-02-18 04:19:49.170900: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 7015 nodes (-10), 9207 edges (0), time = 1884.54199ms.
2019-02-18 04:19:49.170941: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 7015 nodes (0), 9207 edges (0), time = 5634.18701ms.
2019-02-18 04:19:49.170979: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2019-02-18 04:19:49.171017: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 932 nodes (0), 1112 edges (0), time = 784.723ms.
2019-02-18 04:19:49.171070: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Invalid argument: The graph is already optimized by layout optimizer.
2019-02-18 04:19:49.171108: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 932 nodes (0), 1112 edges (0), time = 53.572ms.
2019-02-18 04:19:49.171146: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 932 nodes (0), 1112 edges (0), time = 757.245ms.
2019-02-18 04:19:49.171181: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 932 nodes (0), 1112 edges (0), time = 54.12ms.
Total nodes in the optimized graph: 7015 out of which 0 are TRTEngineOp
Creating the session
2019-02-18 04:19:51.273352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:19:51.273486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:19:51.273532: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 04:19:51.273572: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 04:19:51.273691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with TRT model
Running the inference on a single image to warm up the net
2019-02-18 04:20:35.683734: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 38.11 seconds
Running the benchmark
Average runtime: 0.106853 seconds
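
For reference, the "out of which 0 are TRTEngineOp" figure in the logs above is just a count of node op types in the graph. A minimal sketch of such a count is shown here; it is not the actual benchmark.py code, and the helper name `summarize` is made up for illustration. It only assumes that each node exposes an `op` attribute, as the entries of a TensorFlow `graph_def.node` do.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(nodes, op_type="TRTEngineOp"):
    """Return (total node count, number of nodes whose op equals op_type)."""
    nodes = list(nodes)
    counts = Counter(node.op for node in nodes)
    # Counter returns 0 for op types that never occur.
    return len(nodes), counts[op_type]


# Stand-in nodes for illustration; with TensorFlow one would pass graph_def.node.
demo_nodes = [
    SimpleNamespace(op="Conv2D"),
    SimpleNamespace(op="FusedBatchNorm"),
    SimpleNamespace(op="Relu6"),
]
total, trt_ops = summarize(demo_nodes)
print("Total nodes: %d out of which %d are TRTEngineOp" % (total, trt_ops))
```

A count of zero after conversion means every candidate segment fell back to native TensorFlow ops, which matches the unchanged runtimes.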

dariusz.filipski · February 18, 2019, 2:48pm

Some additional information: when I enable Python logging by adding the following code at the very beginning of main() in benchmark.py:

    # Requires `import logging` and `import sys` at the top of benchmark.py.
    logging.basicConfig(level=logging.DEBUG,
                        format='%(asctime)s.%(msecs)03d %(levelname)-8s %(threadName)-10s %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S',
                        handlers=[
                            # mode 'w' overwrites the log file, 'a' would append
                            logging.FileHandler('benchmark.log', 'w'),
                            logging.StreamHandler(sys.stdout)
                        ])
    logger = logging.getLogger(__name__)
    # Ask the tensorflow logger not to propagate logs to the parent
    # (which causes duplicated logging)
    logging.getLogger('tensorflow').propagate = False

I see that TensorFlow claims it runs against TensorRT version 4.0.0, even though I have version 4.1.1 installed (see above for environment details). TensorFlow was built on the very same machine with no changes to TensorRT whatsoever.

Total nodes in the original graph: 8062
INFO:tensorflow:Running against TensorRT version 4.0.0
2019-02-18 06:38:11.224120: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-18 06:38:11.233967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 06:38:11.234015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-18 06:38:11.234055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-18 06:38:11.234180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3849 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)

As you can see, it also claims that the number of eligible GPUs is zero, but still creates a TensorFlow device.
I tried the same with

export TF_MIN_GPU_MULTIPROCESSOR_COUNT=2

but there was no difference.

Do the TensorRT version mismatch and the lack of eligible GPUs matter in this case?
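
The version mismatch described above can be checked mechanically. The following is only a sketch: it parses the "Running against TensorRT version" line from captured log text and compares it with a version string supplied by hand (4.1.1 here, taken from this post). The function names are hypothetical, and the check says nothing about whether the mismatch is the cause of the conversion failures.

```python
import re
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '4.1.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def reported_trt_version(log_text: str) -> Optional[Tuple[int, ...]]:
    """Extract the TensorRT version that TensorFlow reports in its log, if any."""
    match = re.search(r"Running against TensorRT version (\d+(?:\.\d+)*)", log_text)
    return parse_version(match.group(1)) if match else None


def versions_differ(log_text: str, installed: str) -> bool:
    """True when the log reports a version different from the installed one."""
    reported = reported_trt_version(log_text)
    return reported is not None and reported != parse_version(installed)


log_line = "INFO:tensorflow:Running against TensorRT version 4.0.0"
print(reported_trt_version(log_line), versions_differ(log_line, "4.1.1"))
```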

dariusz.filipski · February 19, 2019, 1:50pm

Another question on this: looking closely at the logs, one can see errors that cause the creation of my_trt_op_0 to be skipped:

2019-02-18 04:51:53.727136: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:51:53.727286: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 780 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV2/Conv/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:51:58.248570: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:51:58.249029: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:51:59.108786: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:891] Failed to register segment graphdef as a function 0: Invalid argument: Cannot add function 'my_trt_op_0_native_segment' because a different function with the same name already exists.

Why does this happen, and how can I avoid it?
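
Since these warnings are easy to miss among the device messages, here is a rough sketch that pulls the skipped-engine warnings out of captured log text. The regular expression is written against the exact wording of the log lines quoted above and is an assumption about their format, not something TensorFlow guarantees; the names `EngineFailure` and `find_engine_failures` are invented for this example.

```python
import re
from typing import List, NamedTuple


class EngineFailure(NamedTuple):
    engine: str      # e.g. my_trt_op_0
    segment: int     # segment index reported by the converter
    node_count: int  # number of nodes in the failed segment
    reason: str      # error text between "failed: " and ". Skipping"


# Matches the "Engine ... creation for segment ... failed: ... Skipping..." warning.
_FAILURE_RE = re.compile(
    r"Engine (?P<engine>\S+) creation for segment (?P<segment>\d+), "
    r"composed of (?P<nodes>\d+) nodes failed: (?P<reason>.*?)\. Skipping"
)


def find_engine_failures(log_text: str) -> List[EngineFailure]:
    """Return one record per skipped TRT engine found in the log text."""
    return [
        EngineFailure(
            m.group("engine"),
            int(m.group("segment")),
            int(m.group("nodes")),
            m.group("reason"),
        )
        for m in _FAILURE_RE.finditer(log_text)
    ]


sample = (
    "2019-02-18 04:51:53.727286: W tensorflow/contrib/tensorrt/convert/"
    "convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of "
    "780 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, "
    "at: FeatureExtractor/MobilenetV2/Conv/BatchNorm/FusedBatchNorm. Skipping..."
)
for failure in find_engine_failures(sample):
    print(failure.engine, failure.node_count, failure.reason)
```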

NVES · February 21, 2019, 9:46pm

Hello,

Using benchmark.py on a TX2 with JetPack 3.3, I’m seeing performance improvements with TRT vs. TF:

Model                   TF (s)      TRT (s)
ssd_mobilenet_v1_coco   0.049736    0.036951
ssd_mobilenet_v2_coco   0.102131    0.042651
ssd_inception_v2_coco   0.1101      0.040059
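
To compare these numbers with the Drive PX 2 results, the small sketch below converts average runtimes into speedup factors. The TX2 values are the ones in the table above, and the Drive PX 2 values are the averages reported in the original post; the function name `speedup` is only an illustrative choice.

```python
def speedup(baseline_s: float, optimized_s: float) -> float:
    """Ratio of baseline to optimized runtime; values above 1.0 mean TRT is faster."""
    if optimized_s <= 0:
        raise ValueError("optimized runtime must be positive")
    return baseline_s / optimized_s


# (TF runtime, TRT runtime) in seconds per inference.
tx2 = {
    "ssd_mobilenet_v1_coco": (0.049736, 0.036951),
    "ssd_mobilenet_v2_coco": (0.102131, 0.042651),
    "ssd_inception_v2_coco": (0.1101, 0.040059),
}
drive_px2 = {
    "ssd_mobilenet_v1_coco": (0.051792, 0.053618),
    "ssd_mobilenet_v2_coco": (0.084560, 0.093455),
    "ssd_inception_v2_coco": (0.100977, 0.106853),
}

for name, results in (("TX2", tx2), ("Drive PX 2", drive_px2)):
    for model, (tf_s, trt_s) in results.items():
        print("%-11s %-22s %.2fx" % (name, model, speedup(tf_s, trt_s)))
```

On the TX2 every factor comes out above 1.0, while on the Drive PX 2 all three are slightly below 1.0, consistent with no TRTEngineOp being created there.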

My GPU

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X2"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    6.2
  Total amount of global memory:                 7846 MBytes (8227401728 bytes)
  ( 2) Multiprocessors, (128) CUDA Cores/MP:     256 CUDA Cores
  GPU Max Clock rate:                            1301 MHz (1.30 GHz)
  Memory Clock rate:                             1600 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS

TRT config

nvidia@tegra-ubuntu:/usr/local/cuda/samples/1_Utilities/deviceQuery$ dpkg -l | grep nvinfer
ii  libnvinfer-dev                              4.1.3-1+cuda9.0                               arm64        TensorRT development libraries and headers
ii  libnvinfer-samples                          4.1.3-1+cuda9.0                               arm64        TensorRT samples and documentation
ii  libnvinfer4                                 4.1.3-1+cuda9.0                               arm64        TensorRT runtime libraries

TensorFlow config

root@tegra-ubuntu:/home/scratch.zhenyih_sw/jetson/tf_trt_models# python
Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
>>> tensorflow.__version__
'1.11.0'

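The "no eligible GPU" message (grappler's "Number of eligible GPUs (core count >= 8): 0") is expected here, since the Tegra X2 reports only 2 multiprocessors. I’m not seeing the op-skipping messages that you are seeing.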

root@tegra-ubuntu:/home/scratch.zhenyih_sw/jetson/tf_trt_models# python benchmark.py --model ssd_inception_v2_coco --trt
Called with args: Namespace(model='ssd_inception_v2_coco', use_trt=True)
2019-02-21 21:28:16.221037: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-21 21:28:16.221172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.66GiB freeMemory: 2.57GiB
2019-02-21 21:28:16.221230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:28:17.428181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:28:17.428280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2019-02-21 21:28:17.428309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2019-02-21 21:28:17.428595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python2.7/site-packages/object_detection-0.1-py2.7.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-21 21:29:11.710866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:29:11.711111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:29:11.711154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2019-02-21 21:29:11.711184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2019-02-21 21:29:11.711289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-21 21:30:03.883130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:30:03.883377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:30:03.883438: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2019-02-21 21:30:03.883470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2019-02-21 21:30:03.883586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-21 21:30:18.984016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:30:18.984116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:30:18.984149: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2019-02-21 21:30:18.984172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2019-02-21 21:30:18.984279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_inception_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8278
2019-02-21 21:31:22.059748: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-21 21:31:22.060275: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-02-21 21:31:22.060798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:31:22.060938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:31:22.060997: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2019-02-21 21:31:22.061025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2019-02-21 21:31:22.061138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-21 21:31:34.270468: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2952] Segment @scope '', converted to graph
2019-02-21 21:31:34.270783: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-21 21:33:22.447185: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:952] Engine my_trt_op_0 creation for segment 0, composed of 931 nodes succeeded.
2019-02-21 21:33:27.242045: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-21 21:33:27.841326: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-21 21:33:28.027736: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:404] Optimization results for grappler item: tf_graph
2019-02-21 21:33:28.027989: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   constant folding: Graph size after: 7000 nodes (-1278), 9181 edges (-1886), time = 2841.95898ms.
2019-02-21 21:33:28.028026: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   layout: Graph size after: 7025 nodes (25), 9207 edges (26), time = 754.749ms.
2019-02-21 21:33:28.028052: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   TensorRTOptimizer: Graph size after: 6095 nodes (-930), 8096 edges (-1111), time = 111755.695ms.
2019-02-21 21:33:28.028081: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   constant folding: Graph size after: 6085 nodes (-10), 8096 edges (0), time = 1223.552ms.
2019-02-21 21:33:28.028106: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   TensorRTOptimizer: Graph size after: 6085 nodes (0), 8096 edges (0), time = 2281.38696ms.
2019-02-21 21:33:28.028129: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:404] Optimization results for grappler item: my_trt_op_0_native_segment
2019-02-21 21:33:28.028154: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   constant folding: Graph size after: 932 nodes (0), 1112 edges (0), time = 566.961ms.
2019-02-21 21:33:28.028183: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   layout: Invalid argument: The graph is already optimized by layout optimizer.
2019-02-21 21:33:28.028215: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   TensorRTOptimizer: Graph size after: 932 nodes (0), 1112 edges (0), time = 57.022ms.
2019-02-21 21:33:28.028240: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   constant folding: Graph size after: 932 nodes (0), 1112 edges (0), time = 541.715ms.
2019-02-21 21:33:28.028264: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406]   TensorRTOptimizer: Graph size after: 932 nodes (0), 1112 edges (0), time = 55.732ms.
Total nodes in the optimized graph: 6085 out of which 1 are TRTEngineOp
Creating the session
2019-02-21 21:35:02.447477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:35:02.447599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:35:02.447636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2019-02-21 21:35:02.447668: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2019-02-21 21:35:02.447782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with TRT model
Running the inference on a single image to warm up the net
Runtime: 11.35 seconds
Running the benchmark
Average runtime: 0.040059 seconds
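
To compare a run like this one against a failing run without reading the whole log, the two summary lines can be pulled out with a short script. This is only a minimal sketch: it assumes the summary lines are worded exactly as in the output above, and the helper name summarize_benchmark_log is hypothetical (it is not part of benchmark.py or tf_trt_models).

```python
import re

# Patterns match the two summary lines printed in the log above.
NODES_RE = re.compile(
    r"Total nodes in the optimized graph: (\d+) out of which (\d+) are TRTEngineOp"
)
RUNTIME_RE = re.compile(r"Average runtime: ([0-9.]+) seconds")


def summarize_benchmark_log(text):
    """Hypothetical helper: extract node counts and average runtime from a log.

    Any value whose line is missing from the log is left as None.
    """
    summary = {"nodes": None, "trt_engine_ops": None, "avg_runtime_s": None}
    nodes = NODES_RE.search(text)
    if nodes:
        summary["nodes"] = int(nodes.group(1))
        summary["trt_engine_ops"] = int(nodes.group(2))
    runtime = RUNTIME_RE.search(text)
    if runtime:
        summary["avg_runtime_s"] = float(runtime.group(1))
    return summary


# Sample text copied from the summary lines of the log above.
sample = """Total nodes in the optimized graph: 6085 out of which 1 are TRTEngineOp
Running the benchmark
Average runtime: 0.040059 seconds
"""
result = summarize_benchmark_log(sample)
print(result)
```

A run where trt_engine_ops comes back as 0 means that no segment was converted, which matches the symptom described in the original post.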

I’d recommend reflashing your Xavier to resolve any mixed TensorRT version issues, then following the instructions at https://github.com/NVIDIA-AI-IOT/tf_trt_models.
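
Before reflashing, one quick sanity check is whether the installed libnvinfer packages all report the same version. The sketch below parses text in the format of the `dpkg -l | grep nvinfer` output shown above; the function names are hypothetical. It only compares the installed packages with each other and says nothing about which TensorRT version TensorFlow was built against.

```python
def nvinfer_versions(dpkg_output):
    """Hypothetical helper: map each installed nvinfer package to its version."""
    versions = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        # dpkg -l rows: status, package name, version, architecture, description
        if len(fields) >= 3 and fields[0] == "ii" and "nvinfer" in fields[1]:
            versions[fields[1]] = fields[2]
    return versions


def has_mixed_versions(versions):
    """Return True when the packages do not all share one version string."""
    return len(set(versions.values())) > 1


# Sample rows copied from the dpkg output above.
sample = """\
ii  libnvinfer-dev      4.1.3-1+cuda9.0  arm64  TensorRT development libraries and headers
ii  libnvinfer-samples  4.1.3-1+cuda9.0  arm64  TensorRT samples and documentation
ii  libnvinfer4         4.1.3-1+cuda9.0  arm64  TensorRT runtime libraries
"""
versions = nvinfer_versions(sample)
print(versions)
print("mixed versions:", has_mixed_versions(versions))
```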
