I can’t get any improvement from TensorRT on a Drive PX 2 AutoChauffeur (P2379, the one without a dGPU). I simply cloned your Jetson example from https://github.com/NVIDIA-AI-IOT/tf_trt_models and created a benchmark.py script, which is little more than a copy-paste from https://github.com/NVIDIA-AI-IOT/tf_trt_models/blob/master/examples/detection/detection.ipynb. Since the Jetson TX2 has specs similar to one node of my Drive PX 2, I expected values and improvements similar to those shown in the table at https://github.com/NVIDIA-AI-IOT/tf_trt_models#models-1
Unfortunately, in my case I see no difference in inference speed between the original models and TensorRT ones (I could even argue there’s a slight drop in performance). Here’s what I see (full logs below):
ssd_mobilenet_v1_coco: Original - 0.051792s, TRT - 0.053618s
ssd_mobilenet_v2_coco: Original - 0.084560s, TRT - 0.093455s
ssd_inception_v2_coco: Original - 0.100977s, TRT - 0.106853s
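To put the "slight drop" into numbers, here is a small sketch that computes the relative change per model from the three pairs of timings above. The helper name relative_change is only for illustration; the figures are copied from the list as-is.

```python
# Average per-image runtimes in seconds (original, TRT), copied from the results above.
runtimes = {
    "ssd_mobilenet_v1_coco": (0.051792, 0.053618),
    "ssd_mobilenet_v2_coco": (0.084560, 0.093455),
    "ssd_inception_v2_coco": (0.100977, 0.106853),
}


def relative_change(original, trt):
    """Return the change of the TRT runtime relative to the original, in percent."""
    return (trt - original) / original * 100.0


changes = {name: relative_change(o, t) for name, (o, t) in runtimes.items()}
for name, pct in changes.items():
    # A positive value means the TRT run was slower than the original.
    print("{}: {:+.1f}%".format(name, pct))
```

With these inputs the TRT runs come out roughly 3.5%, 10.5% and 5.8% slower, respectively.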
Taking a closer look, it seems that TRT slims down the graph by ~1000 nodes but fails to place anything into a TRTEngineOp:
ssd_mobilenet_v1_coco: Original - 7571 nodes, TRT - 6518 nodes out of which 0 are TRTEngineOp
ssd_mobilenet_v2_coco: Original - 8062 nodes, TRT - 6865 nodes out of which 0 are TRTEngineOp
ssd_inception_v2_coco: Original - 8278 nodes, TRT - 7015 nodes out of which 0 are TRTEngineOp
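Beyond the single TRTEngineOp count, a per-op histogram might help show which op types are left in the graph. This is a minimal sketch, assuming only that each node exposes an `op` attribute (as GraphDef nodes do); the stand-in nodes below are made up for illustration, and in benchmark.py the input would be trt_graph.node.

```python
from collections import Counter
from types import SimpleNamespace


def op_histogram(nodes):
    """Count how many nodes of each op type the graph contains."""
    return Counter(str(n.op) for n in nodes)


# Stand-in nodes for illustration only; in benchmark.py pass trt_graph.node instead.
demo_nodes = [
    SimpleNamespace(op=op)
    for op in ("Conv2D", "FusedBatchNorm", "Conv2D", "Relu6")
]

hist = op_histogram(demo_nodes)
print(hist.most_common(3))
# Counter returns 0 for missing keys, so this is safe even when no engine was built.
print("TRTEngineOp nodes:", hist["TRTEngineOp"])
```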
I see errors like the following one in the logs as well:
Engine my_trt_op_0 creation for segment 0, composed of 434 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/FusedBatchNorm. Skipping...
What’s wrong? How can I make TensorRT work?
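To pull every failed engine out of a long log, a regular expression along these lines could be used. It is only a sketch, written against the one message format quoted above; the pattern and the function name find_engine_failures are my own, and other TensorFlow versions may word the message differently.

```python
import re

# Matches the "Engine ... creation for segment ... failed" warning quoted above.
ENGINE_FAILURE = re.compile(
    r"Engine (?P<engine>\S+) creation for segment (?P<segment>\d+), "
    r"composed of (?P<nodes>\d+) nodes failed: (?P<reason>.*?)\. Skipping\.\.\."
)


def find_engine_failures(log_text):
    """Return one dict per failed engine found in the log text."""
    return [m.groupdict() for m in ENGINE_FAILURE.finditer(log_text)]


# The sample is the error line from this post, copied verbatim.
sample = (
    "Engine my_trt_op_0 creation for segment 0, composed of 434 nodes failed: "
    "Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: "
    "FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/FusedBatchNorm. "
    "Skipping..."
)

failures = find_engine_failures(sample)
for failure in failures:
    print(failure["engine"], failure["nodes"], failure["reason"])
```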
My configuration
TensorFlow 1.12.0 built from source with TRT support.
Protobuf updated according to https://devtalk.nvidia.com/default/topic/1046492/tensorrt/extremely-long-time-to-load-trt-optimized-frozen-tf-graphs/post/5315675/#5315675
$ protoc --version
libprotoc 3.6.1
TensorRT config:
$ dpkg -l | grep nvinfer
ii libnvinfer-dev 4.1.1-1+cuda9.2 arm64 TensorRT development libraries and headers
ii libnvinfer-samples 4.1.1-1+cuda9.2 arm64 TensorRT samples and documentation
ii libnvinfer4 4.1.1-1+cuda9.2 arm64 TensorRT runtime libraries
GPU data:
$ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X2"
CUDA Driver Version / Runtime Version 9.2 / 9.2
CUDA Capability Major/Minor version number: 6.2
Total amount of global memory: 6402 MBytes (6712545280 bytes)
( 2) Multiprocessors, (128) CUDA Cores/MP: 256 CUDA Cores
GPU Max Clock rate: 1275 MHz (1.27 GHz)
Memory Clock rate: 1600 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 1
Result = PASS
benchmark.py:
import argparse
from PIL import Image
import sys
import os
import urllib
import tensorflow.contrib.tensorrt as trt
#import matplotlib
#matplotlib.use('Agg')
#import matplotlib.pyplot as plt
#import matplotlib.patches as patches
import tensorflow as tf
import numpy as np
import time
from tf_trt_models.detection import download_detection_model, build_detection_graph

MODEL = 'ssd_inception_v2_coco'
DATA_DIR = './data/'
IMAGE_PATH = './examples/detection/data/huskies.jpg'


def parse_args():
    """Parse input arguments."""
    desc = ('TRT benchmark')
    parser = argparse.ArgumentParser(description=desc)
    parser.add_argument('--model', dest='model',
                        help='name of the object detection model [{}]'.format(MODEL),
                        default=MODEL, type=str)
    parser.add_argument('--trt', dest='use_trt',
                        help='build and test TensorRT model',
                        action='store_true')
    args = parser.parse_args()
    return args


def main():
    args = parse_args()
    print('Called with args: {}'.format(args))

    CONFIG_FILE = args.model + '.config'  # ./data/ssd_inception_v2_coco.config
    CHECKPOINT_FILE = 'model.ckpt'  # ./data/ssd_inception_v2_coco/model.ckpt

    config_path, checkpoint_path = download_detection_model(args.model, 'data')

    frozen_graph, input_names, output_names = build_detection_graph(
        config=config_path,
        checkpoint=checkpoint_path,
        score_threshold=0.3,
        batch_size=1
    )

    print('Model: {}'.format(args.model))
    print(output_names)
    print('Total nodes in the original graph: {}'.format(len([1 for n in frozen_graph.node])))

    if args.use_trt:
        trt_graph = trt.create_inference_graph(
            input_graph_def=frozen_graph,
            outputs=output_names,
            max_batch_size=1,
            max_workspace_size_bytes=1 << 25,
            precision_mode='FP16',
            minimum_segment_size=50
        )
        all_nodes = len([1 for n in trt_graph.node])
        trt_engine_nodes = len([1 for n in trt_graph.node if str(n.op) == 'TRTEngineOp'])
        print('Total nodes in the optimized graph: {} out of which {} are TRTEngineOp'.format(all_nodes, trt_engine_nodes))

    print('Creating the session')
    tf_config = tf.ConfigProto()
    tf_config.gpu_options.allow_growth = True
    tf_sess = tf.Session(config=tf_config)

    if args.use_trt:
        print('Running with TRT model')
        tf.import_graph_def(trt_graph, name='')
    else:
        print('Running with ORIGINAL model')
        tf.import_graph_def(frozen_graph, name='')

    tf_input = tf_sess.graph.get_tensor_by_name(input_names[0] + ':0')
    tf_scores = tf_sess.graph.get_tensor_by_name('detection_scores:0')
    tf_boxes = tf_sess.graph.get_tensor_by_name('detection_boxes:0')
    tf_classes = tf_sess.graph.get_tensor_by_name('detection_classes:0')
    tf_num_detections = tf_sess.graph.get_tensor_by_name('num_detections:0')

    image = Image.open(IMAGE_PATH)
    image_resized = np.array(image.resize((300, 300)))
    image = np.array(image)

    print('Running the inference on a single image to warm up the net')
    t0 = time.time()
    scores, boxes, classes, num_detections = tf_sess.run([tf_scores, tf_boxes, tf_classes, tf_num_detections], feed_dict={
        tf_input: image_resized[None, ...]
    })
    t1 = time.time()
    print('Runtime: {:.2f} seconds'.format(t1 - t0))

    boxes = boxes[0]  # index by 0 to remove batch dimension
    scores = scores[0]
    classes = classes[0]
    num_detections = num_detections[0]

    print('Running the benchmark')
    num_samples = 50
    t0 = time.time()
    for i in range(num_samples):
        scores, boxes, classes, num_detections = tf_sess.run([tf_scores, tf_boxes, tf_classes, tf_num_detections], feed_dict={
            tf_input: image_resized[None, ...]
        })
    t1 = time.time()
    print('Average runtime: %f seconds' % (float(t1 - t0) / num_samples))

    tf_sess.close()


if __name__ == '__main__':
    main()
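The script above reports a single average over 50 runs. As a possible refinement (not what I actually ran), the timing loop could record each iteration separately so that the median and spread are visible as well. In this sketch the helper name benchmark and the dummy workload are placeholders; in benchmark.py the callable would wrap the tf_sess.run(...) call.

```python
import statistics
import time


def benchmark(fn, warmup=1, runs=50):
    """Call fn repeatedly and return per-run timing statistics in seconds."""
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)

    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }


# Dummy workload for illustration; replace with a function that runs the inference.
stats = benchmark(lambda: sum(range(10000)), warmup=2, runs=20)
print("mean {mean:.6f}s, median {median:.6f}s, min {min:.6f}s, max {max:.6f}s".format(**stats))
```

Comparing the median rather than the mean should make the original-versus-TRT numbers less sensitive to a few slow iterations.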
FULL LOGS
original ssd_mobilenet_v1_coco:
$ python3 benchmark.py --model ssd_mobilenet_v1_coco
Called with args: Namespace(model='ssd_mobilenet_v1_coco', use_trt=False)
--2019-02-18 04:37:35-- http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_coco_2018_01_28.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.217.20.48, 2a00:1450:4005:80a::2010
Connecting to download.tensorflow.org (download.tensorflow.org)|172.217.20.48|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76541073 (73M) [application/x-tar]
Saving to: ‘data/ssd_mobilenet_v1_coco_2018_01_28.tar.gz’
data/ssd_mobilenet_v1_coco_2018_01_28.tar.gz 100%[============================================================================================================================================================================>] 73.00M 10.9MB/s in 6.8s
2019-02-18 04:37:42 (10.7 MB/s) - ‘data/ssd_mobilenet_v1_coco_2018_01_28.tar.gz’ saved [76541073/76541073]
2019-02-18 04:37:44.136212: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:37:44.136411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.42GiB
2019-02-18 04:37:44.136531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:37:46.326887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:37:46.327035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:37:46.327095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:37:46.328349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:38:20.291676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:38:20.291834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:38:20.291934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:38:20.291984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:38:20.292106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:38:30.064116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:38:30.064307: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:38:30.064367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:38:30.064408: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:38:30.064522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:38:33.110725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:38:33.110866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:38:33.110909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:38:33.110947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:38:33.111053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v1_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 7571
Creating the session
2019-02-18 04:38:40.770776: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:38:40.770951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:38:40.771024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:38:40.771075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:38:40.771218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with ORIGINAL model
Running the inference on a single image to warm up the net
2019-02-18 04:38:59.904909: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:39:00.059551: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:39:00.245903: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:39:00.595751: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.26GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 15.87 seconds
Running the benchmark
Average runtime: 0.051792 seconds
ssd_mobilenet_v1_coco with TRT:
$ python3 benchmark.py --model ssd_mobilenet_v1_coco --trt
Called with args: Namespace(model='ssd_mobilenet_v1_coco', use_trt=True)
2019-02-18 04:45:21.845684: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:45:21.845872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.14GiB
2019-02-18 04:45:21.845993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:45:23.202972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:45:23.203113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:45:23.203161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:45:23.203376: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:45:57.904372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:45:57.904526: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:45:57.904569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:45:57.904611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:45:57.904770: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:46:07.725793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:07.725949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:07.725994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:46:07.726033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:46:07.726171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:46:10.766851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:10.766994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:10.767038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:46:10.767077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:46:10.767207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v1_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 7571
2019-02-18 04:46:26.024094: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-18 04:46:26.030507: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-02-18 04:46:26.037184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:26.037433: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:26.037485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:46:26.037524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:46:26.037659: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:46:32.355050: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:46:32.355353: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:46:32.494402: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:46:32.494563: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 434 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:46:36.095038: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:46:36.095299: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:46:36.370087: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:891] Failed to register segment graphdef as a function 0: Invalid argument: Cannot add function 'my_trt_op_0_native_segment' because a different function with the same name already exists.
2019-02-18 04:46:37.200207: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:46:37.444711: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:46:37.532859: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-02-18 04:46:37.533062: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 6503 nodes (-1068), 8572 edges (-1676), time = 2288.70093ms.
2019-02-18 04:46:37.533107: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 6518 nodes (15), 8598 edges (26), time = 761.058ms.
2019-02-18 04:46:37.533146: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 6518 nodes (0), 8598 edges (0), time = 3022.36499ms.
2019-02-18 04:46:37.533186: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 6518 nodes (0), 8598 edges (0), time = 802.444ms.
2019-02-18 04:46:37.533224: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 6518 nodes (0), 8598 edges (0), time = 3086.44897ms.
2019-02-18 04:46:37.533343: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2019-02-18 04:46:37.533435: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 435 nodes (0), 503 edges (0), time = 267.055ms.
2019-02-18 04:46:37.533475: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 435 nodes (0), 503 edges (0), time = 155.322ms.
2019-02-18 04:46:37.533512: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 435 nodes (0), 503 edges (0), time = 26.109ms.
2019-02-18 04:46:37.533549: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 435 nodes (0), 503 edges (0), time = 217.434ms.
2019-02-18 04:46:37.533584: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 435 nodes (0), 503 edges (0), time = 26.192ms.
Total nodes in the optimized graph: 6518 out of which 0 are TRTEngineOp
Creating the session
2019-02-18 04:46:38.540298: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:38.540442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:38.540527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:46:38.540569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:46:38.540681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with TRT model
Running the inference on a single image to warm up the net
2019-02-18 04:46:55.954166: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:46:56.109469: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:46:56.295034: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:46:56.642952: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.26GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 13.62 seconds
Running the benchmark
Average runtime: 0.053618 seconds
original ssd_mobilenet_v2_coco:
$ python3 benchmark.py --model ssd_mobilenet_v2_coco
Called with args: Namespace(model='ssd_mobilenet_v2_coco', use_trt=False)
--2019-02-18 04:47:46-- http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_coco_2018_03_29.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.217.20.48, 2a00:1450:400f:806::2010
Connecting to download.tensorflow.org (download.tensorflow.org)|172.217.20.48|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 187925923 (179M) [application/x-tar]
Saving to: ‘data/ssd_mobilenet_v2_coco_2018_03_29.tar.gz’
data/ssd_mobilenet_v2_coco_2018_03_29.tar.gz 100%[============================================================================================================================================================================>] 179.22M 10.3MB/s in 18s
2019-02-18 04:48:04 (10.1 MB/s) - ‘data/ssd_mobilenet_v2_coco_2018_03_29.tar.gz’ saved [187925923/187925923]
2019-02-18 04:48:08.325486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:48:08.325631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 3.74GiB
2019-02-18 04:48:08.325694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:48:09.640993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:48:09.641157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:48:09.641197: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:48:09.641522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:48:48.706244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:48:48.706398: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:48:48.706442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:48:48.706478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:48:48.706590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:49:00.113150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:49:00.113377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:49:00.113423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:49:00.113471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:49:00.113610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:49:04.441489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:49:04.441627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:49:04.441673: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:49:04.441713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:49:04.441841: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8062
Creating the session
2019-02-18 04:49:13.197948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:49:13.198120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:49:13.198175: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:49:13.198224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:49:13.198413: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with ORIGINAL model
Running the inference on a single image to warm up the net
2019-02-18 04:49:38.590984: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.53GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:49:38.607152: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.84GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 20.24 seconds
Running the benchmark
Average runtime: 0.084560 seconds
ssd_mobilenet_v2_coco with TRT:
$ python3 benchmark.py --model ssd_mobilenet_v2_coco --trt
Called with args: Namespace(model='ssd_mobilenet_v2_coco', use_trt=True)
2019-02-18 04:50:30.934503: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:50:30.934779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.41GiB
2019-02-18 04:50:30.934939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:50:32.242912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:50:32.243053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:50:32.243093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:50:32.243487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:51:11.865247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:11.865473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:11.865536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:51:11.865573: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:51:11.865702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:51:23.567995: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:23.568248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:23.568312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:51:23.568349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:51:23.568464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:51:27.799623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:27.799797: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:27.799840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:51:27.799877: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:51:27.799983: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8062
2019-02-18 04:51:45.670572: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-18 04:51:45.676597: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-02-18 04:51:45.680622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:45.680794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:45.680839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:51:45.680884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:51:45.681008: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:51:53.469662: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:51:53.470062: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:51:53.727136: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:51:53.727286: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 780 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV2/Conv/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:51:58.248570: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:51:58.249029: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:51:59.108786: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:891] Failed to register segment graphdef as a function 0: Invalid argument: Cannot add function 'my_trt_op_0_native_segment' because a different function with the same name already exists.
2019-02-18 04:52:00.692498: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:52:01.438146: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:52:01.613665: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-02-18 04:52:01.613843: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 6850 nodes (-1212), 8953 edges (-1820), time = 2828.16ms.
2019-02-18 04:52:01.613888: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 6865 nodes (15), 8979 edges (26), time = 835.079ms.
2019-02-18 04:52:01.613927: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 6865 nodes (0), 8979 edges (0), time = 3860.146ms.
2019-02-18 04:52:01.613970: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 6865 nodes (0), 8979 edges (0), time = 994.512ms.
2019-02-18 04:52:01.614012: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 6865 nodes (0), 8979 edges (0), time = 4454.34082ms.
2019-02-18 04:52:01.614078: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2019-02-18 04:52:01.614118: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 781 nodes (0), 883 edges (0), time = 624.612ms.
2019-02-18 04:52:01.614207: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 781 nodes (0), 883 edges (0), time = 396.25ms.
2019-02-18 04:52:01.614248: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 781 nodes (0), 883 edges (0), time = 49.671ms.
2019-02-18 04:52:01.614286: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 781 nodes (0), 883 edges (0), time = 695.233ms.
2019-02-18 04:52:01.614322: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 781 nodes (0), 883 edges (0), time = 53.073ms.
Total nodes in the optimized graph: 6865 out of which 0 are TRTEngineOp
Creating the session
2019-02-18 04:52:03.094518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:52:03.094648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:52:03.094692: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:52:03.094730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:52:03.094885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with TRT model
Running the inference on a single image to warm up the net
2019-02-18 04:52:38.719800: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.84GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 31.43 seconds
Running the benchmark
Average runtime: 0.093455 seconds
original ssd_inception_v2_coco:
$ python3 benchmark.py --model ssd_inception_v2_coco
Called with args: Namespace(model='ssd_inception_v2_coco', use_trt=False)
2019-02-18 04:10:13.974149: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:10:13.974348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.29GiB
2019-02-18 04:10:13.974412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:10:15.398904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:10:15.399058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:10:15.399103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:10:15.399360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:10:58.050991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:10:58.051141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:10:58.051184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:10:58.051231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:10:58.051349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:11:10.699534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:11:10.699689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:11:10.699743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:11:10.699792: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:11:10.699910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:11:15.841888: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:11:15.842028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:11:15.842070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:11:15.842106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:11:15.842217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_inception_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8278
Creating the session
2019-02-18 04:11:26.078042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:11:26.078225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:11:26.078275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:11:26.078319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:11:26.078547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with ORIGINAL model
Running the inference on a single image to warm up the net
2019-02-18 04:11:57.791649: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 26.46 seconds
Running the benchmark
Average runtime: 0.100977 seconds
ssd_inception_v2_coco with TRT:
$ python3 benchmark.py --model ssd_inception_v2_coco --trt
Called with args: Namespace(model='ssd_inception_v2_coco', use_trt=True)
2019-02-18 04:18:05.555255: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:18:05.555595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.54GiB
2019-02-18 04:18:05.555680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:18:07.756886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:18:07.757042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:18:07.757086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:18:07.757460: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:18:50.782544: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:18:50.782700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:18:50.782752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:18:50.782788: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:18:50.782916: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:19:03.670306: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:19:03.670446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:19:03.670493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:19:03.670534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:19:03.670669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:19:08.911273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:19:08.911411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:19:08.911473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:19:08.911513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:19:08.911773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_inception_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8278
2019-02-18 04:19:29.284995: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-18 04:19:29.290733: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-02-18 04:19:29.295736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:19:29.295909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:19:29.295956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:19:29.295996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:19:29.296253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:19:38.501962: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:19:38.502486: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:19:38.903763: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:19:38.903994: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 931 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:19:45.094916: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:19:45.095409: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:19:46.228436: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:19:46.228742: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 931 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:19:48.180974: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:19:48.992334: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:19:49.170470: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-02-18 04:19:49.170642: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 7000 nodes (-1278), 9181 edges (-1886), time = 3424.68ms.
2019-02-18 04:19:49.170691: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 7025 nodes (25), 9207 edges (26), time = 979.995ms.
2019-02-18 04:19:49.170731: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 7025 nodes (0), 9207 edges (0), time = 4509.50488ms.
2019-02-18 04:19:49.170900: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 7015 nodes (-10), 9207 edges (0), time = 1884.54199ms.
2019-02-18 04:19:49.170941: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 7015 nodes (0), 9207 edges (0), time = 5634.18701ms.
2019-02-18 04:19:49.170979: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2019-02-18 04:19:49.171017: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 932 nodes (0), 1112 edges (0), time = 784.723ms.
2019-02-18 04:19:49.171070: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Invalid argument: The graph is already optimized by layout optimizer.
2019-02-18 04:19:49.171108: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 932 nodes (0), 1112 edges (0), time = 53.572ms.
2019-02-18 04:19:49.171146: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 932 nodes (0), 1112 edges (0), time = 757.245ms.
2019-02-18 04:19:49.171181: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 932 nodes (0), 1112 edges (0), time = 54.12ms.
Total nodes in the optimized graph: 7015 out of which 0 are TRTEngineOp
Creating the session
2019-02-18 04:19:51.273352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:19:51.273486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:19:51.273532: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:19:51.273572: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:19:51.273691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with TRT model
Running the inference on a single image to warm up the net
2019-02-18 04:20:35.683734: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 38.11 seconds
Running the benchmark
Average runtime: 0.106853 seconds
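To put the average runtimes from the logs side by side, here is a minimal sketch in plain Python that computes the relative change between the original and the TRT run for each model. The runtimes are copied from the benchmark output in this post; the names `RUNTIMES` and `slowdown_percent` are made up for illustration and are not part of benchmark.py or tf_trt_models.

```python
# Average runtimes in seconds, copied from the benchmark logs in this post:
# (original model, TRT model). The dict and function names are illustrative only.
RUNTIMES = {
    "ssd_mobilenet_v1_coco": (0.051792, 0.053618),
    "ssd_mobilenet_v2_coco": (0.084560, 0.093455),
    "ssd_inception_v2_coco": (0.100977, 0.106853),
}


def slowdown_percent(original, trt):
    """Relative change of the TRT runtime versus the original, in percent.

    A positive value means the TRT run was slower than the original.
    """
    return (trt - original) / original * 100.0


for model, (original, trt) in RUNTIMES.items():
    change = slowdown_percent(original, trt)
    print("{}: original {:.6f}s, TRT {:.6f}s, change {:+.1f}%".format(
        model, original, trt, change))
```

All three changes come out positive, which is consistent with the logs reporting 0 TRTEngineOp nodes: the optimized graph runs entirely in native TensorFlow, so no speedup from TensorRT should be expected.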