Hello, guys.
I have a problem with TensorRT and its integration in TensorFlow (TF-TRT).
I’m trying to create a TensorRT engine for inception_v3 with TF-TRT.
My code:
import os
import time
import pickle
import cv2
import numpy as np
import tensorflow as tf
import matplotlib.image as mpl
import tensorflow.contrib.tensorrt as trt
from keras import backend as K
from keras.models import load_model
from sklearn.metrics import mean_squared_error
# load and prepare images
res = []
for image in sorted(list(os.walk('/images/2019_01_23/'))[0][2]):
    loaded_image = mpl.imread('/images/2019_01_23/' + image)
    res.append(cv2.resize(loaded_image, (299, 299))[:, :, :3].reshape(299, 299, 3) / 128.0 - 1)
res = np.array(res)
#load keras model
file_path = '/nn.hdf5'
sess = tf.Session()
K.set_session(sess)
K.set_learning_phase(0)
model = load_model(file_path, compile=False)
K.set_learning_phase(0)
output_name = model.output.op.name
input_name = model.input.op.name
graph_def = tf.graph_util.remove_training_nodes(
    tf.graph_util.convert_variables_to_constants(
        sess,
        sess.graph.as_graph_def(),
        [output_name]
    )
)
batch_size = 500
data = res[:batch_size].astype(np.float32)
# Inference with TF-TRT frozen graph workflow:
graph = tf.Graph()
with graph.as_default():
    with tf.Session() as sess:
        # Now you can create a TensorRT inference graph from your
        # frozen graph:
        trt_graph = trt.create_inference_graph(
            input_graph_def=graph_def,
            outputs=['concatenate_9/concat'],
            max_batch_size=500,
            max_workspace_size_bytes=8589934592,
            precision_mode='FP32'
        )
        # Import the TensorRT graph into a new graph and run:
        output_node = tf.import_graph_def(
            trt_graph,
            # {'input_1': tf.convert_to_tensor(data)},
            {'input_1': data},
            return_elements=['concatenate_9/concat']
        )
        t = time.time()
        val = sess.run([output_node[0].outputs[0]])
        print(time.time() - t)
In the logs I see the following error:
2019-01-25 12:27:44.218551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-25 12:27:44.218625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-25 12:27:44.218645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-25 12:27:44.218669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-25 12:27:44.218975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15099 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2019-01-25 12:27:44.882986: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 1
2019-01-25 12:27:44.885642: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-01-25 12:27:44.886134: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-25 12:27:44.886186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-25 12:27:44.886229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-25 12:27:44.886246: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-25 12:27:44.886538: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15099 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2019-01-25 12:27:45.549967: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-01-25 12:27:45.550043: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-01-25 12:27:46.139548: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:952] Engine my_trt_op_0 creation for segment 0, composed of 15 nodes succeeded.
2019-01-25 12:27:46.463974: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-01-25 12:27:46.468839: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-01-25 12:27:46.469590: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-01-25 12:27:46.469620: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 809 nodes (-196), 844 edges (-196), time = 133.293ms.
2019-01-25 12:27:46.469662: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 825 nodes (16), 846 edges (2), time = 54.398ms.
2019-01-25 12:27:46.469676: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 811 nodes (-14), 833 edges (-13), time = 785.82ms.
2019-01-25 12:27:46.469687: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 797 nodes (-14), 833 edges (0), time = 71.909ms.
2019-01-25 12:27:46.469696: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 797 nodes (0), 833 edges (0), time = 232.407ms.
2019-01-25 12:27:46.469705: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2019-01-25 12:27:46.469729: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 17 nodes (0), 15 edges (0), time = 6.53ms.
2019-01-25 12:27:46.469753: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 17 nodes (0), 15 edges (0), time = 1.818ms.
2019-01-25 12:27:46.469766: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 17 nodes (0), 15 edges (0), time = 0.152ms.
2019-01-25 12:27:46.469785: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 17 nodes (0), 15 edges (0), time = 4.705ms.
2019-01-25 12:27:46.469793: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 17 nodes (0), 15 edges (0), time = 0.118ms.
2019-01-25 12:28:06.229814: E tensorflow/core/common_runtime/executor.cc:624] Executor failed to create kernel. Invalid argument: Default MaxPoolingOp only supports NHWC on device type CPU
[[{{node import/max_pooling2d_1/MaxPool}} = MaxPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:GPU:0"](import/activation_3/Relu)]]
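Most of the log above is informational (severity `I`); the lines that matter are the ones marked `W` or `E`. A minimal sketch for pulling those out of a saved log follows. The helper name `filter_tf_log` is hypothetical (it is not part of TensorFlow), and the pattern assumes the line layout seen above: timestamp, one-letter severity, source file and line number, then the message.

```python
import re

# Prefix of a TensorFlow C++ log line, e.g.
# "2019-01-25 12:27:45.550043: E tensorflow/.../convert_graph.cc:418] message"
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>[IWEF]) (?P<source>\S+:\d+)\] (?P<message>.*)$"
)


def filter_tf_log(lines, levels=("W", "E", "F")):
    """Return (level, source, message) tuples for lines at the given severities.

    Lines that do not start with a timestamp (such as the "[[{{node ...}}]]"
    continuation of an error) do not match the pattern and are skipped.
    """
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") in levels:
            hits.append(
                (match.group("level"), match.group("source"), match.group("message"))
            )
    return hits
```

Applied to the log above, this should leave only the device placement error, the two `TensorRTOptimizer` warnings, and the `Executor failed to create kernel` error.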
It takes almost 20 seconds to run inference on 500 images of shape 299x299x3.
But if I set max_batch_size=1, I don't get any error and the inference time is significantly lower: 6 seconds.
What could be wrong?
NVIDIA GPU - Tesla P100
CUDA V9.1.85
cuDNN version 7
TensorRT version 4.0.1
TensorFlow version 1.12.0-dev20181012
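
A note on the timings quoted above: each number comes from a single timed `sess.run` call, and the first call in a session usually also pays one-time setup costs, so it may not reflect steady-state inference time. A minimal sketch of a fairer measurement follows. The helper `time_inference` and its parameters are hypothetical names introduced here for illustration; it uses only the Python standard library.

```python
import time


def time_inference(run_once, warmup_runs=1, timed_runs=3, clock=time.perf_counter):
    """Call run_once untimed a few times, then return per-call durations in seconds.

    run_once is any zero-argument callable that performs one inference pass.
    """
    for _ in range(warmup_runs):
        run_once()  # absorbs one-time setup cost; the result is discarded

    durations = []
    for _ in range(timed_runs):
        start = clock()
        run_once()
        durations.append(clock() - start)
    return durations
```

In the script above it could be called inside the session block as `time_inference(lambda: sess.run([output_node[0].outputs[0]]))`, and the smallest or median duration compared between the two `max_batch_size` settings.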