Problem with TF-TRT

Hello, guys.

I have a problem with TensorRT and its implementation in TensorFlow.
I’m trying to create a TensorRT engine for inception_v3 with TF-TRT.
My code:

import os
import time
import pickle

import cv2
import numpy as np
import tensorflow as tf
import matplotlib.image as mpl
import tensorflow.contrib.tensorrt as trt
from keras import backend as K
from keras.models import load_model
from sklearn.metrics import mean_squared_error


# load and prepare images
res = []
for image in sorted(list(os.walk('/images/2019_01_23/'))[0][2]):

    loaded_image = mpl.imread('/images/2019_01_23/' + image)
    res.append(cv2.resize(loaded_image ,(299,299))[:,:,:3].reshape(299,299,3)/128.0-1)
    
res = np.array(res)


#load keras model
file_path = '/nn.hdf5'

sess = tf.Session()
K.set_session(sess)

K.set_learning_phase(0)
model = load_model(file_path, compile=False)
K.set_learning_phase(0)

output_name = model.output.op.name
input_name = model.input.op.name
graph_def = tf.graph_util.remove_training_nodes(
    tf.graph_util.convert_variables_to_constants(
        sess,
        sess.graph.as_graph_def(),
        [output_name]
    )
)


batch_size = 500

data = res[:batch_size].astype(np.float32)

# Inference with TF-TRT frozen graph workflow:
graph = tf.Graph()
with graph.as_default():
    with tf.Session() as sess:
        # Now you can create a TensorRT inference graph from your
        # frozen graph:
        trt_graph = trt.create_inference_graph(
            input_graph_def=graph_def,
            outputs=['concatenate_9/concat'],
            max_batch_size=500,
            max_workspace_size_bytes=8589934592,
            precision_mode='FP32'
        )
        # Import the TensorRT graph into a new graph and run:
        output_node = tf.import_graph_def(
            trt_graph,
#             {'input_1':  tf.convert_to_tensor(data)},
            {'input_1': data},
            return_elements=['concatenate_9/concat']
        )
        
        t = time.time()
        val = sess.run([output_node[0].outputs[0]])
        print(time.time() - t)

In the logs I see the following error:

2019-01-25 12:27:44.218551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-25 12:27:44.218625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-25 12:27:44.218645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-01-25 12:27:44.218669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-01-25 12:27:44.218975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15099 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2019-01-25 12:27:44.882986: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 1
2019-01-25 12:27:44.885642: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-01-25 12:27:44.886134: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-25 12:27:44.886186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-25 12:27:44.886229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-01-25 12:27:44.886246: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-01-25 12:27:44.886538: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15099 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2019-01-25 12:27:45.549967: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-01-25 12:27:45.550043: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-01-25 12:27:46.139548: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:952] Engine my_trt_op_0 creation for segment 0, composed of 15 nodes succeeded.
2019-01-25 12:27:46.463974: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-01-25 12:27:46.468839: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-01-25 12:27:46.469590: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-01-25 12:27:46.469620: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 809 nodes (-196), 844 edges (-196), time = 133.293ms.
2019-01-25 12:27:46.469662: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 825 nodes (16), 846 edges (2), time = 54.398ms.
2019-01-25 12:27:46.469676: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 811 nodes (-14), 833 edges (-13), time = 785.82ms.
2019-01-25 12:27:46.469687: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 797 nodes (-14), 833 edges (0), time = 71.909ms.
2019-01-25 12:27:46.469696: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 797 nodes (0), 833 edges (0), time = 232.407ms.
2019-01-25 12:27:46.469705: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2019-01-25 12:27:46.469729: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 17 nodes (0), 15 edges (0), time = 6.53ms.
2019-01-25 12:27:46.469753: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 17 nodes (0), 15 edges (0), time = 1.818ms.
2019-01-25 12:27:46.469766: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 17 nodes (0), 15 edges (0), time = 0.152ms.
2019-01-25 12:27:46.469785: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 17 nodes (0), 15 edges (0), time = 4.705ms.
2019-01-25 12:27:46.469793: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 17 nodes (0), 15 edges (0), time = 0.118ms.
2019-01-25 12:28:06.229814: E tensorflow/core/common_runtime/executor.cc:624] Executor failed to create kernel. Invalid argument: Default MaxPoolingOp only supports NHWC on device type CPU
	 [[{{node import/max_pooling2d_1/MaxPool}} = MaxPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:GPU:0"](import/activation_3/Relu)]]

It takes almost 20 seconds to run inference on 500 images of shape 299x299x3.
But if I set max_batch_size=1, I don’t get any error and the inference time is significantly lower: 6 seconds.
What could be wrong?

NVIDIA GPU - Tesla P100
CUDA V9.1.85
cuDNN version 7
TensorRT version 4.0.1
TensorFlow version 1.12.0-dev20181012
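
A note on the timings: the 20-second and 6-second figures each appear to come from a single timed `sess.run` call. The first call in a fresh session usually also pays one-time costs such as graph optimization and memory allocation, so it may help to time a warm-up call separately from later calls. Below is a minimal sketch of that idea in plain Python. `time_inference` and `run_once` are hypothetical names, not part of TensorFlow or TensorRT, and the workload shown is only a stand-in.

```python
import time


def time_inference(run_once, warmup_runs=1, timed_runs=5):
    """Time run_once(): return (warm-up seconds, list of per-run seconds)."""
    # Warm-up calls absorb one-time setup cost and are timed as a group.
    start = time.perf_counter()
    for _ in range(warmup_runs):
        run_once()
    warmup_seconds = time.perf_counter() - start

    # Later calls are timed one by one to estimate steady-state latency.
    timings = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return warmup_seconds, timings


# Stand-in workload (assumption). In the script above, run_once could be
# something like: lambda: sess.run(output_node[0].outputs[0])
warmup, steady = time_inference(lambda: sum(range(100000)))
print("warm-up: %.4fs, best later run: %.4fs" % (warmup, min(steady)))
```

If the later runs turn out much faster than the warm-up, most of the measured time was setup rather than inference. If they stay slow, the batch itself is the bottleneck.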

Hello,

It looks like it’s complaining about an unsupported data format on the CPU. Can you verify you are working with tensorflow-gpu?

2019-01-25 12:28:06.229814: E tensorflow/core/common_runtime/executor.cc:624] Executor failed to create kernel. Invalid argument: Default MaxPoolingOp only supports NHWC on device type CPU
	 [[{{node import/max_pooling2d_1/MaxPool}} = MaxPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:GPU:0"](import/activation_3/Relu)]]

Hello.

Yes, I am probably working with the GPU.

I ran tf.test.is_gpu_available() and got True.

Or how can I verify whether TensorRT is running on the CPU, without GPU support?
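
One partial check is possible from the logs already posted, without running anything: compare the devices TensorFlow reports creating with the attributes of the node named in the error. The sketch below does this with regular expressions over two lines copied from the log above. `SAMPLE_LOG` and `summarize` are illustrative names, and the patterns assume the log format shown in this thread.

```python
import re

# Two lines copied from the log in the first post.
SAMPLE_LOG = '''
2019-01-25 12:27:44.218975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15099 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2019-01-25 12:28:06.229814: E tensorflow/core/common_runtime/executor.cc:624] Executor failed to create kernel. Invalid argument: Default MaxPoolingOp only supports NHWC on device type CPU
     [[{{node import/max_pooling2d_1/MaxPool}} = MaxPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:GPU:0"](import/activation_3/Relu)]]
'''


def summarize(log_text):
    """Extract device and data-format facts from a TensorFlow log."""
    def first(pattern):
        # Return the first captured group, or None if the pattern is absent.
        match = re.search(pattern, log_text)
        return match.group(1) if match else None

    return {
        # Devices TensorFlow reported creating at startup.
        "created_devices": re.findall(
            r"Created TensorFlow device \((\S+) with \d+ MB memory\)", log_text),
        # Device type named in the kernel-creation error.
        "kernel_device_type": first(r"on device type (\w+)"),
        # Name, layout and device annotation of the failing node.
        "node": first(r"\{\{node ([^}]+)\}\}"),
        "data_format": first(r'data_format="(\w+)"'),
        "node_device": first(r'_device="([^"]+)"'),
    }


for key, value in summarize(SAMPLE_LOG).items():
    print(key, "=", value)
```

For this log, the summary shows that a GPU device was created and that the failing MaxPool node carries a GPU device annotation with the NCHW layout, while the kernel lookup that failed was for device type CPU. This does not prove where the op actually ran. It only narrows the question to why an NCHW op annotated for the GPU ended up with a CPU kernel lookup.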

To help us debug, can you share a small repro containing the Keras model and a sample input dataset that demonstrates the errors and performance issues you are seeing?

Also, can you try it with TensorRT 5.0.2, which contains many fixes since TensorRT 4?