I used TF-TRT to create a TensorRT engine and compared its inference with the original TensorFlow model in a Jupyter notebook:

# TF-TRT converter module (import path in TF 1.14+)
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverter(
    input_graph_def=frozen_graph,
    nodes_blacklist=your_outputs,  # output nodes
    max_batch_size=10,
    is_dynamic_op=True,  # build engines at runtime
    max_workspace_size_bytes=trt.DEFAULT_TRT_MAX_WORKSPACE_SIZE_BYTES,
    precision_mode=trt.TrtPrecisionMode.FP32,
    minimum_segment_size=1,
    maximum_cached_engines=100)
trt_graph = converter.convert()

# Save the converted graph to disk
with open("/home/user/tensor/test/phone001.trt.pb", 'wb') as f:
    f.write(trt_graph.SerializeToString())

I then run inference on the converted model with this code:

import time

import numpy as np
import tensorflow as tf
from tensorflow.python.platform import gfile

input_img = np.random.random((1, 64, 64, 1))

def read_pb_graph(model):
    # Load a serialized GraphDef from disk
    with gfile.FastGFile(model, 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    return graph_def

graph = tf.Graph()
with graph.as_default():
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5, allow_growth=True)
    with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
        trt_graph = read_pb_graph(TENSORRT_MODEL_PATH)
        tf.import_graph_def(trt_graph, name='')
        # Renamed from input/output to avoid shadowing the Python builtin input()
        input_tensor = sess.graph.get_tensor_by_name('input_data:0')
        output_tensor = sess.graph.get_tensor_by_name('out_soft_2/truediv:0')

        total_time = 0
        n_time_inference = 50

        # Warm-up run, not included in the timing
        out_pred = sess.run(output_tensor, feed_dict={input_tensor: input_img})

        for i in range(n_time_inference):
            t1 = time.time()
            out_pred = sess.run(output_tensor, feed_dict={input_tensor: input_img})
            t2 = time.time()
            delta_time = t2 - t1
            total_time += delta_time
            print("needed time in inference-" + str(i) + ": ", delta_time)

        avg_time_tensorRT = total_time / n_time_inference
        print("average inference time: ", avg_time_tensorRT)

Because my input shape is not fully defined (it is (1, 64, ?, 3)), I set is_dynamic_op=True so that the engine is built at runtime when inference happens, and I set maximum_cached_engines=100 so that the built TRT engines are cached.
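To make concrete what I expect from the engine cache, here is a minimal sketch of an LRU cache keyed by input shape. It is only a toy model of the behaviour I am assuming, not TF-TRT's actual implementation: the class name `ShapeKeyedLRUCache`, the `build_fn` callback, and the string "engines" are all hypothetical stand-ins.

```python
from collections import OrderedDict


class ShapeKeyedLRUCache:
    """Toy LRU cache keyed by input shape (NOT TF-TRT's real implementation)."""

    def __init__(self, max_entries, build_fn):
        self.max_entries = max_entries
        self.build_fn = build_fn      # hypothetical stand-in for an engine build
        self.entries = OrderedDict()  # shape -> engine, least recently used first
        self.builds = 0               # number of (expensive) builds performed

    def get(self, shape):
        shape = tuple(shape)
        if shape in self.entries:
            # Cache hit: mark the entry as most recently used and reuse it.
            self.entries.move_to_end(shape)
            return self.entries[shape]
        # Cache miss: build a new engine and store it.
        self.builds += 1
        engine = self.build_fn(shape)
        self.entries[shape] = engine
        if len(self.entries) > self.max_entries:
            # Evict the least recently used entry once the cache is full.
            self.entries.popitem(last=False)
        return engine


cache = ShapeKeyedLRUCache(
    max_entries=100,
    build_fn=lambda shape: "engine for %s" % (shape,),
)
cache.get((1, 64, 64, 3))   # first call with this shape: builds
cache.get((1, 64, 64, 3))   # same shape again: expected to be a cache hit
cache.get((1, 64, 128, 3))  # different width: builds a second engine
print(cache.builds)
```

In this model the second call with the same shape does not trigger a build, which is the behaviour I expected from maximum_cached_engines=100. Note that the cache in this sketch lives only in memory, so it is empty again whenever a new cache object is created.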

But every time I run inference with the same input size, the code rebuilds the TensorRT engine, which costs a lot of time. It seems that on the second inference TensorRT does not reuse the engine that was stored in the LRU cache during the first inference. Is the cache not being used, or is there a problem with how the engine is stored during the first inference?
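One way to check where the time goes is to record the latency of every call and flag the calls that are far slower than the median. Below is a minimal sketch of that idea. The helper names `time_calls` and `slow_calls` are made up for illustration, and `fake_run` is a hypothetical stand-in for the real `sess.run(...)` call, slow only on its first invocation.

```python
import statistics
import time


def time_calls(run_once, n_calls):
    """Call run_once() n_calls times and return each call's duration in seconds."""
    durations = []
    for _ in range(n_calls):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return durations


def slow_calls(durations, factor=10.0):
    """Return the indices of calls that took more than factor times the median."""
    median = statistics.median(durations)
    return [i for i, d in enumerate(durations) if d > factor * median]


# Hypothetical stand-in for sess.run(...): slow only on the first call,
# as it would be if an engine were built once and then reused.
state = {"first": True}


def fake_run():
    if state["first"]:
        state["first"] = False
        time.sleep(0.3)    # pretend engine build
    else:
        time.sleep(0.005)  # pretend cached inference


durations = time_calls(fake_run, 10)
print(slow_calls(durations))
```

With the real model, `run_once` would wrap the `sess.run` call. If only the first call in each new session is flagged, that would suggest the engines are cached per session rather than rebuilt on every call, though this is an assumption to verify and not something the sketch proves.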

Thanks in advance.