TF-TRT, why have to create TensorRT engine every time of inference ?

guozr123 · December 13, 2019, 5:48am

I used a TF-TRT to create the tensorrt engine and inference it compare with tensorflow model in jupyter notebook:

converter = trt.TrtGraphConverter(
input_graph_def=frozen_graph,
nodes_blacklist=your_outputs, #output nodes
max_batch_size=10,
is_dynamic_op=True,
max_workspace_size_bytes=trt.DEFAULT_TRT_MAX_WORKSPACE_SIZE_BYTES,
precision_mode=trt.TrtPrecisionMode.FP32,
minimum_segment_size=1,
maximum_cached_engines=100)
trt_graph = converter.convert()
with open(“/home/user/tensor/test/phone001.trt.pb”, ‘wb’) as f:
f.write(trt_graph.SerializeToString())

and I inference the model use these code:

input_img = np.random.random((1,64,64,1))

def read_pb_graph(model):
with gfile.FastGFile(model,‘rb’) as f:
graph_def = tf.GraphDef()
graph_def.ParseFromString(f.read())
return graph_def

graph = tf.Graph()
with graph.as_default():
with tf.Session(config=tf.ConfigProto(gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.5,allow_growth=True))) as sess:
trt_graph = read_pb_graph(TENSORRT_MODEL_PATH)
tf.import_graph_def(trt_graph, name=‘’)
input = sess.graph.get_tensor_by_name(‘input_data:0’)
output = sess.graph.get_tensor_by_name(‘out_soft_2/truediv:0’)
total_time = 0; n_time_inference = 50
out_pred = sess.run(output, feed_dict={input: input_img})
for i in range(n_time_inference):
t1 = time.time()
out_pred = sess.run(output, feed_dict={input: input_img})
t2 = time.time()
delta_time = t2 - t1
total_time += delta_time
print(“needed time in inference-” + str(i) + ": ", delta_time)
avg_time_tensorRT = total_time / n_time_inference
print("average inference time: ", avg_time_tensorRT)

Because my input is undefined (which is (1, 64, ?, 3)), thus I set is_dynamic_op=True to build the engine during runtime when inference happen, and I also set the maximum_cached_engines=100 to store the TRT engine.
But every time I did the inference with same input size , the code has to build the tensorrt engine every times which really costs a lot of time, it seems that when I try to do the same inference at second time, the tensorRT did not use the engine which stored in LRU cache at the first inference time? or my store method in first inference time has some problem?

thanks for your advance.

SunilJB · December 18, 2019, 7:10am

Hi,

Few ways you can optimize the initialization:
1)Use opt_profiles in TF-TRT:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt-601/tensorrt-developer-guide/index.html#opt_profiles
2)You can also make all the shapes known beforehand and then use is_dynamic_op=False

In TF-TRT v1, you can also use the save() method to serialize the engines but that only works if you use SavedModel format.
https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#integrate-ovr

Thanks

guozr123 · December 20, 2019, 4:25am

hello, thanks for you reply
but I have no idea how to add these opt_profiles code into TF-TRT
could you give me some example of that?
my code as below:

with tf.Session() as sess:
# First deserialize your frozen graph:
with tf.gfile.GFile(“/home/user/tensor/test/model_path”, ‘rb’) as f:
frozen_graph = tf.GraphDef()
frozen_graph.ParseFromString(f.read())
your_outputs = [“output_name:0”]

convert (optimize) frozen model to TensorRT model

converter = trt.TrtGraphConverter(
input_graph_def=frozen_graph,
nodes_blacklist=your_outputs, #output nodes
max_batch_size=32,
is_dynamic_op=False,
max_workspace_size_bytes=trt.DEFAULT_TRT_MAX_WORKSPACE_SIZE_BYTES,
precision_mode=trt.TrtPrecisionMode.FP32,
minimum_segment_size=1,
maximum_cached_engines=10)
trt_graph = converter.convert()

trt_engine_opts = len([1 for n in trt_graph.node if str(n.op) == ‘TRTEngineOp’])
print(“trt_engine_opts = {}”.format(trt_engine_opts))

with open(“/home/user/tensor/test/trtmodel_path”, ‘wb’) as f:
#f.write(engine.serialize())
f.write(trt_graph.SerializeToString())

print(“TensorRT model is successfully stored!”)

SunilJB · December 20, 2019, 7:11am

Hi,

Please follow below steps to check if your entire model converts to TensorRT.
https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#tensorrt-plan

Otherwise you need to use the “converter.save(output_saved_model_dir=output_saved_model_dir)” option to save the generated optimized function.
https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples/object_detection/object_detection.py#L118

Another alternative is to convert your model to ONNX instead using tf2onnx and then convert to TensorRT using ONNX parser. Any layer that are not supported needs to be replaced by custom plugin.
https://github.com/onnx/tensorflow-onnx
https://github.com/onnx/onnx-tensorrt/blob/master/operators.md

Please refer below C++ sample for dynamic shapes:
https://github.com/NVIDIA/TensorRT/blob/07ed9b57b1ff7c24664388e5564b17f7ce2873e5/samples/opensource/sampleDynamicReshape/sampleDynamicReshape.cpp

Thanks

ra_vo · December 21, 2019, 6:43pm

I would like to point out that in your guide:
https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#tensorrt-plan

You presume that the user uses

trtgraph

from

create_inference_graph

, however the documentation states that we need to use TrtgraphconverterV2:

"The TF-TRT API in TensorFlow 2.0 has a number of differences with the API in TensorFlow 1.x which are explained here. The name of the Python class has the suffix V2, i.e. TrtGraphConverterV2. "

This function returns a graph function and not a graph definition. The documentation has some contradictions. Can you give me some pointers on how to get the correct

trt_graph

to create a tensorRT inference plan?

Container 19.11, tf2, py3

SunilJB · December 23, 2019, 5:36am

Hi,

Save model file using TF-TRT 1.x
https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#using-savedmodel

Save model file using TF-TRT 2.0
https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#worflow-with-savedmodel

You can refer below link for differences with the API in TensorFlow 1.x
https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#tf-trt-api-20

You can refer below link to generate the serialize plan file when all the model operation are optimized to TRT:
https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#tensorrt-plan

Thanks

ra_vo · December 23, 2019, 3:00pm

This is exactly what I said that does not work.
The current workflow of TF-TRT2.0 gives a graphFunction, but the serialization needs a graphDef. Please update the documentation.

ra_vo · December 23, 2019, 3:52pm

I have found the following solution to

Export TF2.0 SavedModel to TensorRT-plan

Define a model in TF2 and train it
Save the model with ``` original_saved_model_dir = "./tmp/unet/1/"
tf.saved_model.save(unet_serial,
original_saved_model_dir,
signatures=None)
```
</li>
<li>Perform conversion to FP32/FP16 
```
output_saved_model_dir_FP32 = ‘./tmp/unet-FP32/1/’

print(‘Converting to TF-TRT FP32…’)
conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(precision_mode=trt.TrtPrecisionMode.FP32,
max_workspace_size_bytes=300000000)

converter = trt.TrtGraphConverterV2(input_saved_model_dir=original_saved_model_dir,
conversion_params=conversion_params)

print(‘Done Converting to TF-TRT FP32’)

converter.convert()
converter.build(input_fn=partial(input_fn,1))
converter.save(output_saved_model_dir=output_saved_model_dir_FP32)
```
</li>
<li>Now you are going to need an updated convert_to_constants file. If you are using the nvidia tensorRt container 19.11, you can use the following commands (inside jupyter):
```
!cp /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/convert_to_constants.py /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/convert_to_constants.py.backup
!curl https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/python/framework/convert_to_constants.py -o /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/convert_to_constants.py
```
Then use the new version of <i>convert_to_constants</i> that also returns the graph
(This is missing from the tensorRT docs)
```
saved_model_loaded = tf.saved_model.load(
output_saved_model_dir_FP32, tags=[tag_constants.SERVING])
graph_func = saved_model_loaded.signatures[
signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
frozen_function, frozen_graphdef = convert_to_constants.convert_variables_to_constants_v2_as_graph(graph_func)
```
</li>

<li>Only then can you use:
```
for n in frozen_graphdef.node:
if n.op == “TRTEngineOp”:
print(“Node: %s, %s” % (n.op, n.name.replace(“/”, ““)))
with tf.io.gfile.GFile(”%s.plan" % (n.name.replace(“/”, "”)), ‘wb’) as f:
f.write(n.attr[“serialized_segment”].s)
else:
print(“Exclude Node: %s, %s” % (n.op, n.name.replace(“/”, “_”)))
```
</li>
</ol>

Appendix

Helper function:
```
from functools import partial

def input_fn(num_iterations):
for i, (batch_images, _) in enumerate(train_ds):
if i >= num_iterations:
break
yield (batch_images,)
print(" step %d/%d" % (i+1, num_iterations))
i += 1
```
These were some important imports:
```
from tensorflow.python.compiler.tensorrt import trt_convert as trt
from tensorflow.python.saved_model import tag_constants
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.framework import convert_to_constants
```
Extra tip: you can use frozen_function to do some quick inference:
```
import cProfile
import pstats

cProfile.run(‘frozen_function(images)’,‘prof’)
p = pstats.Stats(‘prof’)
p.strip_dirs().sort_stats(‘cumulative’).print_stats()