I found a strange issue when converting a frozen graph built from checkpoints: the runtime differs between the SavedModel and the checkpoint frozen graph. I see a ~30% improvement with the SavedModel, but no improvement with the frozen graph.
I also compared the frozen graph with the SavedModel and they look the same, so I am not sure whether there is some known issue in TensorRT.
Posting the conversion code for both the SavedModel and checkpoint paths for quick reference:
```python
if os.path.isfile(model_exp):
    print('Model filename: %s' % model_exp)
    with gfile.FastGFile(model_exp, 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        # JJia: TensorRT enable
        print('TensorRT Enabled', 2 << 20)
        trt_graph = trt.create_inference_graph(
            input_graph_def=graph_def,
            outputs=['embeddings:0'],
            max_batch_size=1,
            max_workspace_size_bytes=2 << 20,  # ~2 MiB workspace (2 << 30 would be 2 GiB)
            precision_mode="FP16",             # "FP32", "FP16" or "INT8"
            minimum_segment_size=1)
        # trt_graph = trt.calib_graph_to_infer_graph(trt_graph)
        tf.import_graph_def(trt_graph, input_map=input_map, name='')
else:
    print('Model directory: %s' % model_exp)
    meta_file, ckpt_file = get_model_filenames(model_exp)
    print('Metagraph file: %s' % meta_file)
    print('Checkpoint file: %s' % ckpt_file)
    saver = tf.train.import_meta_graph(os.path.join(model_exp, meta_file), input_map=input_map)
    saver.restore(tf.get_default_session(), os.path.join(model_exp, ckpt_file))
    # JJia: TensorRT enable
    print('TensorRT Enabled', 2 << 20)
    frozen_graph = tf.graph_util.convert_variables_to_constants(
        tf.get_default_session(),
        tf.get_default_graph().as_graph_def(),
        output_node_names=["embeddings"])
    # Rewrite training-only ops so the frozen graph imports cleanly for inference
    for node in frozen_graph.node:
        if node.op == 'RefSwitch':
            node.op = 'Switch'
            # for index in range(len(node.input)):
            #     node.input[index] = node.input[index] + '/read'
        elif node.op == 'AssignSub':
            node.op = 'Sub'
            if 'use_locking' in node.attr:
                del node.attr['use_locking']
    trt_graph = trt.create_inference_graph(
        input_graph_def=frozen_graph,
        outputs=["embeddings"],
        max_batch_size=1,
        max_workspace_size_bytes=2 << 20,
        precision_mode="FP16",
        minimum_segment_size=1)
    tf.import_graph_def(trt_graph, return_elements=["embeddings:0"])
```
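One detail worth double-checking in both branches above is the workspace size: `2 << 20` bytes is only about 2 MiB, while a 2 GiB workspace would be `2 << 30`. A quick sanity check in plain Python, independent of TensorRT:

```python
# Bit shifts used as byte sizes: 1 << 20 is 1 MiB, 1 << 30 is 1 GiB.
two_mib = 2 << 20
two_gib = 2 << 30
print(two_mib)              # 2097152 bytes, i.e. 2 MiB
print(two_gib)              # 2147483648 bytes, i.e. 2 GiB
print(two_gib // two_mib)   # 1024: the values differ by a factor of 1024
```

A too-small workspace can limit which layers TF-TRT converts, so this is worth ruling out when comparing runtimes.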
Engineering cannot find the checkpoint file or any frozen-graph data from the information provided so far.
Could you ask the user to provide these files?
BTW:
I have also tried a Jetson Xavier with TRT5 and CUDA 10. The Python TRT code runs on it, but the results show no runtime improvement with either the SavedModel or the checkpoint graph. I will try x86 TRT5 later.
I tried the same code on Xavier (L4T 4.1.1) with TensorRT 5: same repro, no improvement for ckpt.
I also tried TensorRT 5 on another x86 machine: same repro, no improvement for ckpt.
But I can see a TRT runtime improvement with the SavedModel.
1. Even without TensorRT, using TensorFlow alone, the runtime already differs between the frozen_graph and checkpoint paths. That is because the graph structure is slightly different. The user can check the graph structure with:
```python
graph_def = tf.get_default_graph().as_graph_def()
for node in graph_def.node:
    print(node.name, node.op)
```
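To make the comparison between the two exports concrete, the (name, op) pairs printed above can be collected into lists and diffed with a small helper. This is only a sketch in plain Python (`diff_graph_nodes` is a hypothetical name, and the example node lists are hand-written, not from a real graph):

```python
def diff_graph_nodes(nodes_a, nodes_b):
    """Compare two graphs given as lists of (name, op) pairs.

    Returns the node names present only in the first graph and only in
    the second, e.g. to spot an extra Identity node in one export.
    """
    names_a = {name for name, _ in nodes_a}
    names_b = {name for name, _ in nodes_b}
    return names_a - names_b, names_b - names_a

# Hand-written example lists standing in for the printed (name, op) pairs:
saved_model_nodes = [("label_batch", "Identity"), ("embeddings", "Identity")]
frozen_nodes = [("embeddings", "Identity")]
only_saved, only_frozen = diff_graph_nodes(saved_model_nodes, frozen_nodes)
print(only_saved, only_frozen)  # {'label_batch'} set()
```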
2. The user's use case may have issues. We modified the user's facenet.py and face.py from the GitHub repo https://github.com/JerryJiaGit/facenet_trt.
The modified version is attached here (user_patch.tar); with it, there is not much runtime difference when using TensorRT.
The usage is:
a. Extract face.py/facenet.py/replay.sh/replay.py from the .tar into the git repo.
b. `source ./replay.sh`; the user should then see that the runtime difference is very small, and mostly due to TensorFlow itself.
Basically, the idea is to clean the default graph before using TensorRT, which reduces the graph difference.
```diff
 class Encoder:
-    def __init__(self):
+    def __init__(self, model_checkpoint, use_trt):
         gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_memory_fraction)
         #gpu_options = tf.GPUOptions(allow_growth=True)
         self.sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options, log_device_placement=False))
         #sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options, log_device_placement=False))
         with self.sess.as_default():
-            facenet.load_model(facenet_model_checkpoint)
+            graph_def = facenet.load_model(model_checkpoint, use_trt=False)
+            #graph_def = self.sess.graph.as_graph_def()
+            print("[NODE] size", len(graph_def.node))
+            for node in graph_def.node:
+                print("[NODE] ", node.name, node.op)
+            sys.stdout.flush()
+
+        if use_trt:
+            self.sess.close()
+            tf.reset_default_graph()
+            self.sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options, log_device_placement=False))
+            for node in graph_def.node:
+                if node.op == 'RefSwitch':
+                    node.op = 'Switch'
+                    #for index in range(len(node.input)):
+                    #    node.input[index] = node.input[index] + '/read'
+                elif node.op == 'AssignSub':
+                    node.op = 'Sub'
+                    if 'use_locking' in node.attr:
+                        del node.attr['use_locking']
+
+            print('TensorRT Enabled', 2 << 20)
+            trt_graph = trt.create_inference_graph(
+                input_graph_def=graph_def,
+                outputs=['embeddings:0'],
+                max_batch_size=1,
+                max_workspace_size_bytes=2 << 20,  # ~2 MiB workspace (2 << 30 would be 2 GiB)
+                precision_mode="FP16",             # "FP32", "FP16" or "INT8"
+                minimum_segment_size=1)
+            tf.import_graph_def(trt_graph, input_map=None, name='')
```
3. Can you provide more details on how you measured perf numbers like the reported ones?
Face Detect Network avg time:
    original network: 41.948318 ms
    tensorrt network FP32: 41.948318 ms
    tensorrt network FP16: 42.028268 ms
Face Identify Network avg time:
    original network: 13.713258 ms
    tensorrt network FP32: 11.296281 ms
    tensorrt network FP16: 10.54711 ms
Is it measured with Python's `time` module, or something else?
Thanks.
[user_patch.tar|attachment](upload://4ilCb36fHrUUPVW0xKzenCON8e2.tar) (40 KB)
Thank you so much for your detailed reply and testing. I really appreciate it!
Because FaceNet is very popular for basic face recognition tasks, I really want to leverage the TensorRT performance boost with minimal changes to its Python code.
Comment for #1:
Yes, there is a small difference between the ckpt/meta graph and the pb SavedModel; in fact the only difference is that the SavedModel has an additional "label_batch" Identity node at the start. But I don't think that should make much difference for the TensorRT graph conversion. I did notice a runtime difference between ckpt/meta and pb SavedModel without TensorRT, but it is not large (~5%). With the TensorRT conversion, however, the difference is large (~20%).
Comment for #2:
I am checking and testing this, and will reply later.
Comment for #3:
Yes, it is the `time` module, which has limited precision. Because the runtime is on the order of tens of ms, I believe it is still okay to use when averaging over several loops.
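For higher-resolution timing, one option (a sketch, not what the original script uses) is `time.perf_counter()` with warm-up iterations, so one-time costs such as graph import and TRT engine builds are excluded from the average:

```python
import time

def avg_time_ms(fn, warmup=5, iters=50):
    """Average wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):      # warm-up runs: exclude one-time setup cost
        fn()
    start = time.perf_counter()  # monotonic, high-resolution clock
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1000.0

# Usage with a hypothetical session and fetch:
#   avg_time_ms(lambda: sess.run(embeddings, feed_dict=feed))
```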
Hi NVIDIA team,
Thank you so much for your great help and tests. Yes, your modified code works well, with similar runtime performance for the ckpt/meta graph and the SavedModel.
Yes, the issue is not related to TensorRT.
After some study, I believe I understand where the problem comes from. After TensorFlow's convert_variables_to_constants(), the default graph can no longer be updated in place, so I have to reset it and start a new session to load the new graph.