TensorRT 4 and TF 1.12, Python: runtime difference between SavedModel and checkpoint frozen graph

I have successfully converted the facenet MTCNN and Inception-ResNet v1 models to TensorRT graphs with TensorRT 4, see https://github.com/JerryJiaGit/facenet_trt

But I found a strange issue with the checkpoint frozen graph conversion: the runtime differs between the SavedModel and the checkpoint frozen graph. I see a ~30% improvement with the SavedModel, but no improvement with the frozen graph.

I also compared the frozen graph with the SavedModel and they look no different, so I am not sure whether this is a known issue in TensorRT.

Posting the code for both the SavedModel and checkpoint conversions for quick reference:

```python
if os.path.isfile(model_exp):
    print('Model filename: %s' % model_exp)
    with gfile.FastGFile(model_exp, 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        # JJia TensorRT enable
        print('TensorRT Enabled', 2 << 20)
        trt_graph = trt.create_inference_graph(
            input_graph_def=graph_def,
            outputs=['embeddings:0'],
            max_batch_size=1,
            max_workspace_size_bytes=2 << 20,  # workspace assigned to TRT (2 << 20 bytes = 2 MiB)
            precision_mode="FP16",             # precision: "FP32", "FP16" or "INT8"
            minimum_segment_size=1)
        #trt_graph = trt.calib_graph_to_infer_graph(trt_graph)
        tf.import_graph_def(trt_graph, input_map=input_map, name='')
else:
    print('Model directory: %s' % model_exp)
    meta_file, ckpt_file = get_model_filenames(model_exp)

    print('Metagraph file: %s' % meta_file)
    print('Checkpoint file: %s' % ckpt_file)

    saver = tf.train.import_meta_graph(os.path.join(model_exp, meta_file), input_map=input_map)
    saver.restore(tf.get_default_session(), os.path.join(model_exp, ckpt_file))
    # JJia TensorRT enable
    print('TensorRT Enabled', 2 << 20)
    frozen_graph = tf.graph_util.convert_variables_to_constants(
        tf.get_default_session(),
        tf.get_default_graph().as_graph_def(),
        output_node_names=["embeddings"])
    # Rewrite training-only ref ops so the frozen graph imports cleanly
    for node in frozen_graph.node:
        if node.op == 'RefSwitch':
            node.op = 'Switch'
            #for index in range(len(node.input)):
            #    node.input[index] = node.input[index] + '/read'
        elif node.op == 'AssignSub':
            node.op = 'Sub'
            if 'use_locking' in node.attr:
                del node.attr['use_locking']
    trt_graph = trt.create_inference_graph(
        input_graph_def=frozen_graph,
        outputs=["embeddings"],
        max_batch_size=1,
        max_workspace_size_bytes=2 << 20,  # same 2 MiB workspace as above
        precision_mode="FP16",
        minimum_segment_size=1)
    tf.import_graph_def(trt_graph, return_elements=["embeddings:0"])
```
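One detail worth flagging in both branches: `max_workspace_size_bytes = 2 << 20` is only 2 MiB, not 2 GB (a 2 GB workspace would be `2 << 30`), which may limit how many segments TensorRT can build engines for. A quick sanity check on the shift arithmetic:

```python
# Sanity check on the bit-shift arithmetic used for the TRT workspace size.
MiB = 1024 ** 2
GiB = 1024 ** 3
assert 2 << 20 == 2 * MiB   # 2,097,152 bytes: what the snippets actually pass
assert 2 << 30 == 2 * GiB   # what a 2 GB workspace would require
print('2 << 20 = %d bytes (%d MiB)' % (2 << 20, (2 << 20) // MiB))
```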

Hello,

we are triaging the issue and will let you know what we find. Have you tried it with the latest TRT 5.x?

Thanks for following up.

No, I haven’t had a chance to try TRT 5.x yet.

Hello,

Engineering cannot find the checkpoint file or any frozen graph data based on the current information.
Could you help ask the user to provide these data?

Hi Team,
I am using https://github.com/davidsandberg/facenet and its pre-trained model. The saved model link is:

https://drive.google.com/open?id=1R77HmFADxe87GmoLwzfgMu_HY0IhcyBz

| Model name | Training dataset | Architecture |
| --- | --- | --- |
| 20180408-102900 | CASIA-WebFace | Inception ResNet v1 |

It includes the .pb saved model and .ckpt/.meta files for the checkpoint graph.

Then, to enable TRT, just clone https://github.com/JerryJiaGit/facenet_trt and overwrite the facenet.py file of the same name. You can also modify face.py (first few lines) to choose the saved model or the checkpoint. The align folder is for MTCNN and is not necessary for debugging this issue.

BTW:
I have also tried Jetson Xavier with TRT5 and CUDA 10. The Python TRT code works on it, but judging from the results, the runtime shows no improvement with either the saved model or the checkpoint graph. I will try x86 TRT5 later.

Thanks,
Jerry

I have tried the same code on Xavier L4T 4.1.1 with TensorRT5: same repro, no improvement for ckpt.
I also tried TensorRT5 on another x86 machine: same repro, no improvement for ckpt.

But I can see TRT runtime improvement with SavedModel.

Hello,

Per engineering:

This doesn’t seem to be TensorRT related.

  1. Without using TensorRT, just using TensorFlow itself, the runtime already differs between the frozen graph and the checkpoint.

    That’s because the graph structures are slightly different. The user can check the graph structure with:

```python
graph_def = tf.get_default_graph().as_graph_def()
for node in graph_def.node:
    print(node.name, node.op)
```

2. The user's use case may have issues. We modified the user's facenet.py and face.py from the GitHub repo https://github.com/JerryJiaGit/facenet_trt.
    The modified version is attached here (user_patch.tar). With changes like these, there is not much difference when using TensorRT.
The usage is:
     a. Extract face.py/facenet.py/replay.sh/replay.py from the .tar into the git repo.
     b. Run `source ./replay.sh`; the user should then see that the runtime difference is very small, and mostly due to TensorFlow differences.

     Basically the idea is to clean the default graph before using TensorRT; this reduces the graph difference.

```diff
 class Encoder:
-     def __init__(self):
+    def __init__(self, model_checkpoint, use_trt):
         gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_memory_fraction)
         #gpu_options = tf.GPUOptions(allow_growth=True)
         self.sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options, log_device_placement=False))
         #sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options, log_device_placement=False))
         with self.sess.as_default():
-            facenet.load_model(facenet_model_checkpoint)
+            graph_def = facenet.load_model(model_checkpoint, use_trt=False)
+            #graph_def = self.sess.graph.as_graph_def()
+            print("[NODE] size", len(graph_def.node))
+            for node in graph_def.node:
+                print("[NODE] ",  node.name, node.op)
+            sys.stdout.flush()
+
+        if use_trt:
+            self.sess.close()
+            tf.reset_default_graph()
+            self.sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options, log_device_placement=False))
+            for node in graph_def.node:
+              if node.op == 'RefSwitch':
+                node.op = 'Switch'
+                #for index in range(len(node.input)):
+                #  node.input[index] = node.input[index] + '/read'
+              elif node.op == 'AssignSub':
+                node.op = 'Sub'
+                if 'use_locking' in node.attr: del node.attr['use_locking']
+
+            print('TensorRT Enabled', 2 << 20)
+            trt_graph = trt.create_inference_graph(input_graph_def=graph_def,
+                outputs=['embeddings:0'],
+                max_batch_size = 1,
+                max_workspace_size_bytes= 2 << 20, # workspace assigned to TRT (2 MiB)
+                precision_mode="FP16",  # Precision "FP32","FP16" or "INT8"
+                minimum_segment_size=1
+            )
+            tf.import_graph_def(trt_graph, input_map=None, name='')
```

3. Can you provide more details on how you got the perf numbers, like the ones reported here?
 
| Face Detect Network | Avg Time |
| --- | --- |
| original network | 41.948318 ms |
| tensorrt network FP32 | 41.948318 ms |
| tensorrt network FP16 | 42.028268 ms |

| Face Identify Network | Avg Time |
| --- | --- |
| original network | 13.713258 ms |
| tensorrt network FP32 | 11.296281 ms |
| tensorrt network FP16 | 10.54711 ms |

Is it by the Python "time" module or something else?

Thanks.
[user_patch.tar|attachment](upload://4ilCb36fHrUUPVW0xKzenCON8e2.tar) (40 KB)

Thank you so much for your detailed reply and testing. Really appreciate it!
Because facenet is very popular for basic face recognition tasks, I really want to leverage the performance boost of TensorRT with minimal changes to its Python code.

Comment for #1:
Yes, there is a small difference between the ckpt/meta graph and the .pb SavedModel; in fact the only difference is that the SavedModel has an additional "label_batch Identity" node at the start. But I don’t think that should make much difference for the TensorRT graph conversion. I did notice a runtime difference between ckpt/meta and the .pb SavedModel even without TensorRT, but that difference is not large (~5%). With the TensorRT conversion, the difference is large (~20%).
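One way to pin down exactly which nodes differ between two graphs is a set diff over (name, op) pairs. A minimal sketch; it works on any objects exposing `.name` and `.op`, such as the `NodeDef` entries of a TF 1.x `GraphDef`, and the file path and `frozen_graph` variable in the usage comment are placeholders:

```python
def graph_signature(nodes):
    """Set of (name, op) pairs for a sequence of NodeDef-like objects."""
    return {(node.name, node.op) for node in nodes}

def diff_graphs(nodes_a, nodes_b):
    """Return (only_in_a, only_in_b) as sorted lists of (name, op) pairs."""
    sig_a, sig_b = graph_signature(nodes_a), graph_signature(nodes_b)
    return sorted(sig_a - sig_b), sorted(sig_b - sig_a)

# With TF 1.x this could be used as (not executed here):
#   pb_def = tf.GraphDef()
#   with tf.gfile.GFile('20180408-102900.pb', 'rb') as f:
#       pb_def.ParseFromString(f.read())
#   only_pb, only_ckpt = diff_graphs(pb_def.node, frozen_graph.node)
```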

Comment for #2:
I am checking and testing and will reply later.

Comment for #3:
Yes, it is the “time” module, with limited precision. Since the runtime is on the order of tens of ms, I believe it is still okay, averaging the result over several loops.
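For reference, a sketch of how such an averaged measurement can be done with somewhat better resolution, using `time.perf_counter` instead of `time.time`; `run_inference` here is just a placeholder for the real `sess.run(embeddings, feed_dict=...)` call:

```python
import time

def average_ms(fn, loops=100, warmup=10):
    """Average wall-clock time of fn() in milliseconds over `loops` runs."""
    for _ in range(warmup):          # discard warm-up runs (graph init, TRT engines)
        fn()
    start = time.perf_counter()
    for _ in range(loops):
        fn()
    return (time.perf_counter() - start) / loops * 1e3

def run_inference():                 # placeholder for the real sess.run(...) call
    time.sleep(0.001)

print('avg %.3f ms' % average_ms(run_inference, loops=20, warmup=2))
```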

Hi NVIDIA team,
Thank you so much for your great help and tests. Yes, your modified code works well, with similar runtime performance for the ckpt/meta graph and the SavedModel.

Yes, the issue is not related to TensorRT.

After some study, I believe I understand where the problem comes from. Something strange happens after TensorFlow's convert_variables_to_constants(): the graph cannot be updated afterwards, so you have to reset and start a new session to load the new graph.

https://github.com/JerryJiaGit/facenet_trt has been updated with a workaround.

Thank you again!
Jerry