TensorRT (TF-TRT) doesn't improve a TF model on a GeForce GTX 1060?

Hi,

I have (i) a Keras model, which I save as (ii) a TensorFlow model. Following the TF-TRT workflow, I convert (ii) to a frozen model and then optimize it into trt_graph as follows.

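import tensorflow.contrib.tensorrt as trt  # TF 1.x contrib module

# frozen_graph: the frozen GraphDef; your_outputs: list of output node names
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=your_outputs,
    max_batch_size=2,
    max_workspace_size_bytes=1 << 30,
    precision_mode="FP32")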

However, there is almost no improvement over running inference with the standard TF model (native_tf_graph). Both trt_graph and native_tf_graph take about 0.12 s per 2 images on a GeForce 1060.

Detailed info:

  1. Input image: 28x28 grayscale MNIST image
  2. Network: trained in Keras, then converted to a TensorFlow model. Depth of conv1=conv2=conv3=2000, conv4=500, with 1024 nodes in the FC layers. I intentionally made the model very deep, expecting to see a large enough improvement from the TRT runtime engine.
  3. Model size: 1.2 GB (for the TensorFlow weight model)
  4. OS: Ubuntu 16.04 with a GeForce 1060 GPU

Any suggestions on how I can improve the inference time?

I’ll need to reproduce this to see what’s going on. Can you send us the performance code for TensorFlow and TensorRT along with the model (.pb file)? You can DM me if you don’t want to post it publicly.

Thanks for the reply. Please find the code, dataset and trained model here:
https://drive.google.com/drive/folders/1Bz7rh5Ku2GGJAYMI7RRETyIlbzdQggdf?usp=sharing

The trained model is in a separate zip file from the code and dataset because of its size. You can either download the trained model or retrain it. All the instructions are in the README.txt in “Code&Dataset.zip”. The code (network) is only for testing the ‘Keras_to_TRT’ performance and is not confidential, so I have posted it publicly at the link above, and others may try something similar :)

Looking forward to hearing from you about why there is almost no improvement with the TRT model.

Many thanks.

I suspect something may not be GPU-enabled in your setup. Are you using the containers from our registry? I started off using the nvcr.io/nvidia/tensorrt:18.08-py3 image from our container registry.

I noticed that it reports “Collecting tensorflow (from -r /opt/tensorrt/samples/python/uff_custom_plugin/requirements.txt (line 4))” during the Python initialization.

python 3_inference_using_TF_model.py
results:
average inference time:  0.13442354679107665

python 4_inference_using_TensorRT_model.py 
reports: 
2018-12-21 21:48:03.201889: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0

results:
average inference time:  0.1432429075241089

===========
Switching over to tensorrt:18.08-py3 I see that it’s “Collecting tensorflow-gpu (from -r /opt/tensorrt/python/requirements.txt (line 1))” and re-running gets me:

python 3_inference_using_TF_model.py
results
average inference time:  0.10110781669616699

python 4_inference_using_TensorRT_model.py
reports:
Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7

results:
average inference time:  0.05179745674133301

I think that’s the improvement you’re looking for. If you’re using our containers, try dropping back to 18.08. There’s an issue in the newer ones that we’ll need to sort out on our side. If you’re not using our containers, let’s take a closer look at the output you’re getting when you run your scripts.
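The "Number of eligible GPUs" line quoted above is a quick signal of whether the installed TensorFlow build can see a GPU at all. As a minimal sketch using only the standard library, the count can be pulled out of captured log text like this. The function name eligible_gpu_count is hypothetical, and the pattern assumes the exact wording of the log line in this thread, which other TensorFlow versions may not share.

```python
import re

# Pattern for the grappler line quoted in this thread; other TensorFlow
# versions may word it differently (assumption).
_ELIGIBLE_RE = re.compile(r"Number of eligible GPUs \(core count >= \d+\): (\d+)")


def eligible_gpu_count(log_text):
    """Return the last reported eligible-GPU count, or None if the line is absent."""
    matches = _ELIGIBLE_RE.findall(log_text)
    return int(matches[-1]) if matches else None


cpu_only_log = (
    "2018-12-21 21:48:03.201889: I tensorflow/core/grappler/devices.cc:51] "
    "Number of eligible GPUs (core count >= 8): 0"
)
if eligible_gpu_count(cpu_only_log) == 0:
    print("TensorFlow sees no usable GPU; check that tensorflow-gpu is installed.")
```

A count of 0 matches the case where pip collected the CPU-only tensorflow package, while the runs that used tensorflow-gpu report visible GPU devices.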

Thanks for the reply. Yes, I didn’t use a container; I installed TensorRT from the ‘deb file’ (https://docs.nvidia.com/deeplearning/sdk/tensorrt-install-guide/index.html#installing). However, with this installation setup I can successfully optimize the provided ResNet.pb using this sample code (https://developer.download.nvidia.com/devblogs/tftrt_sample.tar.xz). Here is the output log from “python 4_inference_using_TensorRT_model.py”.

2018-12-24 13:53:55.840967: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-12-24 13:53:55.933085: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-24 13:53:55.933542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.759
pciBusID: 0000:01:00.0
totalMemory: 5.92GiB freeMemory: 5.15GiB
2018-12-24 13:53:55.933558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-24 13:53:56.301501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-24 13:53:56.301531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-12-24 13:53:56.301538: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-12-24 13:53:56.301722: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4890 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
WARNING:tensorflow:From /home/cvrc/development_dir/Keras2TRT/4_inference_using_TensorRT_model.py:33: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.gfile.GFile.
2018-12-24 13:54:04.884505: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 1
2018-12-24 13:54:04.884685: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2018-12-24 13:54:04.884993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-24 13:54:04.885030: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-24 13:54:04.885046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-12-24 13:54:04.885051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-12-24 13:54:04.885160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4890 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-12-24 13:54:05.894121: W tensorflow/core/framework/allocator.cc:122] Allocation of 400000000 exceeds 10% of system memory.
2018-12-24 13:54:06.075589: W tensorflow/core/framework/allocator.cc:122] Allocation of 100000000 exceeds 10% of system memory.
2018-12-24 13:54:06.123622: W tensorflow/core/framework/allocator.cc:122] Allocation of 100352000 exceeds 10% of system memory.
2018-12-24 13:54:06.171134: W tensorflow/core/framework/allocator.cc:122] Allocation of 400000000 exceeds 10% of system memory.
2018-12-24 13:54:06.171172: W tensorflow/core/framework/allocator.cc:122] Allocation of 400000000 exceeds 10% of system memory.
2018-12-24 13:54:09.360580: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:853] MULTIPLE tensorrt candidate conversion: 2
2018-12-24 13:54:09.361285: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2018-12-24 13:54:09.361300: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2018-12-24 13:54:09.368508: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2018-12-24 13:54:09.368535: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
Cuda error in file src/implicit_gemm.cu at line 585: out of memory
2018-12-24 13:54:14.911454: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger cudnnFusedConvActLayer.cpp (275) - Cuda Error in executeFused: 2
2018-12-24 13:54:14.952551: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger cudnnFusedConvActLayer.cpp (275) - Cuda Error in executeFused: 2
2018-12-24 13:54:14.994093: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 19 nodes failed: Internal: Failed to build TensorRT engine. Skipping...
2018-12-24 13:54:18.346395: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:952] Engine my_trt_op_1 creation for segment 1, composed of 10 nodes succeeded.
2018-12-24 13:54:21.612352: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2018-12-24 13:54:21.612401: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
Cuda error in file src/implicit_gemm.cu at line 585: out of memory
2018-12-24 13:54:27.243171: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger cudnnFusedConvActLayer.cpp (275) - Cuda Error in executeFused: 2
2018-12-24 13:54:27.260945: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger cudnnFusedConvActLayer.cpp (275) - Cuda Error in executeFused: 2
2018-12-24 13:54:27.307122: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 19 nodes failed: Internal: Failed to build TensorRT engine. Skipping...
2018-12-24 13:54:28.779179: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2018-12-24 13:54:30.019933: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2018-12-24 13:54:30.326941: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2018-12-24 13:54:30.589918: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2018-12-24 13:54:30.590971: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2018-12-24 13:54:30.590996: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 39 nodes (-7), 39 edges (-7), time = 2894.01099ms.
2018-12-24 13:54:30.591001: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 44 nodes (5), 44 edges (5), time = 284.807ms.
2018-12-24 13:54:30.591005: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 35 nodes (-9), 35 edges (-9), time = 9274.73438ms.
2018-12-24 13:54:30.591009: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 35 nodes (0), 35 edges (0), time = 1202.29102ms.
2018-12-24 13:54:30.591012: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 35 nodes (0), 35 edges (0), time = 7758.31787ms.
2018-12-24 13:54:30.591016: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2018-12-24 13:54:30.591019: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 20 nodes (0), 19 edges (0), time = 1233.94104ms.
2018-12-24 13:54:30.591023: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 20 nodes (0), 19 edges (0), time = 236.156ms.
2018-12-24 13:54:30.591026: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 20 nodes (0), 19 edges (0), time = 0.248ms.
2018-12-24 13:54:30.591029: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 20 nodes (0), 19 edges (0), time = 1240.49097ms.
2018-12-24 13:54:30.591033: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 20 nodes (0), 19 edges (0), time = 0.193ms.
2018-12-24 13:54:30.591036: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_1_native_segment
2018-12-24 13:54:30.591039: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 11 nodes (0), 10 edges (0), time = 257.609ms.
2018-12-24 13:54:30.591042: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 11 nodes (0), 10 edges (0), time = 48.044ms.
2018-12-24 13:54:30.591046: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 11 nodes (0), 10 edges (0), time = 0.17ms.
2018-12-24 13:54:30.591049: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 11 nodes (0), 10 edges (0), time = 262.789ms.
2018-12-24 13:54:30.591052: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 11 nodes (0), 10 edges (0), time = 0.197ms.
needed time in inference-0:  0.5824737548828125
needed time in inference-1:  0.12596654891967773
needed time in inference-2:  0.13120508193969727
needed time in inference-3:  0.12677717208862305
needed time in inference-4:  0.13308978080749512
needed time in inference-5:  0.12937235832214355
needed time in inference-6:  0.13129234313964844
needed time in inference-7:  0.12482571601867676
needed time in inference-8:  0.13338255882263184
needed time in inference-9:  0.12623810768127441
needed time in inference-10:  0.13152170181274414
needed time in inference-11:  0.12401390075683594
needed time in inference-12:  0.1333603858947754
needed time in inference-13:  0.12788605690002441
needed time in inference-14:  0.13037347793579102
needed time in inference-15:  0.12755250930786133
needed time in inference-16:  0.13016366958618164
needed time in inference-17:  0.12825989723205566
needed time in inference-18:  0.1304922103881836
needed time in inference-19:  0.12964320182800293
needed time in inference-20:  0.13305997848510742
needed time in inference-21:  0.12588143348693848
needed time in inference-22:  0.13349699974060059
needed time in inference-23:  0.12367844581604004
needed time in inference-24:  0.1296827793121338
needed time in inference-25:  0.12791967391967773
needed time in inference-26:  0.13143277168273926
needed time in inference-27:  0.12555861473083496
needed time in inference-28:  0.12971806526184082
needed time in inference-29:  0.1254715919494629
needed time in inference-30:  0.1312105655670166
needed time in inference-31:  0.12701940536499023
needed time in inference-32:  0.130279541015625
needed time in inference-33:  0.12657809257507324
needed time in inference-34:  0.13151812553405762
needed time in inference-35:  0.12312507629394531
needed time in inference-36:  0.13004088401794434
needed time in inference-37:  0.12745976448059082
needed time in inference-38:  0.1300811767578125
needed time in inference-39:  0.1262650489807129
needed time in inference-40:  0.12478852272033691
needed time in inference-41:  0.12713932991027832
needed time in inference-42:  0.12752389907836914
needed time in inference-43:  0.1264796257019043
needed time in inference-44:  0.12953758239746094
needed time in inference-45:  0.12877130508422852
needed time in inference-46:  0.1282973289489746
needed time in inference-47:  0.12683439254760742
needed time in inference-48:  0.12511801719665527
needed time in inference-49:  0.12758588790893555
average inference time:  0.13758888721466064
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'
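In the log above, the first run (inference-0) takes about 0.58 s while later runs take about 0.13 s, so the reported average is pulled up by what is presumably one-time initialization. A minimal sketch of reporting the mean with and without warm-up runs follows; summarize_timings is a hypothetical helper, and the sample values are the first five timings from the log, rounded to four decimals.

```python
def summarize_timings(seconds, warmup_runs=1):
    """Return (mean over all runs, mean after dropping the first warmup_runs)."""
    if len(seconds) <= warmup_runs:
        raise ValueError("need more timings than warm-up runs")
    steady = seconds[warmup_runs:]
    return sum(seconds) / len(seconds), sum(steady) / len(steady)


# First five timings from the log above, rounded to four decimals.
timings = [0.5825, 0.1260, 0.1312, 0.1268, 0.1331]
mean_all, mean_steady = summarize_timings(timings)
print("mean incl. warm-up: %.4f s" % mean_all)
print("mean excl. warm-up: %.4f s" % mean_steady)
```

Comparing the steady-state means of the native and TF-TRT graphs gives a fairer picture than averages that include the first run.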

While waiting for your answer, I’m setting up the ‘tensorrt:18.08-py3’ container as you suggested above. In addition, I have some further questions:

  1. Can I also use the container on a Jetson TX2? For now, I’m installing it on an Ubuntu desktop.
  2. Generally, on a Jetson TX2, we have two options for optimizing a deep learning model into a TensorRT graph: (i) TF-TRT and (ii) the TensorRT C++ API. Do (i) and (ii) give the same performance, i.e., the same FPS for the same input model, on a Jetson TX2? If not, and the difference is quite big, I will consider using (ii) the TensorRT C++ API instead of (i) TF-TRT. In my case, the current model is built in Keras (I can rewrite my code in TensorFlow if needed) and will be deployed on a Jetson TX2.
  3. If I use (i) TF-TRT, does the optimized FPS differ depending on whether the model is built in Keras (then converted to TensorFlow) or directly in TensorFlow?

Many thanks!

UPDATE

Running “python 4_inference_using_TensorRT_model.py” in the “tensorrt:18.08-py3” container also gives a similarly unoptimized result:

/usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 88 from C header, got 96 from PyObject
  return f(*args, **kwds)
2018-12-24 09:12:26.094171: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-12-24 09:12:26.184971: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-24 09:12:26.185448: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.759
pciBusID: 0000:01:00.0
totalMemory: 5.92GiB freeMemory: 5.11GiB
2018-12-24 09:12:26.185477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-24 09:12:26.543391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-24 09:12:26.543433: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-12-24 09:12:26.543441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-12-24 09:12:26.543718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4846 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
WARNING:tensorflow:From 4_inference_using_TensorRT_model.py:33: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.gfile.GFile.
2018-12-24 09:12:34.054064: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 1
2018-12-24 09:12:34.054305: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2018-12-24 09:12:34.054588: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-24 09:12:34.054618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-24 09:12:34.054627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-12-24 09:12:34.054633: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-12-24 09:12:34.054744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4846 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-12-24 09:12:38.351297: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:853] MULTIPLE tensorrt candidate conversion: 2
2018-12-24 09:12:38.351555: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2018-12-24 09:12:38.351570: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2018-12-24 09:12:38.357170: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2018-12-24 09:12:38.357196: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
Cuda error in file src/implicit_gemm.cu at line 585: out of memory
2018-12-24 09:12:42.824881: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger cudnnFusedConvActLayer.cpp (275) - Cuda Error in executeFused: 2
2018-12-24 09:12:42.846145: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger cudnnFusedConvActLayer.cpp (275) - Cuda Error in executeFused: 2
2018-12-24 09:12:42.883598: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 19 nodes failed: Internal: Failed to build TensorRT engine. Skipping...
2018-12-24 09:12:45.850951: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:952] Engine my_trt_op_1 creation for segment 1, composed of 10 nodes succeeded.
2018-12-24 09:12:49.008334: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2018-12-24 09:12:49.008373: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
Cuda error in file src/implicit_gemm.cu at line 585: out of memory
2018-12-24 09:12:54.341589: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger cudnnFusedConvActLayer.cpp (275) - Cuda Error in executeFused: 2
2018-12-24 09:12:54.360854: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger cudnnFusedConvActLayer.cpp (275) - Cuda Error in executeFused: 2
2018-12-24 09:12:54.402068: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 19 nodes failed: Internal: Failed to build TensorRT engine. Skipping...
2018-12-24 09:12:55.781084: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2018-12-24 09:12:56.934599: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2018-12-24 09:12:57.218973: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2018-12-24 09:12:57.453657: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2018-12-24 09:12:57.454507: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2018-12-24 09:12:57.454525: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 39 nodes (-7), 39 edges (-7), time = 2783.69ms.
2018-12-24 09:12:57.454535: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 44 nodes (5), 44 edges (5), time = 264.162ms.
2018-12-24 09:12:57.454550: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 35 nodes (-9), 35 edges (-9), time = 7763.08ms.
2018-12-24 09:12:57.454556: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 35 nodes (0), 35 edges (0), time = 1151.07605ms.
2018-12-24 09:12:57.454562: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 35 nodes (0), 35 edges (0), time = 7400.04395ms.
2018-12-24 09:12:57.454568: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2018-12-24 09:12:57.454574: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 20 nodes (0), 19 edges (0), time = 1154.69ms.
2018-12-24 09:12:57.454592: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 20 nodes (0), 19 edges (0), time = 222.528ms.
2018-12-24 09:12:57.454599: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 20 nodes (0), 19 edges (0), time = 0.236ms.
2018-12-24 09:12:57.454605: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 20 nodes (0), 19 edges (0), time = 1153.25696ms.
2018-12-24 09:12:57.454611: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 20 nodes (0), 19 edges (0), time = 0.264ms.
2018-12-24 09:12:57.454617: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_1_native_segment
2018-12-24 09:12:57.454623: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 11 nodes (0), 10 edges (0), time = 238.241ms.
2018-12-24 09:12:57.454629: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 11 nodes (0), 10 edges (0), time = 44.881ms.
2018-12-24 09:12:57.454635: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 11 nodes (0), 10 edges (0), time = 0.186ms.
2018-12-24 09:12:57.454641: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 11 nodes (0), 10 edges (0), time = 234.469ms.
2018-12-24 09:12:57.454653: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 11 nodes (0), 10 edges (0), time = 0.2ms.
needed time in inference-0:  0.5511505603790283
needed time in inference-1:  0.12833786010742188
needed time in inference-2:  0.12463760375976562
needed time in inference-3:  0.12786197662353516
needed time in inference-4:  0.12697529792785645
needed time in inference-5:  0.13029193878173828
needed time in inference-6:  0.12825393676757812
needed time in inference-7:  0.12726140022277832
needed time in inference-8:  0.128922700881958
needed time in inference-9:  0.12596487998962402
needed time in inference-10:  0.12665772438049316
needed time in inference-11:  0.13075709342956543
needed time in inference-12:  0.12952494621276855
needed time in inference-13:  0.1277174949645996
needed time in inference-14:  0.12876105308532715
needed time in inference-15:  0.12937045097351074
needed time in inference-16:  0.12622785568237305
needed time in inference-17:  0.13271570205688477
needed time in inference-18:  0.1276087760925293
needed time in inference-19:  0.12751555442810059
needed time in inference-20:  0.12947988510131836
needed time in inference-21:  0.12861347198486328
needed time in inference-22:  0.12863993644714355
needed time in inference-23:  0.1278541088104248
needed time in inference-24:  0.1224222183227539
needed time in inference-25:  0.128007173538208
needed time in inference-26:  0.12695741653442383
needed time in inference-27:  0.12907099723815918
needed time in inference-28:  0.12534546852111816
needed time in inference-29:  0.13288331031799316
needed time in inference-30:  0.13011384010314941
needed time in inference-31:  0.12917733192443848
needed time in inference-32:  0.1290149688720703
needed time in inference-33:  0.1290743350982666
needed time in inference-34:  0.1277458667755127
needed time in inference-35:  0.12714838981628418
needed time in inference-36:  0.1272110939025879
needed time in inference-37:  0.12929320335388184
needed time in inference-38:  0.12955617904663086
needed time in inference-39:  0.12946081161499023
needed time in inference-40:  0.12718725204467773
needed time in inference-41:  0.1272599697113037
needed time in inference-42:  0.12522172927856445
needed time in inference-43:  0.12158703804016113
needed time in inference-44:  0.1313614845275879
needed time in inference-45:  0.12937164306640625
needed time in inference-46:  0.1268150806427002
needed time in inference-47:  0.12861156463623047
needed time in inference-48:  0.12978434562683105
needed time in inference-49:  0.1288623809814453
average inference time:  0.13659294605255126
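Both logs above report that engine my_trt_op_0 (19 nodes) failed to build after a CUDA out-of-memory error, while my_trt_op_1 (10 nodes) succeeded. Since the failed segment is skipped, it presumably keeps running as native TensorFlow ops, which would explain the unchanged timings. A small sketch for extracting these build results from a captured log is shown below; engine_build_status is a hypothetical helper, and the pattern assumes the convert_graph.cc message format seen in this thread.

```python
import re

# Matches the convert_graph.cc lines seen in this thread; the message
# format may differ in other TensorFlow versions (assumption).
_ENGINE_RE = re.compile(
    r"Engine (\S+) creation for segment (\d+), composed of (\d+) nodes (succeeded|failed)"
)


def engine_build_status(log_text):
    """Map engine name -> (node count, True if the last build attempt succeeded)."""
    status = {}
    for name, _segment, nodes, outcome in _ENGINE_RE.findall(log_text):
        status[name] = (int(nodes), outcome == "succeeded")
    return status


sample = (
    "2018-12-24 09:12:42.883598: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] "
    "Engine my_trt_op_0 creation for segment 0, composed of 19 nodes failed: "
    "Internal: Failed to build TensorRT engine. Skipping...\n"
    "2018-12-24 09:12:45.850951: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:952] "
    "Engine my_trt_op_1 creation for segment 1, composed of 10 nodes succeeded.\n"
)
for name, (nodes, ok) in sorted(engine_build_status(sample).items()):
    print(name, nodes, "built" if ok else "not built")
```

If any engine is reported as not built, the conversion log is worth fixing before comparing inference times.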

** UPDATE: SOLUTION FOUND **

Hi, I want to share that I have found a solution for my case. The modified code can be found here: https://drive.google.com/file/d/1GX-zmP-OQP3mtbAQWzOFQeyOhterZLcV/view?usp=sharing. The modifications are:

  1. Create a tf.Session with config=tf.ConfigProto(gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.50)) before performing the TensorRT graph optimization.
  2. Isolate the inference session, e.g., by moving the inference code into a separate function.
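The two modifications can be sketched roughly as follows. This is not the code from the linked file; it is a minimal illustration assuming TensorFlow 1.x with tensorflow.contrib.tensorrt, and the names convert_with_memory_cap, run_inference and gpu_memory_split_mib are hypothetical. In the failing logs TensorFlow created its GPU device with 4890 MB out of roughly 5.15 GiB free and engine building ran out of memory; capping TensorFlow at a fraction of 0.5 appears to leave the rest of the card available for building the TensorRT engines.

```python
def gpu_memory_split_mib(total_gib, tf_fraction):
    """Return (MiB TensorFlow may claim, MiB left outside TensorFlow's pool)."""
    total_mib = total_gib * 1024.0
    tf_mib = total_mib * tf_fraction
    return tf_mib, total_mib - tf_mib


def convert_with_memory_cap(frozen_graph, output_names, tf_fraction=0.5):
    """Build the TF-TRT graph while TensorFlow is limited to tf_fraction of GPU memory."""
    # Lazy imports: TensorFlow 1.x with contrib.tensorrt is assumed to be installed.
    import tensorflow as tf
    import tensorflow.contrib.tensorrt as trt

    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=tf_fraction)
    # Creating this session first appears to set the GPU memory limit that
    # the conversion's internal session then inherits.
    with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)):
        return trt.create_inference_graph(
            input_graph_def=frozen_graph,
            outputs=output_names,
            max_batch_size=2,
            max_workspace_size_bytes=1 << 30,
            precision_mode="FP32")


def run_inference(trt_graph, input_name, output_name, batch):
    """Run one batch in its own graph and session, both released on return."""
    import tensorflow as tf

    graph = tf.Graph()
    with graph.as_default():
        tf.import_graph_def(trt_graph, name="")
    with tf.Session(graph=graph) as sess:
        return sess.run(graph.get_tensor_by_name(output_name),
                        feed_dict={graph.get_tensor_by_name(input_name): batch})


# Rough budget for the 5.92 GiB card in this thread at a fraction of 0.5.
tf_mib, rest_mib = gpu_memory_split_mib(5.92, 0.5)
print("TensorFlow: %.0f MiB, remainder: %.0f MiB" % (tf_mib, rest_mib))
```

The arithmetic helper runs without TensorFlow installed and gives a figure close to the 3032 MB device size reported in the log below; the right fraction for another model or GPU would need to be found by experiment.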

Here is the output result:

2018-12-25 10:15:13.230369: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-12-25 10:15:13.311395: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-25 10:15:13.311791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.759
pciBusID: 0000:01:00.0
totalMemory: 5.92GiB freeMemory: 5.14GiB
2018-12-25 10:15:13.311806: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-25 10:15:13.689522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-25 10:15:13.689552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-12-25 10:15:13.689558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-12-25 10:15:13.689727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3032 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
WARNING:tensorflow:From /home/cvrc/development_dir/Keras2TRT/4_inference_using_TensorRT_model_modif.py:55: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.gfile.GFile.
2018-12-25 10:15:21.693342: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 1
2018-12-25 10:15:21.693530: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2018-12-25 10:15:21.693993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-25 10:15:21.694018: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-25 10:15:21.694024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-12-25 10:15:21.694029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-12-25 10:15:21.694141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3032 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-12-25 10:15:26.100677: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:853] MULTIPLE tensorrt candidate conversion: 2
2018-12-25 10:15:26.100850: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2018-12-25 10:15:26.100861: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2018-12-25 10:15:26.108500: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2018-12-25 10:15:26.108525: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2018-12-25 10:17:04.664107: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:952] Engine my_trt_op_0 creation for segment 0, composed of 19 nodes succeeded.
2018-12-25 10:17:07.969592: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:952] Engine my_trt_op_1 creation for segment 1, composed of 10 nodes succeeded.
2018-12-25 10:17:11.384931: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2018-12-25 10:17:12.694581: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2018-12-25 10:17:13.012427: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2018-12-25 10:17:13.264191: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2018-12-25 10:17:13.265261: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2018-12-25 10:17:13.265282: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 39 nodes (-7), 39 edges (-7), time = 2824.71704ms.
2018-12-25 10:17:13.265290: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 44 nodes (5), 44 edges (5), time = 278.095ms.
2018-12-25 10:17:13.265295: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 17 nodes (-27), 17 edges (-27), time = 102135.352ms.
2018-12-25 10:17:13.265299: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 17 nodes (0), 17 edges (0), time = 1.116ms.
2018-12-25 10:17:13.265302: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 17 nodes (0), 17 edges (0), time = 1950.57495ms.
2018-12-25 10:17:13.265306: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2018-12-25 10:17:13.265310: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 20 nodes (0), 19 edges (0), time = 1224.9ms.
2018-12-25 10:17:13.265313: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 20 nodes (0), 19 edges (0), time = 236.859ms.
2018-12-25 10:17:13.265317: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 20 nodes (0), 19 edges (0), time = 0.357ms.
2018-12-25 10:17:13.265321: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 20 nodes (0), 19 edges (0), time = 1309.24304ms.
2018-12-25 10:17:13.265324: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 20 nodes (0), 19 edges (0), time = 0.375ms.
2018-12-25 10:17:13.265328: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_1_native_segment
2018-12-25 10:17:13.265331: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 11 nodes (0), 10 edges (0), time = 267.426ms.
2018-12-25 10:17:13.265335: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   layout: Graph size after: 11 nodes (0), 10 edges (0), time = 48.771ms.
2018-12-25 10:17:13.265339: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 11 nodes (0), 10 edges (0), time = 0.161ms.
2018-12-25 10:17:13.265342: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   constant folding: Graph size after: 11 nodes (0), 10 edges (0), time = 251.584ms.
2018-12-25 10:17:13.265346: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503]   TensorRTOptimizer: Graph size after: 11 nodes (0), 10 edges (0), time = 0.166ms.
2018-12-25 10:17:15.751519: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-25 10:17:15.751561: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-25 10:17:15.751568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-12-25 10:17:15.751573: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-12-25 10:17:15.751689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3032 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
needed time in inference-0:  15.006317853927612
needed time in inference-1:  0.041179656982421875
needed time in inference-2:  0.03866219520568848
needed time in inference-3:  0.03570389747619629
needed time in inference-4:  0.03510284423828125
needed time in inference-5:  0.03611278533935547
needed time in inference-6:  0.04064226150512695
needed time in inference-7:  0.04653620719909668
needed time in inference-8:  0.03874993324279785
needed time in inference-9:  0.034958839416503906
needed time in inference-10:  0.03500032424926758
needed time in inference-11:  0.03582310676574707
needed time in inference-12:  0.04033660888671875
needed time in inference-13:  0.040354251861572266
needed time in inference-14:  0.03400826454162598
needed time in inference-15:  0.033091068267822266
needed time in inference-16:  0.03689217567443848
needed time in inference-17:  0.03311920166015625
needed time in inference-18:  0.03401374816894531
needed time in inference-19:  0.03716325759887695
needed time in inference-20:  0.03750157356262207
needed time in inference-21:  0.03295445442199707
needed time in inference-22:  0.03301501274108887
needed time in inference-23:  0.03294515609741211
needed time in inference-24:  0.03882884979248047
needed time in inference-25:  0.03882479667663574
needed time in inference-26:  0.03724384307861328
needed time in inference-27:  0.03335261344909668
needed time in inference-28:  0.033097267150878906
needed time in inference-29:  0.03336834907531738
needed time in inference-30:  0.03383231163024902
needed time in inference-31:  0.03699827194213867
needed time in inference-32:  0.03634238243103027
needed time in inference-33:  0.034226417541503906
needed time in inference-34:  0.03304028511047363
needed time in inference-35:  0.032994747161865234
needed time in inference-36:  0.03297996520996094
needed time in inference-37:  0.04032158851623535
needed time in inference-38:  0.03634953498840332
needed time in inference-39:  0.033815622329711914
needed time in inference-40:  0.036960601806640625
needed time in inference-41:  0.033074140548706055
needed time in inference-42:  0.032826900482177734
needed time in inference-43:  0.036762237548828125
needed time in inference-44:  0.03467607498168945
needed time in inference-45:  0.0336000919342041
needed time in inference-46:  0.03297138214111328
needed time in inference-47:  0.032752037048339844
needed time in inference-48:  0.0335540771484375
needed time in inference-49:  0.036481618881225586
average inference time:  0.3351892137527466
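A note on reading the numbers above: the printed average (about 0.335 s) appears to include inference-0, which took about 15 s, presumably because of one-time initialization on the first run. Runs 1 to 49 each take roughly 0.033 to 0.047 s, so the steady-state latency is much lower than the printed average suggests. Below is a minimal sketch of how the average could be reported with the warm-up run excluded. The helper name `summarize_timings` and the `warmup` parameter are hypothetical and are not part of the original script.

```python
from statistics import mean


def summarize_timings(timings, warmup=1):
    """Return the mean latency with and without the first `warmup` runs.

    `timings` is a list of per-inference wall-clock times in seconds.
    This helper is illustrative only; it is not from the original script.
    """
    if len(timings) <= warmup:
        raise ValueError("need more timings than warm-up runs")
    steady = timings[warmup:]  # drop the slow first run(s)
    return {
        "mean_all": mean(timings),
        "mean_steady": mean(steady),
    }


# First five values copied (rounded) from the log above.
timings = [15.0063, 0.0412, 0.0387, 0.0357, 0.0351]
stats = summarize_timings(timings)
print("mean over all runs:      %.4f s" % stats["mean_all"])
print("mean after warm-up runs: %.4f s" % stats["mean_steady"])
```

When comparing the native TensorFlow graph against the TF-TRT graph, it is probably fairer to compare the steady-state means, since the first-run cost is paid only once per session.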

To be honest, I found this solution only by trial and error, without knowing the reason behind it, lol. If you have an explanation, I would be glad to hear it.

Thanks.

Hi All,

When I started trying TensorRT optimization, I ran into difficulties here and there, so I decided to make a series of video tutorials on how to optimize deep learning models built with Keras and TensorFlow. I also demonstrate how to optimize YOLOv3. I hope it helps those who are just beginning to use TensorRT, so that you don't run into the same difficulties I did.

  1. Optimizing TensorFlow to TensorRT:
    https://www.youtube.com/watch?v=AIGOSz2tFP8

  2. Visualizing model graph before and after TensorRT optimization:
    https://www.youtube.com/watch?v=Hum7awcBffY

  3. Optimizing Keras model to TensorRT:
    https://www.youtube.com/watch?v=ky4mFPewl8Y

  4. Optimizing YOLOv3:
    https://www.youtube.com/watch?v=stBYLsq15lY

  5. YOLOv3 sample result, before and after TensorRT optimization:
    https://www.youtube.com/watch?v=IVUl61p6efU