TensorRT optimization produces random outcomes

Hi everyone!
Our team is seeing inconsistent results when optimizing TensorFlow object detection models with TensorRT.

Setup:
Jetson Nano with system installed via SDK Manager, JetPack 4.2.2 (rev.1)
tensorflow-gpu 1.14.0+nv19.7
tensorrt 5.1.6.1
nvcc (Cuda compiler driver) release 10.0, V10.0.326
NV Power Mode: MAXN
Issue:
We trained an ssd_mobilenet_v2 model from the object detection model zoo on custom data, using a Docker image with TensorFlow 1.12. The trained model was converted from checkpoint format to a frozen graph with "export_inference_graph.py" from:
https://github.com/tensorflow/models/tree/master/research/object_detection
The frozen graph was loaded on the Jetson Nano and optimized with "create_inference_graph". When we instead try to use the TrtGraphConverter class, as described in https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#using-metagraph-checkpoint, an error occurs: "Failed to import metagraph". Using create_inference_graph with precision_mode="FP32", minimum_segment_size=3 and max_batch_size=1 does work, but it returns a different output each time it is invoked. GPU memory appears to be fully used during conversion.
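For reference, our conversion step looks roughly like this (a minimal sketch of what frozen2trt2.py does; file paths and the output node names are the standard Object Detection API ones, simplified here):

```python
# Minimal sketch of the conversion step (TF 1.14 + TF-TRT on the Nano).
# Output node names follow the standard Object Detection API conventions.
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt  # registers the TRT optimizer

# Load the frozen graph exported by export_inference_graph.py
with tf.io.gfile.GFile("frozen_inference_graph.pb", "rb") as f:
    frozen_graph = tf.compat.v1.GraphDef()
    frozen_graph.ParseFromString(f.read())

# Run the TF-TRT segmentation/conversion pass
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=["num_detections", "detection_boxes",
             "detection_scores", "detection_classes"],
    max_batch_size=1,
    precision_mode="FP32",
    minimum_segment_size=3)

# Serialize the optimized graph for later inference
with tf.io.gfile.GFile("trt_optimized_inference_graph_FP32.pb", "wb") as f:
    f.write(trt_graph.SerializeToString())
```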
Sometimes the node counts after this operation are:
all nodes pre-optimization: 2671
TRT Engine opts: 12
all nodes post-optimization: 326
LOG:

2019-10-31 12:39:16.782509: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-31 12:39:28.397544: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
WARNING:tensorflow:From frozen2trt2.py:24: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From frozen2trt2.py:25: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

2019-10-31 12:39:41.337093: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-10-31 12:39:41.353294: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 12:39:41.353445: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2019-10-31 12:39:41.353750: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-10-31 12:39:41.373688: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2019-10-31 12:39:41.374369: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2f4ddba0 executing computations on platform Host. Devices:
2019-10-31 12:39:41.374438: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-10-31 12:39:41.489729: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 12:39:41.490119: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x30abc9c0 executing computations on platform CUDA. Devices:
2019-10-31 12:39:41.490187: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): NVIDIA Tegra X1, Compute Capability 5.3
2019-10-31 12:39:41.490910: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 12:39:41.491070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216
pciBusID: 0000:00:00.0
2019-10-31 12:39:41.491164: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-31 12:39:41.491562: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-31 12:39:41.491756: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-10-31 12:39:41.491947: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-10-31 12:39:41.522212: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-10-31 12:39:41.542052: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-10-31 12:39:41.542352: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-31 12:39:41.542801: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 12:39:41.543344: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 12:39:41.543502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-31 12:39:45.637233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-31 12:39:45.637309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-10-31 12:39:45.637333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-10-31 12:39:45.637813: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 12:39:45.638164: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 12:39:45.638334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 664 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
2019-10-31 12:39:54.975125: I tensorflow/compiler/tf2tensorrt/segment/segment.cc:460] There are 154 ops of 34 different types in the graph that are not converted to TensorRT: Range, Sum, GreaterEqual, Where, Equal, Select, Size, Less, ConcatV2, Fill, Mul, ExpandDims, Unpack, GatherV2, NoOp, TopKV2, Cast, Slice, Transpose, Pad, Placeholder, Greater, Sub, Const, Pack, Identity, NonMaxSuppressionV3, Assert, Reshape, Squeeze, Add, Shape, Minimum, StridedSlice, (For more information see https://docs.nvidia.com/deeplearning/dgx/tf-trt-user-guide/index.html#supported-ops).
2019-10-31 12:39:55.346258: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:733] Number of TensorRT candidate segments: 12
2019-10-31 12:39:55.949408: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-31 12:39:56.143756: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-31 12:40:00.993647: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-31 12:41:43.617383: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_0 added for segment 0 consisting of 789 nodes succeeded.
2019-10-31 12:41:43.745074: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_1 added for segment 1 consisting of 22 nodes succeeded.
2019-10-31 12:41:43.772492: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_2 added for segment 2 consisting of 3 nodes succeeded.
2019-10-31 12:41:43.804842: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_3 added for segment 3 consisting of 3 nodes succeeded.
2019-10-31 12:41:43.833656: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_4 added for segment 4 consisting of 3 nodes succeeded.
2019-10-31 12:41:43.863932: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_5 added for segment 5 consisting of 3 nodes succeeded.
2019-10-31 12:41:43.892892: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_6 added for segment 6 consisting of 3 nodes succeeded.
2019-10-31 12:41:43.922113: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_7 added for segment 7 consisting of 3 nodes succeeded.
2019-10-31 12:41:43.970877: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/Area/TRTEngineOp_8 added for segment 8 consisting of 6 nodes succeeded.
2019-10-31 12:41:44.015790: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node Postprocessor/BatchMultiClassNonMaxSuppression/TRTEngineOp_9 added for segment 9 consisting of 14 nodes succeeded.
2019-10-31 12:41:44.046634: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node Postprocessor/BatchMultiClassNonMaxSuppression/TRTEngineOp_10 added for segment 10 consisting of 3 nodes succeeded.
2019-10-31 12:41:44.071502: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node Postprocessor/TRTEngineOp_11 added for segment 11 consisting of 3 nodes succeeded.
2019-10-31 12:41:44.167981: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:739] Optimization results for grappler item: tf_graph
2019-10-31 12:41:44.168079: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741]   constant folding: Graph size after: 1154 nodes (-1517), 1265 edges (-1691), time = 3652.29395ms.
2019-10-31 12:41:44.168104: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741]   layout: Graph size after: 1169 nodes (15), 1291 edges (26), time = 173.861ms.
2019-10-31 12:41:44.168125: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741]   constant folding: Graph size after: 1169 nodes (0), 1291 edges (0), time = 156.13ms.
2019-10-31 12:41:44.168144: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741]   TensorRTOptimizer: Graph size after: 326 nodes (-843), 404 edges (-887), time = 109313.375ms.
WARNING:tensorflow:From frozen2trt2.py:30: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

WARNING:tensorflow:From frozen2trt2.py:32: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-10-31 12:42:34.608055: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 12:42:34.608273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216
pciBusID: 0000:00:00.0
2019-10-31 12:42:34.609062: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-31 12:42:34.609540: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-31 12:42:34.609703: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-10-31 12:42:34.609867: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-10-31 12:42:34.611511: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-10-31 12:42:34.611795: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-10-31 12:42:34.611971: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-31 12:42:34.612711: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 12:42:34.613299: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 12:42:34.613521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-31 12:42:34.654486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 12:42:34.654655: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216
pciBusID: 0000:00:00.0
2019-10-31 12:42:34.654801: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-31 12:42:34.654907: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-31 12:42:34.654983: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-10-31 12:42:34.655043: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-10-31 12:42:34.655218: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-10-31 12:42:34.655503: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-10-31 12:42:34.655599: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-31 12:42:34.656073: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 12:42:34.657022: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 12:42:34.657276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-31 12:42:34.657713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-31 12:42:34.657801: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-10-31 12:42:34.657983: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-10-31 12:42:34.658647: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 12:42:34.659268: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 12:42:34.659432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 664 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
{'input_filename': 'frozen_inference_graph.pb', 'output_filename': 'trt_optimized_inference_graph_FP32.pb', 'input_path': './', 'output_path': './'}
OPTIMIZING MODEL...
All nodes pre-optimization: 2671
TRT Engine opts: 12
All nodes post-optimization: 326

This is the "better" outcome, with which we can run inference in around 100-105 ms.
Sometimes, however, the optimizer creates fewer (or no) TRT engine ops, and the output is:
all nodes pre-optimization: 2671
TRT Engine opts: 11
all nodes post-optimization: 1114
and then inference takes about 210 ms.
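The inference times above are measured with a simple timing loop along these lines (a sketch; the dummy input and tensor names are assumptions based on the standard Object Detection API graph, not our exact benchmark script):

```python
# Rough sketch of how inference time is measured on the optimized graph.
# Tensor names follow the standard Object Detection API conventions.
import time
import numpy as np
import tensorflow as tf

graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile("trt_optimized_inference_graph_FP32.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

with tf.compat.v1.Session() as sess:
    tf.import_graph_def(graph_def, name="")
    image = np.zeros((1, 300, 300, 3), dtype=np.uint8)  # dummy 300x300 input
    fetches = ["detection_boxes:0", "detection_scores:0",
               "detection_classes:0", "num_detections:0"]
    # Warm-up run so engine/cuDNN initialization is not timed
    sess.run(fetches, feed_dict={"image_tensor:0": image})
    start = time.time()
    for _ in range(50):
        sess.run(fetches, feed_dict={"image_tensor:0": image})
    print("avg inference: %.1f ms" % ((time.time() - start) / 50 * 1000))
```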
LOG:

2019-10-31 14:35:29.179125: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-31 14:35:36.724575: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
WARNING:tensorflow:From frozen2trt2.py:25: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From frozen2trt2.py:26: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

2019-10-31 14:35:49.326881: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-10-31 14:35:49.340585: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 14:35:49.340736: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2019-10-31 14:35:49.341066: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-10-31 14:35:49.360550: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2019-10-31 14:35:49.361694: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x84a1d70 executing computations on platform Host. Devices:
2019-10-31 14:35:49.361765: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-10-31 14:35:49.426557: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 14:35:49.426854: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x9a6c9a0 executing computations on platform CUDA. Devices:
2019-10-31 14:35:49.426905: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): NVIDIA Tegra X1, Compute Capability 5.3
2019-10-31 14:35:49.427454: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 14:35:49.427565: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216
pciBusID: 0000:00:00.0
2019-10-31 14:35:49.427636: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-31 14:35:49.427757: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-31 14:35:49.427848: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-10-31 14:35:49.427933: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-10-31 14:35:49.431163: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-10-31 14:35:49.433830: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-10-31 14:35:49.434029: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-31 14:35:49.434350: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 14:35:49.434659: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 14:35:49.434742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-31 14:35:51.165608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-31 14:35:51.165679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-10-31 14:35:51.165703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-10-31 14:35:51.166147: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 14:35:51.166505: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 14:35:51.166790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 473 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
2019-10-31 14:36:00.568729: I tensorflow/compiler/tf2tensorrt/segment/segment.cc:460] There are 154 ops of 34 different types in the graph that are not converted to TensorRT: Range, Sum, GreaterEqual, Where, Equal, Select, Size, Less, ConcatV2, Fill, Mul, ExpandDims, Unpack, GatherV2, NoOp, TopKV2, Cast, Slice, Transpose, Pad, Placeholder, Greater, Sub, Const, Pack, Identity, NonMaxSuppressionV3, Assert, Reshape, Squeeze, Add, Shape, Minimum, StridedSlice, (For more information see https://docs.nvidia.com/deeplearning/dgx/tf-trt-user-guide/index.html#supported-ops).
2019-10-31 14:36:00.940995: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:733] Number of TensorRT candidate segments: 12
2019-10-31 14:36:01.563045: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-31 14:36:01.774501: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-31 14:36:04.819149: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-31 14:36:18.661221: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Allocator (GPU_0_bfc) ran out of memory trying to allocate 472.48MiB (rounded to 495428864).  Current allocation summary follows.
2019-10-31 14:36:18.661887: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (256): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.662077: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (512): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.662229: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (1024): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.662386: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (2048): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.662532: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (4096): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.662674: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (8192): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.662819: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (16384): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.662962: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (32768): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.663111: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (65536): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.663259: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (131072): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.663409: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (262144): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.663561: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (524288): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.663707: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (1048576): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.663854: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (2097152): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.664029: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (4194304): 	Total Chunks: 2, Chunks in use: 2. 8.24MiB allocated for chunks. 8.24MiB in use in bin. 8.24MiB client-requested in use in bin.
2019-10-31 14:36:18.664188: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (8388608): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.664329: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (16777216): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.664729: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (33554432): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.664881: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (67108864): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.665022: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (134217728): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.665191: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (268435456): 	Total Chunks: 1, Chunks in use: 0. 464.77MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-10-31 14:36:18.665377: I tensorflow/core/common_runtime/bfc_allocator.cc:780] Bin for 472.48MiB was 256.00MiB, Chunk State: 
2019-10-31 14:36:18.665653: I tensorflow/core/common_runtime/bfc_allocator.cc:786]   Size: 464.77MiB | Requested Size: 0B | in_use: 0 | bin_num: 20, prev:   Size: 4.12MiB | Requested Size: 4.12MiB | in_use: 1 | bin_num: -1
2019-10-31 14:36:18.665803: I tensorflow/core/common_runtime/bfc_allocator.cc:793] Next region of size 495988736
2019-10-31 14:36:18.665974: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0xf00e50000 next 1 of size 4320768
2019-10-31 14:36:18.666117: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0xf0126ee00 next 2 of size 4320768
2019-10-31 14:36:18.666244: I tensorflow/core/common_runtime/bfc_allocator.cc:800] Free  at 0xf0168dc00 next 18446744073709551615 of size 487347200
2019-10-31 14:36:18.666360: I tensorflow/core/common_runtime/bfc_allocator.cc:809]      Summary of in-use Chunks by size: 
2019-10-31 14:36:18.666516: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 2 Chunks of size 4320768 totalling 8.24MiB
2019-10-31 14:36:18.666602: I tensorflow/core/common_runtime/bfc_allocator.cc:816] Sum Total of in-use chunks: 8.24MiB
2019-10-31 14:36:18.666665: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 495988736 memory_limit_: 495988736 available bytes: 0 curr_region_allocation_bytes_: 991977472
2019-10-31 14:36:18.666758: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats: 
Limit:                   495988736
InUse:                     8641536
MaxInUse:                  8641536
NumAllocs:                       2
MaxAllocSize:              4320768

2019-10-31 14:36:18.666833: W tensorflow/core/common_runtime/bfc_allocator.cc:319] **__________________________________________________________________________________________________
2019-10-31 14:36:18.667081: E tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:41] DefaultLogger resources.h (154) - OutOfMemory Error in GpuMemory: 0
2019-10-31 14:36:18.699826: E tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:41] DefaultLogger GPU memory allocation failed during tactic selection for layer: (Unnamed Layer* 0) [Scale] + (Unnamed Layer* 1) [Scale]
2019-10-31 14:36:18.701891: E tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:41] DefaultLogger resources.h (154) - OutOfMemory Error in GpuMemory: 0
2019-10-31 14:36:18.705741: W tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:838] TensorRT node TRTEngineOp_0 added for segment 0 consisting of 789 nodes failed: Internal: Failed to build TensorRT engine. Fallback to TF...
2019-10-31 14:36:18.973018: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_1 added for segment 1 consisting of 22 nodes succeeded.
2019-10-31 14:36:18.996189: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_2 added for segment 2 consisting of 3 nodes succeeded.
2019-10-31 14:36:19.018723: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_3 added for segment 3 consisting of 3 nodes succeeded.
2019-10-31 14:36:19.042173: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_4 added for segment 4 consisting of 3 nodes succeeded.
2019-10-31 14:36:19.061488: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_5 added for segment 5 consisting of 3 nodes succeeded.
2019-10-31 14:36:19.082478: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_6 added for segment 6 consisting of 3 nodes succeeded.
2019-10-31 14:36:19.105822: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node TRTEngineOp_7 added for segment 7 consisting of 3 nodes succeeded.
2019-10-31 14:36:19.161703: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/Area/TRTEngineOp_8 added for segment 8 consisting of 6 nodes succeeded.
2019-10-31 14:36:19.210589: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node Postprocessor/BatchMultiClassNonMaxSuppression/TRTEngineOp_9 added for segment 9 consisting of 14 nodes succeeded.
2019-10-31 14:36:19.239720: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node Postprocessor/BatchMultiClassNonMaxSuppression/TRTEngineOp_10 added for segment 10 consisting of 3 nodes succeeded.
2019-10-31 14:36:19.253105: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:835] TensorRT node Postprocessor/TRTEngineOp_11 added for segment 11 consisting of 3 nodes succeeded.
2019-10-31 14:36:19.352321: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:739] Optimization results for grappler item: tf_graph
2019-10-31 14:36:19.352451: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741]   constant folding: Graph size after: 1154 nodes (-1517), 1265 edges (-1691), time = 3664.3291ms.
2019-10-31 14:36:19.352479: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741]   layout: Graph size after: 1169 nodes (15), 1291 edges (26), time = 176.661ms.
2019-10-31 14:36:19.352501: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741]   constant folding: Graph size after: 1169 nodes (0), 1291 edges (0), time = 155.807ms.
2019-10-31 14:36:19.352520: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:741]   TensorRTOptimizer: Graph size after: 1114 nodes (-55), 1223 edges (-68), time = 18881.2871ms.
WARNING:tensorflow:From frozen2trt2.py:31: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

WARNING:tensorflow:From frozen2trt2.py:33: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-10-31 14:37:59.829028: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 14:37:59.829238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216
pciBusID: 0000:00:00.0
2019-10-31 14:37:59.860096: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-31 14:37:59.860362: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-31 14:37:59.860547: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-10-31 14:37:59.860628: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-10-31 14:37:59.861505: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-10-31 14:37:59.861673: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-10-31 14:37:59.887050: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-31 14:37:59.887439: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 14:37:59.887804: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 14:37:59.887915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-31 14:37:59.937706: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 14:37:59.938017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216
pciBusID: 0000:00:00.0
2019-10-31 14:37:59.938167: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-31 14:37:59.938335: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-31 14:37:59.938454: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-10-31 14:37:59.938563: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-10-31 14:37:59.938821: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-10-31 14:37:59.939007: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-10-31 14:37:59.939129: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-31 14:37:59.939702: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 14:37:59.940466: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 14:37:59.940628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-31 14:37:59.940791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-31 14:37:59.940838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-10-31 14:37:59.940876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-10-31 14:37:59.941419: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 14:37:59.942080: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-10-31 14:37:59.942348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 473 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
{'input_filename': 'frozen_inference_graph.pb', 'output_filename': 'trt_optimized_inference_graph_FP32.pb', 'input_path': './', 'output_path': './'}
OPTIMIZING MODEL...
All nodes pre-optimization: 2671
TRT Engine opts: 11
All nodes post-optimization: 1114
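For reference, here is a minimal sketch of the conversion call we use (TF 1.14 with `tensorflow.contrib.tensorrt`); the output node names are the standard object-detection-API outputs and the file names are from our script, so yours may differ:

```python
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

# Load the frozen graph exported by export_inference_graph.py
with tf.io.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:
    frozen_graph = tf.compat.v1.GraphDef()
    frozen_graph.ParseFromString(f.read())

# TF-TRT conversion with the parameters mentioned above
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=['detection_boxes', 'detection_scores',
             'detection_classes', 'num_detections'],
    max_batch_size=1,
    precision_mode='FP32',
    minimum_segment_size=3)

with tf.io.gfile.GFile('trt_optimized_inference_graph_FP32.pb', 'wb') as f:
    f.write(trt_graph.SerializeToString())
```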

Inference is performed on 600x600x3 images.

Questions:
What causes this kind of behaviour?
How do we properly optimize a TensorFlow-trained model with TensorRT for inference on a Jetson Nano?
What is the minimum inference time for ssd_mobilenet_v2 on images of this size?

Hi,

1. This may be related to the amount of available memory.
It's known that TensorFlow allocates a lot of memory for the TF session.
A different amount of free memory may lead to a different algorithm choice.
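One way to make the memory situation more reproducible is to cap how much GPU memory the TF session may take, so the amount left for engine building is similar from run to run. A sketch (the fraction value is only an example to adjust for your workload):

```python
import tensorflow as tf

# Cap the GPU memory the TF session may grab, so the memory left for
# TensorRT engine building is more predictable run-to-run.
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # example value

sess = tf.compat.v1.Session(config=config)
```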

2. We recommend using pure TensorRT instead.
Here is a tutorial of ssd_mobilenet_v2 for your reference:
https://github.com/AastaNV/TRT_object_detection

3. Here is our benchmark result for Jetson Nano.
https://developer.nvidia.com/embedded/jetson-nano-dl-inference-benchmarks

Although there is no 600x600 MobileNet-V2 test item, we achieve 8 FPS with a 960x544 input.

Thanks.

How do I resize the input when using a MIPI CSI camera? Different input sizes give the same speed.

Hello AastaLLL,
Thanks a lot for the answer. I tried to follow your advice from #2 about using a pure TensorRT model instead of TF-TRT.

I managed to run the ssd_mobilenet_v2_coco model downloaded from:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
Inference works fine with 300x300 images and takes approximately 40-45 ms.

But repeating the procedure for our custom model failed.
Instead of the default frozen graph from the model zoo, I supplied the frozen graph of our network trained to detect 2 classes. I modified the mapping file to contain only the two classes and changed the config accordingly:

import graphsurgeon as gs
import os 
path = "."+os.sep+'model_custom/frozen_inference_graph.pb'
TRTbin = 'TRT_ssd_mobilenet_v2_custom.bin'
output_name = ['NMS']
dims = [3, 600, 600]  # modified
layout = 7

def add_plugin(graph):
    all_assert_nodes = graph.find_nodes_by_op("Assert")
    graph.remove(all_assert_nodes, remove_exclusive_dependencies=True)

    all_identity_nodes = graph.find_nodes_by_op("Identity")
    graph.forward_inputs(all_identity_nodes)

    Input = gs.create_plugin_node(
        name="Input",
        op="Placeholder",
        shape=[1, 3, 600, 600]
    )

    PriorBox = gs.create_plugin_node(
        name="GridAnchor",
        op="GridAnchor_TRT",
        minSize=0.2,
        maxSize=0.95,
        aspectRatios=[1.0, 2.0, 0.5, 3.0, 0.33],
        variance=[0.1,0.1,0.2,0.2],
        featureMapShapes=[19, 10, 5, 3, 2, 1],
        numLayers=6
    )

    NMS = gs.create_plugin_node(
        name="NMS",
        op="NMS_TRT",
        shareLocation=1,
        varianceEncodedInTarget=0,
        backgroundLabelId=0,
        confidenceThreshold=1e-8,
        nmsThreshold=0.6,
        topK=100,
        keepTopK=100,
        numClasses=3,           # modified: 2 classes + background
        inputOrder=[0, 2, 1],   # modified
        # inputOrder=[1, 0, 2],
        confSigmoid=1,
        isNormalized=1
    )

    concat_priorbox = gs.create_node(
        "concat_priorbox",
        op="ConcatV2",
        axis=2
    )

    concat_box_loc = gs.create_plugin_node(
        "concat_box_loc",
        op="FlattenConcat_TRT",
    )

    concat_box_conf = gs.create_plugin_node(
        "concat_box_conf",
        op="FlattenConcat_TRT",
    )

    namespace_plugin_map = {
        "MultipleGridAnchorGenerator": PriorBox,
        "Postprocessor": NMS,
        "Preprocessor": Input,
        "ToFloat": Input,
        "image_tensor": Input,
        "Concatenate": concat_priorbox,
        "concat": concat_box_loc,
        "concat_1": concat_box_conf
    }

    graph.collapse_namespaces(namespace_plugin_map)
    graph.remove(graph.graph_outputs, remove_exclusive_dependencies=False)
    graph.find_nodes_by_op("NMS_TRT")[0].input.remove("Input")

    return graph

The modified lines are dims, numClasses, and inputOrder.
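For context, this config is consumed roughly as follows in the TRT_object_detection sample (a sketch; `path`, `output_name`, and `add_plugin` come from the config above, and the output file name is just an example):

```python
import graphsurgeon as gs
import uff

# Apply the surgical modifications from add_plugin(), then convert
# the modified graph to UFF for the TensorRT UFF parser.
dynamic_graph = add_plugin(gs.DynamicGraph(path))
uff_model = uff.from_tensorflow(
    dynamic_graph.as_graph_def(),
    output_nodes=output_name,
    output_filename='tmp.uff')
```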

Regardless of whether the input order is [0, 2, 1] or [1, 0, 2], and after re-generating the frozen file following the steps from here:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/exporting_models.md
the output is always:
[TensorRT] ERROR: UffParser: Validator error: Case: Unsupported operation _Cast
[TensorRT] ERROR: Network must have at least one output
buf = engine.serialize()
AttributeError: 'NoneType' object has no attribute 'serialize'

I found this topic, in which similar issues are discussed:
https://devtalk.nvidia.com/default/topic/1056054/jetson-tx2/how-to-retrain-ssd_inception_v2_coco_2017_11_17-from-the-tensorrt-samples/2
There, in #25, you suggested changing the name of the output layer.
1. Could that be the case here?
2. Should the "libflattenconcat.so" file be rebuilt for each model optimization? If so, could you give some hints about it?

Hi,

Cast is a dtype conversion operation, and this is handled automatically by TensorRT.
So you might try removing the operation directly, or replacing it with ToFloat.
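Since your config already forwards Identity nodes with graphsurgeon, the Cast nodes can be handled the same way. A sketch to add inside add_plugin(), next to the Assert/Identity handling (untested; note that forwarding drops the dtype conversion itself, which is fine when the plugin input already has the expected type):

```python
# Forward the inputs of Cast nodes so the unsupported op disappears
# from the graph before UFF conversion.
all_cast_nodes = graph.find_nodes_by_op("Cast")
graph.forward_inputs(all_cast_nodes)
```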

Here is a related topic for your reference:
https://devtalk.nvidia.com/default/topic/1056054/jetson-tx2/how-to-retrain-ssd_inception_v2_coco_2017_11_17-from-the-tensorrt-samples/post/5391834/#5391834

Thanks.