Optimizing TF-TRT load time

I am using tf-trt for inference (as it is more or less the only available performance option without writing plugins). My code has the following segment:

with tf.gfile.GFile('./ssd_mobilenet_v1_coco_trt.pb', 'rb') as pf:
    trt_graph.ParseFromString(pf.read())
    print("#3", time.time())
    input_names = ['image_tensor']
    output_names = ['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']

which takes a very long time to execute. My guess is that this creates an engine. Is there a way to save and load this engine to make loading quicker? When using TensorRT directly, as in the /usr/src/tensorrt/samples/python/uff_ssd sample, these functions are used to save and load the engine:

def save_engine(engine, engine_dest_path):
    buf = engine.serialize()
    with open(engine_dest_path, 'wb') as f:
        f.write(buf)

def load_engine(trt_runtime, engine_path):
    with open(engine_path, 'rb') as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

but I can’t see how to access the underlying engine from the trt API.

Currently, on my Nano in 5 W mode, ParseFromString takes about 3 minutes, and later on tf.import_graph_def(trt_graph, name='') takes another minute or so. That is a long time…
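
A minimal sketch for timing each stage separately instead of printing raw time.time() values. The names stage_timer and timings are hypothetical, and the time.sleep calls are only stand-ins for the real ParseFromString and import_graph_def calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(label, results=None):
    """Measure wall-clock time spent inside the `with` block (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print("%s: %.3f s" % (label, elapsed))


timings = {}

with stage_timer("parse", timings):
    time.sleep(0.01)  # stand-in for trt_graph.ParseFromString(pf.read())

with stage_timer("import", timings):
    time.sleep(0.01)  # stand-in for tf.import_graph_def(trt_graph, name='')
```

Wrapping each stage this way makes it easier to see which call dominates the load time.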

Hi,

I assume the function that takes the time is:

trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=output_names,
    max_batch_size=1,
    max_workspace_size_bytes=1 << 25,
    precision_mode='FP16',
    minimum_segment_size=50
)

This step builds the TensorRT engine from the TensorFlow model, and it does take a while.

Have you tried serializing trt_graph directly?
Sorry, we are not sure whether this is implemented.
But if it works, you will be able to deserialize the TensorRT engine directly without rebuilding it.
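
A minimal sketch of what "serializing trt_graph directly" could look like. It relies only on the standard protobuf message methods SerializeToString() and ParseFromString(), which a TensorFlow GraphDef provides. The helper names and the FakeGraphDef stand-in are hypothetical; the stand-in is there only so the sketch runs without TensorFlow. Whether the saved file lets TF-TRT skip rebuilding the engine is not confirmed here.

```python
import os
import tempfile


def save_graph_def(graph_def, path):
    # SerializeToString() returns the message as bytes.
    with open(path, "wb") as f:
        f.write(graph_def.SerializeToString())


def load_graph_def(graph_def, path):
    # ParseFromString() fills the given (empty) message from bytes.
    with open(path, "rb") as f:
        graph_def.ParseFromString(f.read())
    return graph_def


class FakeGraphDef:
    """Hypothetical stand-in for tf.GraphDef, so the sketch runs without TensorFlow."""

    def __init__(self, payload=b""):
        self.payload = payload

    def SerializeToString(self):
        return self.payload

    def ParseFromString(self, data):
        self.payload = data


path = os.path.join(tempfile.mkdtemp(), "model_trt.pb")
save_graph_def(FakeGraphDef(b"serialized-graph"), path)
restored = load_graph_def(FakeGraphDef(), path)
```

With TensorFlow 1.x, the object returned by trt.create_inference_graph() would be passed to save_graph_def, and an empty tf.GraphDef() to load_graph_def.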

Thanks.

So, the full code is here: https://devtalk.nvidia.com/default/topic/1051389/jetson-nano/is-there-any-demos-available-for-python-jetson-inference/post/5336347/#5336347

As you can see, I am serializing this, but the next part still takes forever… The ParseFromString call reads from the serialized file. This is the output:

#1 1557364880.002708
#2 1557364880.0028615
#3 1557365040.6913075
creating session...
2019-05-09 13:24:00.714396: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2019-05-09 13:24:00.717803: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x2c3e7b90 executing computations on platform Host. Devices:
2019-05-09 13:24:00.718401: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-09 13:24:00.806407: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:965] ARM64 does not support NUMA - returning NUMA node zero
2019-05-09 13:24:00.807208: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x2aeb1290 executing computations on platform CUDA. Devices:
2019-05-09 13:24:00.807323: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): NVIDIA Tegra X1, Compute Capability 5.3
2019-05-09 13:24:00.807906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216
pciBusID: 0000:00:00.0
totalMemory: 3.86GiB freeMemory: 1.43GiB
2019-05-09 13:24:00.808401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-09 13:24:03.358548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-09 13:24:03.358637: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-05-09 13:24:03.358686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-05-09 13:24:03.358964: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 916 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
#2.5 1557365100.7052696
#4 1557365100.7056468

Between #2 and #3 (~3 minutes):

with tf.gfile.GFile('./ssd_mobilenet_v1_coco_trt.pb', 'rb') as pf:
    trt_graph.ParseFromString(pf.read())

Between #3 and #2.5 (oops) (~1 minute):

tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True

tf_sess = tf.Session(config=tf_config)

tf.import_graph_def(trt_graph, name='')
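
The intervals quoted above can be recomputed from the printed timestamps. This is only a worked-arithmetic sketch; the stamps dict copies the values from the output, and the variable names are hypothetical.

```python
# Timestamps copied from the program output above (seconds since the epoch).
stamps = {
    "#1": 1557364880.002708,
    "#2": 1557364880.0028615,
    "#3": 1557365040.6913075,
    "#2.5": 1557365100.7052696,
    "#4": 1557365100.7056468,
}

# Order in which the markers were actually printed.
order = ["#1", "#2", "#3", "#2.5", "#4"]

# Elapsed seconds between each pair of consecutive markers.
intervals = {
    (a, b): stamps[b] - stamps[a] for a, b in zip(order, order[1:])
}

for (a, b), seconds in intervals.items():
    print("%s -> %s: %.1f s" % (a, b, seconds))
```

This gives roughly 160.7 s between #2 and #3 (the ParseFromString call) and roughly 60.0 s between #3 and #2.5 (session creation plus import_graph_def), in line with the estimates of about 3 minutes and about 1 minute.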

Hi,

It looks like there is some latency in starting TensorFlow and deserializing the model.

Is pure TensorRT an option for you?
This will require you to implement plugin layers for the unsupported operations.
But it will save time and memory, since you no longer need to create a TensorFlow session.

Thanks.

I have a solution to the “extremely long model loading time problem” of TF-TRT now. Please check out my blog post for details: https://jkjung-avt.github.io/tf-trt-revisited/.
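
The linked post is not quoted in this thread, so what follows is an assumption, not a summary of it. One cause often reported for slow ParseFromString on large graphs is protobuf running its pure-Python backend instead of a compiled one. A minimal sketch for checking which backend is active (the function name protobuf_backend is hypothetical):

```python
import os


def protobuf_backend():
    """Return the active protobuf Python backend name, or None if protobuf is not installed."""
    try:
        from google.protobuf.internal import api_implementation
    except ImportError:
        return None
    # Type() reports the backend in use, e.g. 'python' for the pure-Python one.
    return api_implementation.Type()


backend = protobuf_backend()
print("protobuf backend:", backend)
print("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION =",
      os.environ.get("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"))
```

If this reports 'python', the parse step is running in interpreted code, which may explain a load time of several minutes on a Jetson.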

That's great news! Could you please share some performance timings before I dive in?

I’ve shared my measurement numbers before. Check out the link below. Those TF-TRT optimized models used to take >10 minutes to load (with tensorflow 1.9~1.12). Now they take less than 5 seconds.

https://devtalk.nvidia.com/default/topic/1037019/jetson-tx2/tensorflow-object-detection-and-image-classification-accelerated-for-nvidia-jetson/post/5288250/#5288250

Besides TX2, I’ve also tested TF-TRT object detection models on Jetson Nano. I shared the result in this blog post: https://jkjung-avt.github.io/tf-trt-on-nano/.

Hello jkjung13, I have a FaceNet model in TensorFlow. Loading the model takes a long time, and the window hangs while it loads. What can I do to reduce the load time and run inference more quickly?

Hi. I am also having the same issue. First, I serialized the TF graph using trt.create_inference_graph() with FP16. After that, I am using another script for inference where I am loading the TRT graph using:

with tf.gfile.GFile('./ssd_mobilenet_v1_coco_trt.pb', 'rb') as pf:
    trt_graph.ParseFromString(pf.read())

It really takes a lot of time to load. Note that this does not even involve a TF session, and I am using a Jetson Nano.

Hi,

ssd_mobilenet_v1 can be executed directly with pure TensorRT.
Would you mind giving it a try?
https://github.com/AastaNV/TRT_object_detection

Thanks.

I took the “TRT_object_detection” example and created a Python demo program that can do real-time object detection with various inputs. Check out this post for more information:

https://devtalk.nvidia.com/default/topic/1050377/jetson-nano/deep-learning-inference-benchmarking-instructions/post/5395507/#5395507