Optimizing TF-TRT load time

moshe.livne · May 9, 2019, 1:28am

I am using tf-trt for inference (as it is more or less the only available performance option without writing plugins). My code has the following segment:

with tf.gfile.GFile('./ssd_mobilenet_v1_coco_trt.pb', 'rb') as pf:
    trt_graph.ParseFromString(pf.read())
    print("#3", time.time())
    input_names = ['image_tensor']
    output_names = ['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']

which takes a very long time to execute. My guess is that this creates an engine. Is there a way of saving and loading this engine to make loading quicker? When using TensorRT directly, as in the /usr/src/tensorrt/samples/python/uff_ssd sample, these functions are used to save and load the engine:

def save_engine(engine, engine_dest_path):
    buf = engine.serialize()
    with open(engine_dest_path, 'wb') as f:
        f.write(buf)

def load_engine(trt_runtime, engine_path):
    with open(engine_path, 'rb') as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

But I can’t see how to access the underlying engine from the trt API.

Currently, on my 5 W Nano, ParseFromString takes about 3 minutes, and tf.import_graph_def(trt_graph, name='') later takes another minute or so. That is a long time…
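
To see which step dominates, each stage can be timed separately instead of printing raw time.time() values. A minimal sketch using only the standard library; the `stage` helper and the `timings` dict are hypothetical names, not part of TensorFlow or TF-TRT:

```python
import time
from contextlib import contextmanager


@contextmanager
def stage(name, sink):
    """Store the wall-clock duration of the enclosed block in sink[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[name] = time.perf_counter() - start


timings = {}

# In the real script the bodies would be the slow calls, for example:
#   with stage("parse", timings):
#       trt_graph.ParseFromString(pf.read())
#   with stage("import", timings):
#       tf.import_graph_def(trt_graph, name='')
# A short sleep stands in for them here so the sketch runs without TensorFlow.
with stage("parse", timings):
    time.sleep(0.01)

for name, seconds in timings.items():
    print("{}: {:.3f} s".format(name, seconds))
```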

AastaLLL · May 9, 2019, 7:17am

Hi,

I suppose the function that takes the time is:

trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=output_names,
    max_batch_size=1,
    max_workspace_size_bytes=1 << 25,
    precision_mode='FP16',
    minimum_segment_size=50
)

This step builds the TensorRT engine from the TensorFlow model, and it does take a long time.

Have you tried serializing trt_graph directly?
Sorry, we are not sure whether this is implemented.
But if it works, you will be able to deserialize the TensorRT engine directly without rebuilding it.
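
The idea of building once and reusing the serialized result can be sketched as a small file cache. This is only an illustration under assumptions: `load_or_build` and `build_fn` are hypothetical names, and `build_fn` stands in for something like `lambda: trt.create_inference_graph(...).SerializeToString()`. It skips the rebuild only; reading and parsing the saved file still has its own cost.

```python
import os
import tempfile


def load_or_build(cache_path, build_fn):
    """Return cached bytes if cache_path exists; otherwise build, save, return.

    build_fn is a hypothetical zero-argument callable that returns bytes.
    """
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return f.read()
    data = build_fn()
    # Write to a temporary file first, then rename it into place, so an
    # interrupted build never leaves a truncated cache file behind.
    tmp_path = cache_path + '.tmp'
    with open(tmp_path, 'wb') as f:
        f.write(data)
    os.replace(tmp_path, cache_path)
    return data


# Stand-in for the expensive conversion, so the sketch runs without TensorFlow.
build_calls = []


def fake_build():
    build_calls.append(1)
    return b'serialized-graph'


with tempfile.TemporaryDirectory() as cache_dir:
    path = os.path.join(cache_dir, 'model_trt.pb')
    load_or_build(path, fake_build)  # builds and writes the file
    load_or_build(path, fake_build)  # reads the file, no rebuild
    print("builds performed:", len(build_calls))
```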

Thanks.

moshe.livne · May 9, 2019, 7:59am

So, the full code is here: https://devtalk.nvidia.com/default/topic/1051389/jetson-nano/is-there-any-demos-available-for-python-jetson-inference/post/5336347/#5336347

As you can see, I am already serializing the graph, but the next part still takes forever… ParseFromString reads from the serialized file. This is the output:

#1 1557364880.002708
#2 1557364880.0028615
#3 1557365040.6913075
creating session...
2019-05-09 13:24:00.714396: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2019-05-09 13:24:00.717803: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x2c3e7b90 executing computations on platform Host. Devices:
2019-05-09 13:24:00.718401: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-09 13:24:00.806407: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:965] ARM64 does not support NUMA - returning NUMA node zero
2019-05-09 13:24:00.807208: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x2aeb1290 executing computations on platform CUDA. Devices:
2019-05-09 13:24:00.807323: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): NVIDIA Tegra X1, Compute Capability 5.3
2019-05-09 13:24:00.807906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216
pciBusID: 0000:00:00.0
totalMemory: 3.86GiB freeMemory: 1.43GiB
2019-05-09 13:24:00.808401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-09 13:24:03.358548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-09 13:24:03.358637: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-05-09 13:24:03.358686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-05-09 13:24:03.358964: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 916 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
#2.5 1557365100.7052696
#4 1557365100.7056468

Between #2 and #3 (~3 minutes):

with tf.gfile.GFile('./ssd_mobilenet_v1_coco_trt.pb', 'rb') as pf:
    trt_graph.ParseFromString(pf.read())

Between #3 and #2.5 (oops, the labels are out of order) (~1 minute):

tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True

tf_sess = tf.Session(config=tf_config)

tf.import_graph_def(trt_graph, name='')
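
For reference, the intervals can be worked out from the timestamps printed in the output above. A small sketch; the `marks` list and the `intervals` helper are illustrative names, and the values are copied from that output:

```python
# Timestamps copied from the output above (seconds since the epoch).
marks = [
    ("#1", 1557364880.002708),
    ("#2", 1557364880.0028615),
    ("#3", 1557365040.6913075),
    ("#2.5", 1557365100.7052696),
    ("#4", 1557365100.7056468),
]


def intervals(marks):
    """Return (start_label, end_label, seconds) for each consecutive pair."""
    return [(a, b, t1 - t0) for (a, t0), (b, t1) in zip(marks, marks[1:])]


for start, end, seconds in intervals(marks):
    print("{} -> {}: {:.1f} s".format(start, end, seconds))
```

This gives about 160.7 s between #2 and #3 (the ParseFromString step) and about 60.0 s between #3 and #2.5 (session creation and import_graph_def).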

AastaLLL · May 17, 2019, 7:27am

Hi,

It looks like there is some latency in initializing TensorFlow and deserializing the model.

Is pure TensorRT an option for you?
This will require you to implement plugin layers for the unsupported operations.
But it will save time and memory, since you no longer need to create a TensorFlow session.

Thanks.

jkjung13 · May 24, 2019, 6:27am

I have a solution to the “extremely long model loading time problem” of TF-TRT now. Please check out my blog post for details: https://jkjung-avt.github.io/tf-trt-revisited/

moshe.livne · May 24, 2019, 6:47am

That's great news! Can you please share some performance timings before I dive in?

jkjung13 · May 24, 2019, 6:51am

I’ve shared my measurement numbers before. Check out the link below. Those TF-TRT optimized models used to take more than 10 minutes to load (with TensorFlow 1.9 to 1.12). Now they take less than 5 seconds.

https://devtalk.nvidia.com/default/topic/1037019/jetson-tx2/tensorflow-object-detection-and-image-classification-accelerated-for-nvidia-jetson/post/5288250/#5288250
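
The posts in this thread do not say what the fix is. One thing worth checking on your own setup, offered as an assumption rather than a summary of the linked posts, is which protobuf backend Python is using, since the pure-Python backend is known to parse large messages far more slowly than the C++ one. A minimal sketch; `protobuf_backend` is a hypothetical helper name:

```python
import os


def protobuf_backend():
    """Return the active protobuf backend name, or None if protobuf is missing."""
    try:
        from google.protobuf.internal import api_implementation
    except ImportError:
        return None
    return api_implementation.Type()


# The backend can be requested with this environment variable, but it must be
# set before google.protobuf is first imported, and 'cpp' only takes effect if
# the installed protobuf package was built with the C++ extension.
print("requested:", os.environ.get("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"))
print("active:", protobuf_backend())
```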

jkjung13 · June 3, 2019, 12:47pm

Besides the TX2, I’ve also tested TF-TRT object detection models on the Jetson Nano. I shared the results in this blog post: Testing TF-TRT Object Detectors on Jetson Nano.

harsathpanther · June 6, 2019, 1:25pm

Hello jkjung13, I have a FaceNet model in TensorFlow. Loading the model takes a long time, and the window hangs while it loads. What can I do to speed up loading and inference?

spsayakpaul · July 22, 2019, 12:24pm

Hi. I am having the same issue. First, I converted the TF graph with trt.create_inference_graph() in FP16 mode and serialized it. After that, I use a separate script for inference, which loads the TRT graph with:

with tf.gfile.GFile('./ssd_mobilenet_v1_coco_trt.pb', 'rb') as pf:
    trt_graph.ParseFromString(pf.read())

It takes a very long time to load. Note that this does not even involve a TF session, and I am using a Jetson Nano.

AastaLLL · August 2, 2019, 7:01am

Hi,

ssd_mobilenet_v1 can be executed directly with pure TensorRT.
Would you mind giving it a try?
https://github.com/AastaNV/TRT_object_detection

Thanks.

jkjung13 · October 25, 2019, 8:12am

I took the “TRT_object_detection” example and created a Python demo program that can do real-time object detection with various inputs. Check out this post for more information:

https://devtalk.nvidia.com/default/topic/1050377/jetson-nano/deep-learning-inference-benchmarking-instructions/post/5395507/#5395507