Tensorflow inference using TRT converted model

krupa.gopal · April 26, 2021, 7:42am

Description

Running a TRT-converted model with TensorFlow 1.15.2 (Nvidia Release 20.02-tf1) uses far more CPU RAM than expected (~7 GB), compared with running the model without TRT conversion (~2.5 GB).
This happens regardless of whether the model is converted dynamically or statically.

Environment

TensorRT Version:
GPU Type: GeForce RTX 2080
Nvidia Driver Version: 440.33.01
CUDA Version: 10.2
CUDNN Version: 7.6.5
Operating System + Version: Ubuntu 18.04.3 LTS
Python Version (if applicable): 3.6.9
TensorFlow Version (if applicable): 1.15.2
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Nvidia Release 20.02-tf1

Steps To Reproduce

import os 
os.environ["CUDA_VISIBLE_DEVICES"] = "0" 
import tensorflow as tf 
from tensorflow.python.compiler.tensorrt import trt_convert as trt 
import numpy as np 
import sys 
from timeit import default_timer as timer 
from tensorflow.python.platform import gfile 
from tensorflow.core.protobuf import saved_model_pb2 
from tensorflow.python.util import compat 
import time 
 
oModelDir = "./saved_model/" 
#oModelDir = "./saved_model/trt_model" 

iBatchSize = int(sys.argv[1]) 
iNumTimesToRun = int(sys.argv[2]) 
# 1: convert with TF-TRT before inference; any other value: run plain TensorFlow
iTrtConvert = int(sys.argv[3])

#Data for inference - random data. Input size hardcoded 
data = np.random.rand(416, 416, 3) 
data = (data*255).astype(np.uint8) 
data = np.expand_dims(data, 0) 
data = np.repeat(data, iBatchSize, axis=0)      

with gfile.FastGFile(oModelDir+"/saved_model.pb", 'rb') as f: 
    file_read = compat.as_bytes(f.read()) 
    sm = saved_model_pb2.SavedModel() 
    sm.ParseFromString(file_read) 

    with tf.Graph().as_default() as dumyGraph :  
        tf.import_graph_def(sm.meta_graphs[0].graph_def) 
        operations = dumyGraph.get_operations() 
        operations =[op for op in operations if op.type!="NoOp"] 
        input_node = (dumyGraph.get_tensor_by_name(operations[0].name+":0")) 
        output_node = (dumyGraph.get_tensor_by_name(operations[-1].name+":0")) 
        outputName = output_node.name.split("/")[-1] 

        #Convert to trt engine 
        if iTrtConvert == 1: 
            converter = trt.TrtGraphConverter(input_saved_model_dir=oModelDir, 
                                              nodes_blacklist=[outputName], 
                                              max_batch_size=iBatchSize, 
                                    max_workspace_size_bytes=2000,precision_mode="FP16")  
            trt_graph = converter.convert() 

            with tf.Graph().as_default() as dumyGraph: 
                tf.import_graph_def(trt_graph) 
                input_node = dumyGraph.get_tensor_by_name(operations[0].name+":0") 
                output_node = dumyGraph.get_tensor_by_name(operations[-1].name+":0") 
            #Save model if needed 
            #converter.save("./saved_model/trt_model/")                 

        #Start a session                 
        sess = None 
        with tf.device("/device:GPU:0"): 
            # Soft placement stays disabled, as in the original (the True value was overwritten).
            cfg = {'allow_soft_placement': False, 'log_device_placement': False}
            cfg['gpu_options'] = tf.GPUOptions(per_process_gpu_memory_fraction=0.3, allow_growth=True)
            cfg['device_count'] = {'GPU': 1}
            sess = tf.compat.v1.Session(graph=dumyGraph, config=tf.compat.v1.ConfigProto(**cfg))
 
        #Warmup run 
        output = sess.run([output_node], feed_dict={input_node: np.array(data)})     
        start_timer = timer() 
        for i in range(iNumTimesToRun): 
            output = sess.run([output_node], feed_dict={input_node: np.array(data)})     
        end_timer = timer() 
        total_time = end_timer - start_timer 
        average_time = total_time/float(iNumTimesToRun) 
        print (len(output[0])) 
        print ("Total time : ", total_time) 
        print ("Average time(ms)/image : ", (average_time/iBatchSize)*1000) 
        print ("FPS : ", 1/(average_time/iBatchSize))

AakankshaS · April 27, 2021, 4:22am

Hi @krupa.gopal ,
Could you please share your ONNX model and the logs with us so that we can try this on our end?
Thanks!

krupa.gopal · April 27, 2021, 8:19am

I can’t upload the exact model we are using, so I’m uploading a pre-trained model that shows similar behavior. Google Drive link: saved_model_zip - Google Drive
(I was getting an upload error when uploading it here.)
I have also updated the posted code to the exact script that I’m running.

Running the code with:
python inference.py 16 1000 0 -> runs without TRT conversion; CPU memory is around 2.6 GB

python inference.py 16 1000 1 -> runs with TRT conversion; CPU memory is around 8.5 GB

Replacing oModelDir with “saved_model/trt_model” to use the saved TRT model, and running with TRT conversion off (statically loading the model):
python inference.py 16 1000 0 -> runs the statically converted TRT model; CPU memory is around 6.6 GB

krupa.gopal · April 29, 2021, 10:19am

Hi,

Any update on the above issue?

spolisetty · May 11, 2021, 3:58pm

Hi @krupa.gopal,

Sorry for the delayed response. When we ran the script you shared, we observed the following output. Based on the output, it looks like performance improves with TRT conversion. Could you please let us know which method you are using to measure CPU memory usage?

$ python inference.py 16 1000 0
Total time : 126.4752559561748
Average time(ms)/image : 7.904703497260926
FPS : 126.50695884374562

$ python inference.py 16 1000 1
Total time : 69.24931123992428
Average time(ms)/image : 4.328081952495268
FPS : 231.04922942216248

Thank you.

krupa.gopal · May 12, 2021, 6:07am

Hi,

Thanks for the response.

This is run in the Nvidia TF Docker container (Nvidia Release 20.02-tf1).
I use “docker stats” to record the CPU memory usage of the running Nvidia TF container, specifically the “MEM USAGE / LIMIT” and “MEM %” columns.
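As a cross-check on the "docker stats" numbers, the script could also report its own peak memory from inside the process. The following is a minimal sketch using only the Python standard library; the helper name `peak_rss_mib` and the 64 MiB dummy allocation are illustrative assumptions, and the value will not match "docker stats" exactly because that reports usage for the whole container.

```python
import resource
import sys


def peak_rss_mib():
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes, macOS in bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mib()
buf = bytearray(64 * 1024 * 1024)  # stand-in for loading a model and running inference
after = peak_rss_mib()
print("peak RSS before: %.1f MiB, after: %.1f MiB" % (before, after))
```

In the inference script, the same call could be placed after the warm-up run and again after the timed loop, once with the third argument set to 0 and once set to 1.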

spolisetty · May 17, 2021, 11:02am

Hi @krupa.gopal,

At inference time, TF-TRT keeps one (or more) ICudaEngine and IExecutionContext objects in memory. The engine stores a copy of all the weights, which is expected to increase memory consumption by 50%.

We request that you try the following and share the output logs with us.

Serialize the engine to a plan file and run it with trtexec.

You can then look at the TRT-only memory utilization.
Example:

[04/07/2021-10:53:37] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[04/07/2021-10:53:43] [I] [TRT] Total Host Persistent Memory: 10400
[04/07/2021-10:53:43] [I] [TRT] Total Device Persistent Memory: 222743552
[04/07/2021-10:53:43] [I] [TRT] Total Scratch Memory: 262144
[04/07/2021-10:53:43] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 392 MiB, GPU 4 MiB
[04/07/2021-10:53:43] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1321, GPU 1801 (MiB)
[04/07/2021-10:53:43] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 1321 MiB, GPU 1775 MiB
[04/07/2021-10:53:43] [I] Engine built in 81.9529 sec.
[04/07/2021-10:53:43] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1228, GPU 1785 (MiB)
[04/07/2021-10:53:43] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 1229, GPU 1793 (MiB)

For your reference, the trtexec documentation:
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

Thank you.
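To compare runs, the memory figures can be pulled out of such a log programmatically. This is a minimal sketch that assumes the `[MemUsageSnapshot]` line format shown in the example log above; the regular expression and the function name `parse_snapshots` are illustrative, and other TensorRT versions may log these lines differently or not at all.

```python
import re

# Matches lines such as:
#   "[MemUsageSnapshot] Builder end: CPU 1321 MiB, GPU 1775 MiB"
SNAPSHOT_RE = re.compile(
    r"\[MemUsageSnapshot\]\s*(?P<stage>[^:]+):\s*"
    r"CPU (?P<cpu>\d+) MiB, GPU (?P<gpu>\d+) MiB"
)


def parse_snapshots(log_text):
    """Return a list of (stage, cpu_mib, gpu_mib) tuples found in the log text."""
    return [
        (m.group("stage").strip(), int(m.group("cpu")), int(m.group("gpu")))
        for m in SNAPSHOT_RE.finditer(log_text)
    ]


# Lines copied from the example log above.
sample = """\
[04/07/2021-10:53:43] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 392 MiB, GPU 4 MiB
[04/07/2021-10:53:43] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 1321 MiB, GPU 1775 MiB
[04/07/2021-10:53:43] [I] Engine built in 81.9529 sec.
"""

snapshots = parse_snapshots(sample)
print(snapshots)  # [('Builder end', 1321, 1775)]
```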

krupa.gopal · May 19, 2021, 6:33am

Hi,

Thanks for the reply.

Yes, we have tried using the TensorRT library directly, and it does show lower CPU memory consumption. However, TensorFlow is still the preferable inference engine for us (mainly for ease of use and scalability).

The memory consumption is not just 50% more; it is on the order of 3x.
By default, maximum_cached_engines is 1. Could multiple engines still be getting cached and contributing to the CPU memory usage?
Is this something Nvidia plans to look at and resolve in its TF integration?
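The gap between the expected 50% increase and the observed usage can be made concrete with the figures reported earlier in this thread (2.6 GB without conversion, 8.5 GB with dynamic conversion, 6.6 GB with the statically converted model). The variable names below are illustrative; the numbers are the ones quoted above.

```python
baseline_gb = 2.6      # plain TensorFlow, no TRT conversion
dynamic_trt_gb = 8.5   # TRT conversion at run time
static_trt_gb = 6.6    # saved TRT model loaded statically

# What a 50% increase over the baseline would look like.
expected_gb = baseline_gb * 1.5

dynamic_ratio = dynamic_trt_gb / baseline_gb
static_ratio = static_trt_gb / baseline_gb

print("expected with +50%%: %.1f GB" % expected_gb)
print("dynamic conversion:  %.2fx baseline" % dynamic_ratio)
print("static conversion:   %.2fx baseline" % static_ratio)
```

With these inputs, a 50% increase would land near 3.9 GB, while the measured runs are roughly 3.3x and 2.5x the baseline.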

spolisetty · May 20, 2021, 9:48am

Hi @krupa.gopal,

Could you please let us know how many engines are created? If it is a large number, that might explain the overhead. If we increase minimum_segment_size, the number of engines decreases, and therefore memory usage decreases (both host and device). Many small engines are bad for performance anyway.
We will get back to you regarding the remaining queries.

Thank you.
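One way to answer the "how many engines" question is to count the `TRTEngineOp` nodes in the converted graph. The sketch below is duck-typed so it runs without TensorFlow: it assumes the object passed in has the same `node`, `library.function`, and `node_def` fields as a TensorFlow `GraphDef`, and the `fake_graph` object is only a stand-in for illustration. In the script above, the real argument would be the `trt_graph` returned by `converter.convert()`.

```python
from types import SimpleNamespace


def count_trt_engines(graph_def):
    """Count TRTEngineOp nodes in the top-level graph and its function library."""
    nodes = list(graph_def.node)
    library = getattr(graph_def, "library", None)
    if library is not None:
        for func in library.function:
            nodes.extend(func.node_def)
    return sum(1 for node in nodes if node.op == "TRTEngineOp")


# Stand-in for a converted GraphDef (hypothetical contents, for illustration only).
fake_graph = SimpleNamespace(
    node=[
        SimpleNamespace(op="Placeholder"),
        SimpleNamespace(op="TRTEngineOp"),
        SimpleNamespace(op="TRTEngineOp"),
        SimpleNamespace(op="Identity"),
    ],
    library=SimpleNamespace(function=[]),
)

num_engines = count_trt_engines(fake_graph)
print("TRTEngineOp nodes:", num_engines)
```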

spolisetty · May 24, 2021, 10:29am

Hi @krupa.gopal,

Please allow us some time to work on this issue.

Thank you.

krupa.gopal · May 25, 2021, 10:24am

Sure.

Just an update on the previous reply:
maximum_cached_engines is 1
minimum_segment_size is the default value, 3
use_function_backup: setting this to False (default is True) did not work and gave an error
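For the suggested experiment with a larger minimum_segment_size, the converter arguments could be collected in one place so that only that value changes between runs. This is a sketch, not a tested fix: the helper `build_converter_kwargs`, the output name "output_node", and the value 10 are assumptions for illustration, while the keyword names are the ones used in the script and mentioned in this thread.

```python
def build_converter_kwargs(saved_model_dir, output_names, batch_size,
                           minimum_segment_size=10):
    """Collect TrtGraphConverter keyword arguments for one experiment run."""
    return {
        "input_saved_model_dir": saved_model_dir,
        "nodes_blacklist": list(output_names),
        "max_batch_size": batch_size,
        "max_workspace_size_bytes": 2000,  # value from the original script
        "precision_mode": "FP16",
        # Larger than the default of 3, so fewer (and larger) engines are built.
        "minimum_segment_size": minimum_segment_size,
        "maximum_cached_engines": 1,
    }


# "output_node" is a hypothetical name; use the real output node of the model.
kwargs = build_converter_kwargs("./saved_model/", ["output_node"], batch_size=16)
print(kwargs["minimum_segment_size"])

# With TensorFlow 1.15 installed, the converter could then be created as:
#   from tensorflow.python.compiler.tensorrt import trt_convert as trt
#   converter = trt.TrtGraphConverter(**kwargs)
#   trt_graph = converter.convert()
```

Repeating the memory measurement for a few values of minimum_segment_size would show whether the number of engines explains the overhead.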

