TensorFlow inference using TRT-converted model

Description

Running a TRT-converted model with TensorFlow 1.15.2 (Nvidia Release 20.02-tf1) uses far more CPU RAM than expected (~7 GB), compared with running the same model without conversion to TRT (~2.5 GB).
This issue occurs regardless of whether the model is converted dynamically or statically.

Environment

TensorRT Version:
GPU Type: GeForce RTX 2080
Nvidia Driver Version: 440.33.01
CUDA Version: 10.2
CUDNN Version: 7.6.5
Operating System + Version: Ubuntu 18.04.3 LTS
Python Version (if applicable): 3.6.9
TensorFlow Version (if applicable): 1.15.2
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Nvidia Release 20.02-tf1

Steps To Reproduce

import os 
os.environ["CUDA_VISIBLE_DEVICES"] = "0" 
import tensorflow as tf 
from tensorflow.python.compiler.tensorrt import trt_convert as trt 
import numpy as np 
import sys 
from timeit import default_timer as timer 
from tensorflow.python.platform import gfile 
from tensorflow.core.protobuf import saved_model_pb2 
from tensorflow.python.util import compat 
import time 
 
oModelDir = "./saved_model/" 
#oModelDir = "./saved_model/trt_model" 

iBatchSize = int(sys.argv[1]) 
iNumTimesToRun = int(sys.argv[2]) 
#1 - trt convert, else tensorflow     
iTrtConvert = int(sys.argv[3])     

#Data for inference - random data. Input size hardcoded 
data = np.random.rand(416, 416, 3) 
data = (data*255).astype(np.uint8) 
data = np.expand_dims(data, 0) 
data = np.repeat(data, iBatchSize, axis=0)      

with gfile.FastGFile(oModelDir+"/saved_model.pb", 'rb') as f: 
    file_read = compat.as_bytes(f.read()) 
    sm = saved_model_pb2.SavedModel() 
    sm.ParseFromString(file_read) 

    with tf.Graph().as_default() as dumyGraph :  
        tf.import_graph_def(sm.meta_graphs[0].graph_def) 
        operations = dumyGraph.get_operations() 
        operations =[op for op in operations if op.type!="NoOp"] 
        input_node = (dumyGraph.get_tensor_by_name(operations[0].name+":0")) 
        output_node = (dumyGraph.get_tensor_by_name(operations[-1].name+":0")) 
        outputName = output_node.name.split("/")[-1] 

        #Convert to trt engine 
        if iTrtConvert == 1: 
            converter = trt.TrtGraphConverter(input_saved_model_dir=oModelDir, 
                                              nodes_blacklist=[outputName], 
                                              max_batch_size=iBatchSize, 
                                    max_workspace_size_bytes=2000,precision_mode="FP16")  
            trt_graph = converter.convert() 

            with tf.Graph().as_default() as dumyGraph: 
                tf.import_graph_def(trt_graph) 
                input_node = dumyGraph.get_tensor_by_name(operations[0].name+":0") 
                output_node = dumyGraph.get_tensor_by_name(operations[-1].name+":0") 
            #Save model if needed 
            #converter.save("./saved_model/trt_model/")                 

        #Start a session                 
        sess = None 
        with tf.device("/device:GPU:0"): 
            cfg = dict({'allow_soft_placement': True,'log_device_placement': False}) 
            cfg['gpu_options'] = tf.GPUOptions(per_process_gpu_memory_fraction = 0.3,  allow_growth = True) 
            cfg['allow_soft_placement'] = False 
            cfg['device_count'] = {'GPU': 1} 
            sess =tf.compat.v1.Session(graph=dumyGraph, config = tf.compat.v1.ConfigProto(**cfg)) 
 
        #Warmup run 
        output = sess.run([output_node], feed_dict={input_node: np.array(data)})     
        start_timer = timer() 
        for i in range(iNumTimesToRun): 
            output = sess.run([output_node], feed_dict={input_node: np.array(data)})     
        end_timer = timer() 
        total_time = end_timer - start_timer 
        average_time = total_time/float(iNumTimesToRun) 
        print (len(output[0])) 
        print ("Total time : ", total_time) 
        print ("Average time(ms)/image : ", (average_time/iBatchSize)*1000) 
        print ("FPS : ", 1/(average_time/iBatchSize))

Hi @krupa.gopal ,
Could you please share your ONNX model and the logs with us so that we can try to reproduce this on our end?
Thanks!

I’m uploading a pre-trained model that shows similar behavior, as I can’t share the exact model we are using. Google Drive link: saved_model_zip
(I was getting an upload error when trying to upload it here.)
I have also modified the code above to match exactly what I’m running.

Running the code with
python inference.py 16 1000 0 —> will run without trt conversion and CPU memory is around 2.6 GB

python inference.py 16 1000 1 —> will run with trt conversion and CPU memory is around 8.5 GB

Replacing oModelDir with “saved_model/trt_model” loads the saved TRT model statically; running it with TRT conversion turned off:
python inference.py 16 1000 0 —> will run static trt converted model and CPU memory is around 6.6 GB

Hi,

Any update on the above issue?

Hi @krupa.gopal,

Sorry for the delayed response. When we ran the script you shared, we observed the following output. Based on this output, performance appears to improve with TRT conversion. Could you please let us know which method you are using to measure CPU memory usage?

$ python inference.py 16 1000 0
Total time : 126.4752559561748
Average time(ms)/image : 7.904703497260926
FPS : 126.50695884374562

$ python inference.py 16 1000 1
Total time : 69.24931123992428
Average time(ms)/image : 4.328081952495268
FPS : 231.04922942216248

Thank you.

Hi,

Thanks for the response.

This is run through Nvidia TF docker - (Nvidia Release 20.02-tf1)
I use “docker stats” to note down the CPU memory usage of the Nvidia TF docker image that is running - specifically the “MEM USAGE / LIMIT” and “MEM %” columns
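
As a cross-check on the “docker stats” figures, the inference process can also report its own memory from inside the script. The following is only a minimal sketch using the Python standard library; the helper names peak_rss_mib and current_rss_mib are hypothetical, and the two numbers need not match “docker stats” exactly, since that command reports usage for the whole container rather than one process.

```python
import os
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def current_rss_mib():
    """Return the current resident set size in MiB (Linux only, via /proc)."""
    with open("/proc/self/statm") as f:
        # Second field of statm is the number of resident pages.
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 * 1024)


if __name__ == "__main__":
    print("Peak RSS (MiB):", round(peak_rss_mib(), 1))
    if os.path.exists("/proc/self/statm"):
        print("Current RSS (MiB):", round(current_rss_mib(), 1))
```

Calling these helpers once before the session is created and once after the warmup run would show how much of the host memory is attributable to graph conversion and engine creation.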

Hi @krupa.gopal,

At inference time, TF-TRT keeps one (or more) ICudaEngine and IExecutionContext objects in memory. The engine stores a copy of all the weights, which is expected to increase the memory consumption by 50%.

Could you please try the following and share the output logs with us?

  • Try to serialize engine to a plan file and use trtexec to run it.

You can then look at the TRT-only memory utilization.
For example:

[04/07/2021-10:53:37] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[04/07/2021-10:53:43] [I] [TRT] Total Host Persistent Memory: 10400
[04/07/2021-10:53:43] [I] [TRT] Total Device Persistent Memory: 222743552
[04/07/2021-10:53:43] [I] [TRT] Total Scratch Memory: 262144
[04/07/2021-10:53:43] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 392 MiB, GPU 4 MiB
[04/07/2021-10:53:43] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1321, GPU 1801 (MiB)
[04/07/2021-10:53:43] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 1321 MiB, GPU 1775 MiB
[04/07/2021-10:53:43] [I] Engine built in 81.9529 sec.
[04/07/2021-10:53:43] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1228, GPU 1785 (MiB)
[04/07/2021-10:53:43] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 1229, GPU 1793 (MiB)

For your reference, see the trtexec documentation.

Thank you.

Hi,

Thanks for the reply.

Yes, we have tried using the TensorRT library directly, and it does show lower CPU memory consumption. But TensorFlow is still the preferable inference engine for us (mainly for ease of use and scalability).

  1. The memory consumption is not just 50% more - it’s on the order of 3x.
  2. By default maximum_cached_engines is 1. Are multiple engines being cached anyway, and is that what contributes to the CPU memory usage?
  3. Is this something Nvidia plans to take a look at and resolve in its TF integration?
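
To make the gap concrete, here is a small sketch of the arithmetic using the “docker stats” figures quoted earlier in this thread (2.6 GB without TRT, 8.5 GB with dynamic conversion, 6.6 GB with the statically loaded TRT model). The variable names are only illustrative.

```python
# CPU memory figures reported earlier in this thread, in GB.
baseline_gb = 2.6      # plain TensorFlow, no TRT conversion
dynamic_trt_gb = 8.5   # TRT conversion done at run time
static_trt_gb = 6.6    # previously saved TRT model loaded from disk

# What a 50% increase over the baseline would look like.
expected_gb = baseline_gb * 1.5

dynamic_ratio = dynamic_trt_gb / baseline_gb
static_ratio = static_trt_gb / baseline_gb

print("Expected with +50%% overhead: %.1f GB" % expected_gb)
print("Dynamic TRT / baseline: %.2fx" % dynamic_ratio)
print("Static TRT / baseline:  %.2fx" % static_ratio)
```

Both measured ratios are well above the 1.5x that a single extra copy of the weights would explain.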

Hi @krupa.gopal,

Could you please let us know how many engines are created? If it is a large number, that might explain the overhead. If we increase minimum_segment_size, the number of engines decreases, and therefore the memory usage decreases (both host and device). Many small engines are bad for performance anyway.
We will get back to you regarding the remaining queries.
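
One way to answer the “how many engines” question is to count the TRTEngineOp nodes in the converted graph. The sketch below is an assumption about how this could be done, not an official procedure: count_trt_engine_ops is a hypothetical helper, and it expects a GraphDef-like object such as the trt_graph returned by converter.convert() in the script above. It also looks inside the graph's function library, if present, since some TensorFlow versions place converted ops there.

```python
from collections import Counter


def count_trt_engine_ops(graph_def):
    """Return (number of TRTEngineOp nodes, total number of nodes).

    graph_def is expected to expose a `.node` sequence whose items have an
    `.op` attribute, as a TensorFlow GraphDef does.
    """
    nodes = list(graph_def.node)

    # Also inspect nodes inside the function library, when there is one.
    library = getattr(graph_def, "library", None)
    for func in getattr(library, "function", []):
        nodes.extend(func.node_def)

    op_counts = Counter(node.op for node in nodes)
    return op_counts["TRTEngineOp"], len(nodes)


# Usage with the script above (after `trt_graph = converter.convert()`):
#     num_engines, num_nodes = count_trt_engine_ops(trt_graph)
#     print("TRT engines:", num_engines, "of", num_nodes, "nodes")
```

Running this once with the default minimum_segment_size and once with a larger value would show whether the number of engines, and with it the memory usage, actually drops.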

Thank you.

Hi @krupa.gopal,

Please allow us some time to work on this issue.

Thank you.

Sure.

Just wanted to follow up on the previous reply:
maximum_cached_engines is 1
minimum_segment_size is the default value - 3
use_function_backup - setting this variable to False (default is True) did not work and gave an error