TensorRT results in reduced accuracy and performance


Using TensorRT on a trained model is resulting in a significant decrease in both performance and accuracy. Inference time has increased from ~10 ms/frame to ~90 ms/frame, and the output differs noticeably from the original model's. Are there parameters I should be looking to change in order to improve accuracy and/or performance?


TensorRT Version:
GPU Type: Tesla P100
Nvidia Driver Version: 418.152.00
CUDA Version: 10.1
CUDNN Version: 7.6.5
Operating System + Version: CentOS 8
Python Version (if applicable): 3.6
TensorFlow Version (if applicable): 1.11.0
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

To generate pb file:

import tensorflow as tf
from tensorflow.python.framework import graph_io

def keras_to_frozen_pb(model_in_path,
                       model_out_path,
                       custom_object_dict=None,
                       tensor_out_name=None,
                       tensorboard_dir=None):
    """Convert a Keras model to a frozen pb model.

    Args:
        model_in_path (str): Input model path (.h5)
        model_out_path (str): Output model path (dir)
        tensor_out_name (str, optional): Name of the output tensor.
                                         If None, the default tensor name is taken
                                         from the Keras model. Defaults to None.
        tensorboard_dir (str, optional): Output TensorBoard dir path for inspecting
                                         the output model graph. If None, no
                                         TensorBoard graph is written. Defaults to None.
    """
    graph = tf.Graph()
    with graph.as_default():
        sess = tf.Session()

        # Load the model into the graph and session.
        model = tf.keras.models.load_model(model_in_path, custom_objects=custom_object_dict)
        print("Detected Inputs: " + str(model.inputs))
        print("Detected Outputs: " + str(model.outputs))

        # Get the tensor_out_name.
        if tensor_out_name is None:
            if len(model.outputs) > 1:
                raise NameError("The model has multiple output tensors. Need to specify an output tensor name.")
            tensor_out_name = model.outputs[0].name.split(":")[0]

        # Freeze the graph.
        graphdef = tf.graph_util.convert_variables_to_constants(sess, graph.as_graph_def(), [tensor_out_name])
        graphdef = tf.graph_util.remove_training_nodes(graphdef)
        graph_io.write_graph(graphdef, './', model_out_path, as_text=False)

    # Output the TensorBoard graph.
    if tensorboard_dir is not None:
        tf.summary.FileWriter(logdir=tensorboard_dir, graph_def=graphdef)

    return tensor_out_name
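For completeness, the frozen .pb still has to be converted to UFF before build_engine_uff can parse it. A minimal sketch of that intermediate step, assuming the `uff` package that ships with TensorRT's Python bindings (the path helper is just an illustration):

```python
import os

def default_uff_path(pb_path):
    # Derive a .uff output path next to the frozen .pb file.
    return os.path.splitext(pb_path)[0] + ".uff"

def pb_to_uff(pb_path, output_node_name, uff_path=None):
    # Deferred import so the helper above stays usable without TensorRT installed.
    import uff
    if uff_path is None:
        uff_path = default_uff_path(pb_path)
    # Convert the frozen graph; output_node_name is the value returned
    # by keras_to_frozen_pb above.
    uff.from_tensorflow_frozen_model(pb_path, [output_node_name], output_filename=uff_path)
    return uff_path
```

The resulting file path is what gets passed as `model_file` to build_engine_uff below.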

To build engine:

def build_engine_uff(model_file, algo_params_dict):
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
        builder.max_workspace_size = GiB(3)
        builder.max_batch_size = 5

        # We need to manually register the input and output nodes for UFF.
        parser.register_input(algo_params_dict["INPUT_NAME"], tuple(algo_params_dict["INPUT_SHAPE"]))
        parser.register_output(algo_params_dict["OUTPUT_NAME"])  # assumes the dict also carries the output node name
        # Load the UFF model and parse it in order to populate the TensorRT network.
        if not parser.parse(model_file, network):
            raise RuntimeError("Failed to parse UFF model: " + model_file)
        # Build and return an engine.
        return builder.build_cuda_engine(network)

To run inference:

def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream.
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]
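do_inference assumes each input/output pairs a pagelocked host buffer with a device allocation. A sketch of that allocation step along the lines of common.py from the TensorRT Python samples, assuming pycuda (the `volume` helper and class names are illustrative, not from the original post):

```python
import functools
import operator

def volume(shape):
    # Number of elements in a binding shape.
    return functools.reduce(operator.mul, shape, 1)

class HostDeviceMem:
    # Pairs a pagelocked host buffer with its device allocation.
    def __init__(self, host, device):
        self.host = host
        self.device = device

def allocate_buffers(engine):
    # Deferred imports so volume()/HostDeviceMem stay importable without a GPU.
    import pycuda.driver as cuda
    import tensorrt as trt
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for binding in engine:
        size = volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(device_mem))
        target = inputs if engine.binding_is_input(binding) else outputs
        target.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream
```

The four return values map directly onto the bindings/inputs/outputs/stream arguments of do_inference above.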

Hi @Alex.Watras,
Please check the parameters mentioned in the link below to optimize performance.
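On the builder side, the knobs that usually matter most are the workspace size, the batch size, and the precision mode. A hedged sketch against the pre-TensorRT-7 Python API used above (`builder.fp16_mode` was removed in later releases; always re-validate accuracy after enabling FP16, since it can change outputs slightly):

```python
def GiB(val):
    # Gibibytes to bytes, for max_workspace_size.
    return int(val) * (1 << 30)

def configure_builder(builder, workspace_gib=3, batch_size=5, use_fp16=True):
    # More workspace lets TensorRT consider faster kernel tactics.
    builder.max_workspace_size = GiB(workspace_gib)
    # Engines are optimized for up to this batch size; running smaller
    # batches through a large-batch engine can be slower than expected.
    builder.max_batch_size = batch_size
    if use_fp16 and builder.platform_has_fast_fp16:
        # FP16 halves memory traffic where the hardware supports it.
        builder.fp16_mode = True
    return builder
```

Calling this on the builder before `builder.build_cuda_engine(network)` in build_engine_uff would apply the settings; comparing FP32 and FP16 engines on the same inputs is the quickest way to see whether precision explains the accuracy gap.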