I’m measuring inference times for a MobileNetV2 model.
I’m comparing two different procedures:
1 - I build a TensorRT engine from an ONNX file and run inference directly with that engine.
2 - I build a TensorRT engine from an ONNX file, serialize the engine and save it to a file. Then I deserialize the engine from that file and run inference.
I found that inference with the deserialized engine is much slower than inference with the engine that was never serialized and reloaded.
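Roughly, the comparison looks like the sketch below (simplified: the file names are placeholders, `run_inference` stands in for the shared inference function, and each engine gets one untimed warm-up call before the measured runs):

```python
import time

import numpy as np
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
dummy_input = np.random.random((1, 224, 224, 3)).astype(np.float32)

def average_latency(engine, n_runs=100):
    # One untimed warm-up call, then the mean over n_runs executions.
    run_inference(engine, dummy_input)
    start = time.perf_counter()
    for _ in range(n_runs):
        run_inference(engine, dummy_input)
    return (time.perf_counter() - start) / n_runs

# Case 1: build the engine from ONNX and use it directly.
engine = build_engine_onnx('mobilenetv2.onnx')
print('built engine:', average_latency(engine))

# Case 2: serialize to a plan file, deserialize it, then run inference.
save_engine(engine, 'mobilenetv2.plan')
runtime = trt.Runtime(TRT_LOGGER)
deserialized_engine = load_engine(runtime, 'mobilenetv2.plan')
print('deserialized engine:', average_latency(deserialized_engine))
```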
The function for building the engine is:
```python
import tensorrt as trt

import common  # helper module from the TensorRT Python samples (EXPLICIT_BATCH, GiB)

def build_engine_onnx(model_file):
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(common.EXPLICIT_BATCH) as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
        config = builder.create_builder_config()
        # The workspace limit belongs on the builder config when build_engine(network, config) is used.
        config.max_workspace_size = common.GiB(1)
        # Load the ONNX model and parse it in order to populate the TensorRT network.
        with open(model_file, 'rb') as model:
            if not parser.parse(model.read()):
                print('ERROR: Failed to parse the ONNX file.')
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return None
        network_inputs = [network.get_input(i) for i in range(network.num_inputs)]
        input_names = [_input.name for _input in network_inputs]
        # Single optimization profile with a fixed 1x224x224x3 input shape.
        profile = builder.create_optimization_profile()
        profile.set_shape('input_2:0', (1, 224, 224, 3), (1, 224, 224, 3), (1, 224, 224, 3))
        config.add_optimization_profile(profile)
        return builder.build_engine(network, config)
```
The function for saving the engine is:
```python
def save_engine(engine, file_name):
    # Serialize the engine and write the plan to disk.
    buf = engine.serialize()
    with open(file_name, 'wb') as f:
        f.write(buf)
```
The function for loading the engine is:
```python
def load_engine(trt_runtime, plan_path):
    # Read the plan file and deserialize it back into an engine.
    with open(plan_path, 'rb') as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine
```
The function I’m using for inference is the same in both cases.
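For reference, a minimal sketch of such an inference function, following the usual TensorRT/PyCUDA pattern (explicit batch, `execute_async_v2`); it assumes every binding has a fully specified shape, as with the fixed 1x224x224x3 profile above, and the details may differ from the code I actually run:

```python
import numpy as np
import pycuda.autoinit  # creates and activates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

def run_inference(engine, input_batch):
    # Allocate one host/device buffer pair per engine binding.
    stream = cuda.Stream()
    host_bufs, dev_bufs, bindings = [], [], []
    for binding in engine:  # iterates over binding names
        shape = engine.get_binding_shape(binding)
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        host_mem = cuda.pagelocked_empty(trt.volume(shape), dtype)
        dev_mem = cuda.mem_alloc(host_mem.nbytes)
        host_bufs.append(host_mem)
        dev_bufs.append(dev_mem)
        bindings.append(int(dev_mem))

    with engine.create_execution_context() as context:
        # Copy the input to the device, run the engine, copy the outputs back.
        outputs = []
        for i, binding in enumerate(engine):
            if engine.binding_is_input(binding):
                np.copyto(host_bufs[i], input_batch.ravel())
                cuda.memcpy_htod_async(dev_bufs[i], host_bufs[i], stream)
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        for i, binding in enumerate(engine):
            if not engine.binding_is_input(binding):
                cuda.memcpy_dtoh_async(host_bufs[i], dev_bufs[i], stream)
                outputs.append(host_bufs[i])
        stream.synchronize()
    return outputs
```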