Hi,
I am working on running inference with a custom neural network model. The model is in FP32 and has roughly 4 MB of parameters. Currently I am getting an inference time of around 80 ms. The inputs to the model arrive over the network, and I am using ZeroMQ to receive them.
Please see the code snippet below.
//
time0 = time.time()
with get_engine(onnx_file_path, engine_file_path) as engine, engine.create_execution_context() as context:
    inputs, outputs, bindings, stream = common.allocate_buffers(engine)
    print('Running inference on image {}...'.format(input_image_path))
    # Set host input to the image. The common.do_inference_v2 function
    # will copy the input to the GPU before executing.
    inputs[0].host = rand_img
    trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    trt_outputs = [output.reshape(shape) for output, shape in zip(trt_outputs, output_shapes)]
    np.save(output_path, trt_outputs)
time1 = time.time()
//
The TensorRT engine is being deserialized from engine_file_path every time an input image is received, and I believe this is hurting the inference times.
Is there a way to read the engine once and then perform inference each time an input image arrives?
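The restructuring I have in mind is to hoist the one-time setup (deserializing the engine, creating the execution context, allocating buffers) out of the per-image path, so only the copy-and-execute work happens per request. Below is a minimal, self-contained sketch of that pattern; the TensorRT/ZeroMQ specifics are replaced by a dummy loader and a plain loop so the structure is clear, and the comments mark where the real calls (get_engine, common.allocate_buffers, common.do_inference_v2, the ZeroMQ recv) would go.

```python
import time

class InferenceServer:
    """Pays the engine-loading cost once, at construction time."""

    def __init__(self, engine_loader):
        # One-time cost, paid at startup rather than per image.
        # Real code: engine = get_engine(onnx_file_path, engine_file_path)
        #            context = engine.create_execution_context()
        #            inputs, outputs, bindings, stream = common.allocate_buffers(engine)
        self.engine = engine_loader()

    def infer(self, image):
        # Per-image cost only.
        # Real code: inputs[0].host = image
        #            trt_outputs = common.do_inference_v2(context, bindings=bindings,
        #                                                 inputs=inputs, outputs=outputs,
        #                                                 stream=stream)
        return self.engine(image)

load_count = 0

def dummy_loader():
    """Stands in for the slow engine deserialization."""
    global load_count
    load_count += 1
    time.sleep(0.01)           # simulate the expensive load
    return lambda x: x * 2     # stands in for the network itself

server = InferenceServer(dummy_loader)

# Stands in for the ZeroMQ receive loop: in real code each iteration
# would be `image = socket.recv(...)` followed by server.infer(image).
results = [server.infer(img) for img in range(5)]
print(load_count, results)
```

The point is that the loader runs exactly once no matter how many images are processed; in the snippet above, the same would mean keeping the `with get_engine(...)` block open around the entire receive loop instead of re-entering it per image.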
Note: I am running the inference in MAXN mode on the AGX.