Reducing inference latency on AGX Xavier


I am working on running inference with a custom neural network model. The model is in FP32 and has roughly 4 MB of parameters. Currently I am getting an inference time of around 80 ms. The inputs to the model arrive over the network, and I am using ZeroMQ to receive them.

Please see the code snippet below.

import time

import common  # helper module from the TensorRT Python samples

time0 = time.time()
# The engine is deserialized and the buffers are allocated on every call here.
with get_engine(onnx_file_path, engine_file_path) as engine, engine.create_execution_context() as context:
    inputs, outputs, bindings, stream = common.allocate_buffers(engine)
    print('Running inference on image {}...'.format(input_image_path))
    # Set the host input to the image. common.do_inference_v2 copies the
    # input to the GPU before executing.
    inputs[0].host = rand_img
    trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
trt_outputs = [output.reshape(shape) for output, shape in zip(trt_outputs, output_shapes)]

time1 = time.time()


The TensorRT engine is being deserialized from the engine file every time an input image is received. I believe this is hurting the inference times.

Is there a way to read the engine once and then run inference each time an input image arrives?

Note: I am running the inference in MAXN mode on the AGX Xavier.


In general, you just need to refresh the data in the input buffers for each new image.
You don't need to create the buffers and deserialize the engine each time; do that once at startup and reuse them for every request.
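
A minimal sketch of that restructuring, reusing the get_engine and common helpers from your script. The ZeroMQ part is an assumption: a hypothetical REP socket on port 5555 whose peer sends each image as a flat FP32 buffer, so adapt the socket pattern and the decoding to your actual protocol.

import time

import numpy as np
import zmq

import common  # helper module from the TensorRT Python samples

# One-time setup: deserialize the engine, create the execution context,
# and allocate host/device buffers. None of this is repeated per image.
engine = get_engine(onnx_file_path, engine_file_path)
context = engine.create_execution_context()
inputs, outputs, bindings, stream = common.allocate_buffers(engine)

# Hypothetical ZeroMQ setup; match this to however you receive inputs.
zmq_context = zmq.Context()
socket = zmq_context.socket(zmq.REP)
socket.bind('tcp://*:5555')

while True:
    msg = socket.recv()  # one image per message (assumption)
    img = np.frombuffer(msg, dtype=np.float32)  # assumes FP32 data sized to the input binding

    time0 = time.time()
    # Per-image work is now just: refresh the host input buffer and run.
    inputs[0].host = img
    trt_outputs = common.do_inference_v2(
        context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    trt_outputs = [out.reshape(shape) for out, shape in zip(trt_outputs, output_shapes)]
    time1 = time.time()

    socket.send(b'ok')  # a REP socket must reply before the next recv
    print('Inference took {:.1f} ms'.format((time1 - time0) * 1000))

With the deserialization and buffer allocation hoisted out of the loop, the per-image cost is only the host-to-device copy, the kernel execution, and the device-to-host copy.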