Reducing inference latency on AGX Xavier


I am working on running inference with a custom neural network model. The model is in FP32 and has roughly 4 MB of parameters. Currently I am getting an inference time of around 80 ms. The inputs to the model arrive over the network, and I am using ZeroMQ to receive them.

Please see the code snippet below.

import time

import common  # helper module from the TensorRT Python samples

time0 = time.time()
# The engine is deserialized and the buffers are allocated on every call here.
with get_engine(onnx_file_path, engine_file_path) as engine, engine.create_execution_context() as context:
    inputs, outputs, bindings, stream = common.allocate_buffers(engine)
    print('Running inference on image {}...'.format(input_image_path))
    # Set the host input to the image. common.do_inference_v2 copies the
    # input to the GPU before executing.
    inputs[0].host = rand_img
    trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
trt_outputs = [output.reshape(shape) for output, shape in zip(trt_outputs, output_shapes)]

time1 = time.time()


The TensorRT engine is being deserialized from the engine file every time an input image is received. I believe this is hurting the inference times.

Is there a way to read the engine once and then run inference each time an input image arrives?

Note: I am running the inference in MAXN mode on the AGX Xavier.


In general, you just need to refresh the data in the input buffers for each new image.
You don't need to create the buffers and deserialize the engine each time; do that once at startup and reuse them for every request.
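
A minimal sketch of that restructuring, reusing the get_engine and common helpers from your script. The ZeroMQ part is an assumption: a hypothetical REP socket on port 5555 whose peer sends each image as a flat FP32 buffer, so adapt the socket pattern and the decoding to your actual protocol.

import time

import numpy as np
import zmq

import common  # helper module from the TensorRT Python samples

# One-time setup: deserialize the engine, create the execution context,
# and allocate host/device buffers. None of this is repeated per image.
engine = get_engine(onnx_file_path, engine_file_path)
context = engine.create_execution_context()
inputs, outputs, bindings, stream = common.allocate_buffers(engine)

# Hypothetical ZeroMQ setup; match this to however you receive inputs.
zmq_context = zmq.Context()
socket = zmq_context.socket(zmq.REP)
socket.bind('tcp://*:5555')

while True:
    msg = socket.recv()  # one image per message (assumption)
    img = np.frombuffer(msg, dtype=np.float32)  # assumes FP32 data sized to the input binding

    time0 = time.time()
    # Per-image work is now just: refresh the host input buffer and run.
    inputs[0].host = img
    trt_outputs = common.do_inference_v2(
        context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    trt_outputs = [out.reshape(shape) for out, shape in zip(trt_outputs, output_shapes)]
    time1 = time.time()

    socket.send(b'ok')  # a REP socket must reply before the next recv
    print('Inference took {:.1f} ms'.format((time1 - time0) * 1000))

With the deserialization and buffer allocation hoisted out of the loop, the per-image cost is only the host-to-device copy, the kernel execution, and the device-to-host copy.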