How to load a TensorRT engine directly without building it at runtime


I am using the onnx_tensorrt library to convert an ONNX model to a TensorRT engine at runtime. Because the engine is built at runtime, it takes more than 4 minutes to complete. So I want to use a TensorRT engine file directly, without building it at runtime. For this, I have converted the ONNX model to a TensorRT engine .plan file offline, but I don't know how to use it directly in my code. Please help me with it.

import onnx
import onnx_tensorrt.backend as backend
import numpy as np

model = onnx.load("/path/to/model.onnx")
# prepare() builds the TensorRT engine from the ONNX model at runtime (the slow step)
engine = backend.prepare(model, device='CUDA:1')
input_data = np.random.random(size=(32, 3, 224, 224)).astype(np.float32)
output_data = engine.run(input_data)[0]

Currently I am using the code above. Now I have the .plan file ready. Please let me know how to run inference with the .plan file directly. Thanks.
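A minimal sketch of loading the .plan file directly instead of calling backend.prepare(), assuming the tensorrt Python package (7.x/8.x) and PyCUDA are installed. The names read_plan and load_engine are hypothetical helpers for illustration. A serialized engine is generally only valid for the TensorRT version and GPU model it was built with; if either differs, deserialize_cuda_engine returns None instead of an engine.

```python
import os


def read_plan(engine_path: str) -> bytes:
    """Return the raw bytes of a serialized TensorRT engine (.plan) file."""
    if not os.path.isfile(engine_path):
        raise FileNotFoundError(f"engine file not found: {engine_path}")
    with open(engine_path, "rb") as f:
        return f.read()


def load_engine(engine_path: str):
    """Deserialize a .plan file into an engine without rebuilding it (hypothetical helper).

    Returns (runtime, engine) so the caller keeps the runtime alive
    for as long as the engine is in use.
    """
    # Imported here so read_plan() above is usable without a GPU stack installed.
    import pycuda.autoinit  # noqa: F401  creates a CUDA context on first import
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    engine = runtime.deserialize_cuda_engine(read_plan(engine_path))
    if engine is None:
        # TensorRT logs the reason; a version or GPU mismatch is a common cause
        raise RuntimeError(f"failed to deserialize {engine_path}")
    return runtime, engine
```

Usage would be along the lines of `runtime, engine = load_engine("model.plan")` (path hypothetical). This only replaces the slow build step: the engine still needs an execution context plus allocated input and output buffers before it can run.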


You can find an example of deploying an ONNX model directly with the TensorRT API below:


I am getting the following error:

  File "", line 42, in Inference
    np.copyto(host_inputs[0], image.ravel())
IndexError: list index out of range

Source code:

import cv2
import time
import numpy as np
import tensorrt as trt
import pycuda.autoinit
import pycuda.driver as cuda
import os

EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
TRT_LOGGER = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(TRT_LOGGER)

# Host/device buffer lists (still empty at this point)
host_inputs = []
cuda_inputs = []
host_outputs = []
cuda_outputs = []
bindings = []

def load_engine(trt_runtime, engine_path):
    # Read the serialized engine and deserialize it (no rebuild)
    with open(engine_path, 'rb') as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)
trt_engine_path = "onnxx.plan"
trt_engine = load_engine(trt_runtime, trt_engine_path)

def Inference(engine):
    image = cv2.imread("1.jpeg")

    # HWC -> CHW, scaled to [-1, 1]
    image = (2.0 / 255.0) * image.transpose((2, 0, 1)) - 1.0

    np.copyto(host_inputs[0], image.ravel())
    stream = cuda.Stream()
    context = engine.create_execution_context()

    start_time = time.time()
    cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)
    context.execute_async(bindings=bindings, stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)
    print("execute times " + str(time.time() - start_time))

    output = host_outputs[0].reshape(np.concatenate(([1], engine.get_binding_shape(1))))



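The error is caused by host_inputs: it is still an empty list when Inference() runs, so host_inputs[0] raises IndexError.
It seems that you don't allocate the buffers for host_inputs, cuda_inputs, host_outputs and cuda_outputs (and bindings is never filled either).

Please check the buffer-allocation part of the sample code and add it to your source.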
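A minimal sketch of that missing allocation step, assuming TensorRT 7.x/8.x (the binding-based API such as get_binding_shape, binding_is_input and execute_async_v2, which TensorRT 10 removed), PyCUDA, static input shapes, and an image already resized to the engine's input resolution. The helper names preprocess, allocate_buffers and infer are hypothetical. It follows the common pattern from the TensorRT Python samples: one pinned host buffer plus one device buffer per binding, with every device pointer appended to bindings. It also calls stream.synchronize() before reading the output; without it the asynchronous copies may not have finished, and a timer stopped before synchronizing only measures how long the calls took to enqueue. It is safest to import pycuda.autoinit before deserializing the engine, as the question's code does.

```python
import numpy as np


def preprocess(image_bgr: np.ndarray) -> np.ndarray:
    """HWC uint8 image -> flattened CHW float32 array scaled to [-1, 1].

    Same scaling as the question's code; resizing to the engine's input
    resolution is assumed to have happened already.
    """
    chw = image_bgr.transpose((2, 0, 1)).astype(np.float32)
    return ((2.0 / 255.0) * chw - 1.0).ravel()


def allocate_buffers(engine):
    """Allocate one pinned host buffer and one device buffer per binding (hypothetical helper).

    Assumes a TensorRT 7.x/8.x engine with static shapes.
    """
    import pycuda.autoinit  # noqa: F401  creates a CUDA context on first import, no-op afterwards
    import pycuda.driver as cuda
    import tensorrt as trt

    host_inputs, cuda_inputs, host_outputs, cuda_outputs, bindings = [], [], [], [], []
    for binding in engine:  # iterates over binding names
        size = trt.volume(engine.get_binding_shape(binding))
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        host_mem = cuda.pagelocked_empty(size, dtype)  # pinned host memory
        cuda_mem = cuda.mem_alloc(host_mem.nbytes)     # device memory of the same size
        bindings.append(int(cuda_mem))                 # TensorRT expects raw device pointers
        if engine.binding_is_input(binding):
            host_inputs.append(host_mem)
            cuda_inputs.append(cuda_mem)
        else:
            host_outputs.append(host_mem)
            cuda_outputs.append(cuda_mem)
    return host_inputs, cuda_inputs, host_outputs, cuda_outputs, bindings


def infer(engine, image_bgr: np.ndarray) -> np.ndarray:
    """Run one inference and return a copy of the first output as a flat array."""
    import pycuda.driver as cuda

    host_inputs, cuda_inputs, host_outputs, cuda_outputs, bindings = allocate_buffers(engine)
    context = engine.create_execution_context()
    stream = cuda.Stream()

    np.copyto(host_inputs[0], preprocess(image_bgr))  # sizes must match the input binding
    cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)
    # Explicit-batch engines (e.g. parsed from ONNX) use the _v2 call in TensorRT 7/8
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)
    stream.synchronize()  # wait for the copies and the inference to finish
    return host_outputs[0].copy()
```

In a real application the buffers and execution context would normally be created once and reused across images rather than per call; they are created inside infer() here only to keep the sketch short. If a shaped output is needed, the flat result can be reshaped with the output binding's shape, as the question's code does.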
