Correct way to reload an engine to save memory

Hello everybody!

For example, I have two functions that load an engine and run inference with it:

import numpy as np
import tensorrt as trt
import pycuda.autoinit  # creates the CUDA context
import pycuda.driver as cuda

def load_trt_model(file_name, inp_shape, out_shape):
    # Deserialize the engine from file
    with open(file_name, "rb") as f:
        runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
        engine = runtime.deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()
    # Allocate host buffers for input and output
    h_input = np.zeros(inp_shape, dtype=DT)
    h_output = np.zeros(out_shape, dtype=DT)
    # Allocate device memory for inputs and outputs
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    # Create a stream in which to copy inputs/outputs and run inference
    stream = cuda.Stream()
    return engine, context, stream, d_input, d_output, h_output

def predict(X, context, stream, d_input, d_output, h_output):
    # Transfer input data to the device
    cuda.memcpy_htod_async(d_input, X, stream)
    bindings = [int(d_input), int(d_output)]
    # Execute the model
    context.execute_async_v2(bindings, stream.handle, None)
    # Transfer predictions back to the host
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    # Synchronize the stream
    stream.synchronize()
    return h_output

I need to switch to a different model engine on each iteration:

for i in range(20):
    X = DT(np.random.random((BATCH_SIZE, 224, 224, 1)) * 12)
    engine, context, stream, d_input, d_output, h_output = load_trt_model(
        file_name=model_pathes[i], inp_shape=X.shape, out_shape=(BATCH_SIZE, 1000))
    res = predict(X, context, stream, d_input, d_output, h_output)
    d_input.free()
    d_output.free()
    del engine, res

In this case, I see memory usage constantly increasing. If I load more models, I will eventually hit the memory limit. What is my error, and which command should I apply to solve this problem?
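
For reference, this is a minimal sketch of the explicit per-iteration cleanup I am trying (the `del context` line is an assumption on my part that the execution context should be dropped before the engine it was created from):

# Sketch: drop every reference that could pin per-model GPU memory
d_input.free()   # release the device input buffer
d_output.free()  # release the device output buffer
del context      # assumption: drop the execution context first
del engine       # then drop the deserialized engine itself

Even with this, memory still grows across iterations, so I wonder whether something else (the runtime, the stream, or pycuda itself) is keeping these allocations alive.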

Moved to the CUDA category (as it does not use CUPTI APIs).

Did you solve this? I am having the exact same issue.