We’re using the TensorRT 5.0 Python API to run our model and find that it consumes about 2.6 GB of CPU RAM. Breaking this down:
- 1.1 GB is consumed when creating the TRT runtime itself
- an additional 1.5 GB is used after the call to deserialize_cuda_engine.
These numbers do not seem to vary much with the model’s input size or with FP16 vs. FP32. We’ve also tested a C++-only inference engine and get lower, but still very high, memory usage of approximately 1.9 GB. We checked several models whose GPU memory usage ranges from 0.8 to 1.4 GB.
There is a builder setting called max_workspace_size that can affect the amount of GPU memory consumed, but in our case modifying this value did not produce significant differences.
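For reference, this is roughly where max_workspace_size is set when building the engine (a minimal sketch using the TensorRT 5 Python API; the ONNX parser path, the build_engine name and the 256 MB workspace value are illustrative assumptions, not our exact build script):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, workspace_size=256 << 20):
    # workspace_size bounds the GPU scratch memory TensorRT may use per layer;
    # in our case changing it did not make a significant difference.
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network() as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:
        builder.max_workspace_size = workspace_size
        with open(onnx_path, "rb") as f:
            parser.parse(f.read())
        return builder.build_cuda_engine(network)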
My questions are:
- Are these large values expected, or should the memory usage be significantly lower?
- How can we reduce the RAM usage? We are aiming for less than 0.5 GB.
Thanks,
Ran
Details:
Output produced by a line-by-line memory profiler, showing the memory increase per line:
Line # Mem usage Increment Line Contents
85 368.973 MiB 368.973 MiB @profile
86 def load_engine(trt_filename):
87 pass # logger.info("Reading engine from file {}".format(trt_filename))
88 # with open(trt_filename, "rb") as trt_file, trt.Runtime(get_trt_logger()) as runtime:
89 # return runtime.deserialize_cuda_engine(trt_file.read())
90 368.973 MiB 0.000 MiB trt_file = open(trt_filename, "rb")
91 1477.680 MiB 1108.707 MiB runtime = trt.Runtime(get_trt_logger())
92 1537.953 MiB 60.273 MiB trt_file_contents = trt_file.read()
93 3041.938 MiB 1503.984 MiB engine = runtime.deserialize_cuda_engine(trt_file_contents)
94 3041.938 MiB 0.000 MiB trt_file.close()
95 3041.938 MiB 0.000 MiB return engine
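The output format above matches Python’s memory_profiler package. A minimal way to reproduce the measurement, with the engine path as a placeholder and trt.Logger standing in for our get_trt_logger() helper, could look like this:

import tensorrt as trt
from memory_profiler import profile

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)  # stands in for our get_trt_logger() helper

@profile
def load_engine(trt_filename):
    # Both creating the Runtime and calling deserialize_cuda_engine show
    # large CPU RAM increments, as in the table above.
    with open(trt_filename, "rb") as trt_file, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(trt_file.read())

if __name__ == "__main__":
    engine = load_engine("model.trt")  # "model.trt" is a placeholder path

Running it with "python -m memory_profiler load_engine.py" produces the per-line report shown above.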