As part of the initialisation step, I run a single warm-up inference before actually activating the system.
After this inference, the RAM usage is sometimes much higher than it should be.
In most cases it is around:
RAM = 6720/7860 (from jtop() stats)
but at times I measure:
RAM = 7614/7860 (from jtop() stats)
or higher.
When this happens, inference times climb to 1 [sec], 7 [sec], or more.
Any idea why I see such differences in RAM usage after the first inference? (This is a unit test, so it uses the same image, the same data, and of course the same code each time.)
Is there something else I should measure or restrict?
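For reference, a minimal sketch of how the same RAM value can be sampled from the jetson-stats Python API (the "RAM" key name is an assumption and can differ between jetson-stats versions):

```python
# Rough sketch: sample the RAM value that jtop reports.
# The "RAM" key name is an assumption; it varies with the jetson-stats version.
from jtop import jtop

with jtop() as jetson:
    if jetson.ok():
        print(jetson.stats.get("RAM"))
```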
We recommend using pure TensorRT for inference instead.
TF-TRT uses the TensorFlow interface, so by default it occupies most of the GPU memory to allow faster algorithms.
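If you need to stay on the TF-TRT path for now, one thing you can restrict is that default allocation; a minimal sketch, assuming TensorFlow 2.x:

```python
# Sketch: let TensorFlow grow GPU memory on demand instead of
# reserving most of it up front (TensorFlow 2.x API).
import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```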
You can try exporting the model as .pb -> .uff (or .onnx) -> .trt to get better performance with pure TensorRT.
Here is an example for your reference:
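A minimal sketch of that path, assuming the .pb has already been converted to model.onnx (for example with the tf2onnx tool) and using the TensorRT 7/8-style Python API (names changed in later releases, so treat this as an outline rather than exact code):

```python
# Sketch: build a TensorRT engine from an ONNX export and save it.
# Assumes model.onnx already exists; API follows TensorRT 7/8 conventions.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path="model.onnx", workspace_mb=256):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the ONNX file into a TensorRT network definition.
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse the ONNX model")

    config = builder.create_builder_config()
    config.max_workspace_size = workspace_mb << 20  # cap TensorRT's scratch memory
    return builder.build_engine(network, config)

if __name__ == "__main__":
    engine = build_engine()
    with open("model.trt", "wb") as f:
        f.write(engine.serialize())  # serialized engine, loadable at inference time
```

The serialized model.trt can then be deserialized with trt.Runtime at start-up, so the warm-up inference runs against a fixed-size engine instead of going through TensorFlow's allocator.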