I am trying to use TensorRT optimization profiles (IOptimizationProfile) to build a single engine, converted from TensorFlow via ONNX, that can be used with different image matrix sizes.
The conversion and build work, and inference with the different profiles works as well. What confuses me is the CUDA memory used by the IExecutionContext, which I track with cudaMemGetInfo. To build the engine from the ONNX model I use custom code, since trtexec does not seem to allow specifying more than one optimization profile. For the optimization profiles I use setExtraMemoryTarget(0) to keep additional memory as low as possible (in my experiments, however, this setting did not really make a difference).
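A minimal sketch of this kind of measurement, written in Python for brevity. It is not the actual code used here: `measure_device_memory`, `create` and `get_free_bytes` are hypothetical names, and the helper simply reports the drop in free device memory across one call. It assumes nothing else allocates or frees GPU memory in between, and it reports MiB (1024 * 1024 bytes).

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

MIB = 1024 * 1024


def measure_device_memory(
    create: Callable[[], T],
    get_free_bytes: Callable[[], int],
) -> Tuple[T, float]:
    """Run create() and return (result, MiB of device memory it consumed).

    get_free_bytes is a hypothetical callable that returns the currently
    free device memory in bytes, e.g. the first value reported by
    cudaMemGetInfo.
    """
    free_before = get_free_bytes()
    result = create()  # e.g. creating the execution context
    free_after = get_free_bytes()
    return result, (free_before - free_after) / MIB
```

With the TensorRT and PyCUDA Python bindings, `create` could be `engine.create_execution_context` and `get_free_bytes` could be `lambda: pycuda.driver.mem_get_info()[0]`; both are assumptions about the setup, not something shown in this post.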
For instance, for my custom model and a single optimization profile with a shape of (1,1,896,896), the context uses 264 MB of CUDA memory.
For the two shapes (and thus two optimization profiles) (1,1,896,896) and (1,1,768,768) I get a CUDA memory consumption for the context of 633 MB, which is MORE than double the memory used by the single (larger) profile.
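To make the comparison explicit, the arithmetic on the two measurements above can be written out as a short snippet. The variable names are illustrative only; the numbers are the ones reported in this post.

```python
# Context memory reported above, in MB.
single_profile_mb = 264  # one profile: (1,1,896,896)
two_profiles_mb = 633    # two profiles: (1,1,896,896) and (1,1,768,768)

# Upper bound one might naively expect: two copies of the larger profile.
doubled = 2 * single_profile_mb

# How far the two-profile context exceeds that bound.
excess = two_profiles_mb - doubled

# Growth factor relative to the single-profile context.
ratio = two_profiles_mb / single_profile_mb

print(f"2 x single profile: {doubled} MB")
print(f"excess over that:   {excess} MB")
print(f"growth factor:      {ratio:.2f}x")
```

So the two-profile context is 105 MB above twice the single-profile figure (528 MB), even though the second profile is the smaller of the two shapes.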
What is going on here? Is this expected behavior? What is the point of optimization profiles if I get lower memory consumption (setting aside the memory taken up by the model itself) by simply building and loading two separate engines instead of one engine with two optimization profiles?
Is there any further documentation on how to use optimization profiles, other than the C++ API reference manual?
TensorRT Version: 7.2.3
GPU Type: Quadro P2000
Nvidia Driver Version: 184.108.40.20606
CUDA Version: 11.0
CUDNN Version: 8.1
Operating System + Version: Win10
Python Version (if applicable): 3.6
TensorFlow Version (if applicable): 1.15.3