per_process_gpu_memory_fraction is a TF1 option. It indirectly affects TF-TRT, because TF-TRT allocates memory through the TF memory allocator, so any TF memory limit also applies to TF-TRT.
In TF2 the same is true: TF-TRT draws from the TF memory budget, so the TF2 memory limit will also restrict the memory consumption of TF-TRT.
The following link shows how to set the memory limit in TF2:
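As a minimal sketch (assuming TF 2.x and at least one visible GPU; the 4096 MiB value is an arbitrary example), the TF2 memory limit can be set like this:

```python
import tensorflow as tf

# Cap TF's GPU memory budget; TF-TRT allocates from this same budget.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])  # in MiB
```

Note this must run before any GPUs are initialized (i.e. before the first op touches the device).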
Apart from this, TF-TRT has an option to control the TensorRT workspace size. You can set a conversion parameter to control it:
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Limit the TRT workspace to 1 GiB
conv_param = trt.TrtConversionParams(max_workspace_size_bytes=1<<30)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='/path/to/saved_model',  # adjust to your model
    conversion_params=conv_param)
This parameter is passed directly to TRT; you can find some relevant notes in the FAQ section of the TRT developer guide, under "How do I choose the optimal workspace size?".
To estimate TF-TRT memory usage, note that a TF-TRT converted model holds an extra copy of the weights for TRT. So a TF model with 1 GiB of weights would need roughly 2 GiB when fully converted in FP32 mode, or 1.5 GiB in FP16 mode (in this sense the converter precision controls the memory usage). On top of this, some extra memory is needed for activation buffers and for the generated TRT engine code.
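The weight arithmetic above can be written out as a back-of-the-envelope helper (a hypothetical function for illustration, not part of the TF-TRT API):

```python
def estimate_weight_memory_gib(tf_weights_gib, precision='FP32'):
    """Rough weight-memory estimate for a fully converted TF-TRT model.

    TF keeps its original FP32 copy of the weights; TRT holds an extra
    copy at the conversion precision (FP16 weights take half the space).
    """
    extra_copy_factor = {'FP32': 1.0, 'FP16': 0.5}
    return tf_weights_gib * (1.0 + extra_copy_factor[precision])

print(estimate_weight_memory_gib(1.0, 'FP32'))  # 2.0
print(estimate_weight_memory_gib(1.0, 'FP16'))  # 1.5
```

This ignores activation buffers and engine code, so treat it as a lower bound.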
One thing to keep in mind when switching to TF2 is that TRT engine creation actually happens when the first inference runs (engines are built dynamically; there is no static mode in TF2). To trigger engine creation ahead of time, call converter.build() before the model is saved (here is an example).
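A minimal sketch of pre-building the engines with converter.build(); the paths and the input shape are assumptions, adjust them to your model's signature:

```python
import numpy as np
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='/path/to/saved_model')  # hypothetical path
converter.convert()

def input_fn():
    # Yield sample inputs matching the model signature; build() runs
    # them through the graph so the TRT engines are created now,
    # instead of at the first real inference.
    yield (np.zeros((1, 224, 224, 3), dtype=np.float32),)

converter.build(input_fn=input_fn)
converter.save('/path/to/trt_saved_model')  # engines are serialized here
```

Building requires the target GPU to be available, since the engines are tuned for the device they are built on.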