Manage TensorRT GPU memory conversion usage

Description

Hello everyone,

I recently updated TensorFlow to 2.3.2 and am therefore using TrtGraphConverterV2 to convert my models to TensorRT.

I deploy in environments where I’m not fully in control of the GPU memory, so I need to cap the memory used by the conversion to be sure it does not affect other running processes. In TrtV1, I could limit the GPU memory allocated to the conversion by setting gpu_options in the config, like so:

config.gpu_options.per_process_gpu_memory_fraction

but in TrtV2 this parameter was removed.

Do you know if it is still possible?

Environment

TensorRT Version: 7.2.2.3
CUDA Version: 10.2
cuDNN Version: 8.0
Operating System: Ubuntu 18.04
Python Version: 3.6
TensorFlow Version: 2.3.2
Container image: nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04

Hi @iacopo.breschi,

per_process_gpu_memory_fraction is a TF1 option. It affects TF-TRT indirectly: TF-TRT allocates memory through the TF memory allocator, so any TF memory limit also applies to TF-TRT.
The same is true in TF2: TF-TRT draws its memory from the TF memory budget, so the TF2 memory limit will restrict the memory consumption of TF-TRT.

The following link explains how to set the memory limit in TF2.
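
For readers without the linked page at hand, here is a rough sketch of what a TF2 memory limit can look like. It is not taken from the linked article. It assumes the TF 2.3 API tf.config.experimental.set_virtual_device_configuration, and the helper names fraction_to_limit_mb and limit_gpu_memory are made up for illustration. The first helper translates a TF1-style fraction into the MiB value that TF2 expects; the total GPU memory has to come from somewhere else, for example nvidia-smi.

```python
def fraction_to_limit_mb(total_memory_mb, fraction):
    """Convert a TF1-style memory fraction into a TF2 memory limit in MiB."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return int(total_memory_mb * fraction)


def limit_gpu_memory(limit_mb, gpu_index=0):
    """Cap TensorFlow's allocation on one GPU (hypothetical helper).

    Must be called before TensorFlow initializes the GPU, i.e. before
    any op runs or any model is loaded.
    """
    # Imported lazily so fraction_to_limit_mb has no TensorFlow dependency.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    if not gpus:
        return False  # nothing to limit on a CPU-only machine
    tf.config.experimental.set_virtual_device_configuration(
        gpus[gpu_index],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=limit_mb)],
    )
    return True
```

With a 16 GiB card (an assumed figure), limit_gpu_memory(fraction_to_limit_mb(16384, 0.25)) would ask TensorFlow to stay within roughly 4 GiB. Whether that cap also bounds every allocation made during TF-TRT engine building is exactly the question in this thread, so treat it as something to verify with nvidia-smi.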

Apart from this, TF-TRT has an option to control the workspace size. You can create a conversion parameter object to set it:

from tensorflow.python.compiler.tensorrt import trt_convert as trt

# 1 << 30 bytes = 1 GiB of TensorRT workspace
conv_param = trt.TrtConversionParams(max_workspace_size_bytes=1 << 30)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='/path/to/model_dir',
    conversion_params=conv_param)

This parameter is passed directly to TRT. You can find relevant notes in the FAQ section of the TRT developer guide, under "How do I choose the optimal workspace size?"

To estimate TF-TRT memory usage, note that a TF-TRT converted model has to hold an extra copy of the weights for TRT. So if you have a TF model with 1GiB of weights, when fully converted it would need 2GiB in FP32 mode or 1.5 GiB in FP16 mode (in this sense the converter precision controls the memory usage). On top of this there is some extra memory needed for activation buffers and for the generated TRT engine code.
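
The arithmetic above can be written down as a small helper. This is only a sketch of the weights part of the estimate: it assumes the original TF weights are stored in FP32, and the function name is made up for illustration. Activation buffers, generated engine code and the workspace are not included, so the result is a lower bound.

```python
GIB = 1 << 30

# Bytes per weight element in the extra TRT copy, for the two precision
# modes mentioned above. Assumption: the TF weights themselves are FP32.
TRT_BYTES_PER_WEIGHT = {"FP32": 4, "FP16": 2}


def estimate_weight_memory_bytes(tf_weight_bytes, precision="FP32"):
    """Lower bound: original TF weights plus the extra TRT copy."""
    trt_copy = tf_weight_bytes * TRT_BYTES_PER_WEIGHT[precision] // 4
    return tf_weight_bytes + trt_copy


fp32_total = estimate_weight_memory_bytes(1 * GIB, "FP32")
fp16_total = estimate_weight_memory_bytes(1 * GIB, "FP16")
print(fp32_total / GIB, fp16_total / GIB)  # 2.0 1.5
```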

One thing to keep in mind when switching to TF2 is that the TRT engine is actually created when the first inference runs (dynamic mode=True; there is no static mode in TF2). To trigger engine creation ahead of time, call converter.build() before saving the model (here is an example).
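
A minimal sketch of that convert, build, save sequence follows. It is not the example referred to above. It assumes a model with a single input, and the helper names make_input_fn and convert_build_and_save are made up for illustration. build() takes an input_fn that yields a tuple or list of inputs per call; the shapes should match what the model will see at inference time, otherwise engines may be rebuilt later.

```python
import numpy as np


def make_input_fn(input_shape, num_batches=1, dtype=np.float32):
    """Return a generator function yielding dummy batches for converter.build()."""
    def input_fn():
        for _ in range(num_batches):
            # Each item is a tuple with one entry per model input
            # (a single zero-filled input is assumed here).
            yield (np.zeros(input_shape, dtype=dtype),)
    return input_fn


def convert_build_and_save(saved_model_dir, output_dir, input_shape):
    """Convert a SavedModel, build the TRT engines up front, then save."""
    # Imported lazily: only this function needs TensorFlow with TF-TRT.
    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    converter = trt.TrtGraphConverterV2(input_saved_model_dir=saved_model_dir)
    converter.convert()
    # Engines are built here instead of at the first inference.
    converter.build(input_fn=make_input_fn(input_shape))
    converter.save(output_dir)
```

Building ahead of time moves the memory-hungry optimization step into the conversion process, which can be run separately from the serving process.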

Thank you.

Thanks for the reply @spolisetty,

I’m aware of how to set the memory budget for the runtime, but thanks anyway for the explanation of the weights' memory consumption.

As you mention, we are in dynamic mode, so our model is optimized at the first inference. We have already set the workspace size, but the model optimization takes all the GPU memory our card has.
Our problem is that we want to trade optimization time for memory consumption. This was possible before, but now it seems that the GPU options config has been removed.

Regards

Hi @iacopo.breschi ,

Sorry for the delayed response. Are you still facing this issue?
Could you please confirm whether you have tried the first method mentioned in the article from the previous reply? If you are still facing the issue, please provide more details and the nvidia-smi output.

Thank you.