TensorFlow/TensorRT conversion fails with ERROR: tensorflow.GraphDef exceeded maximum protobuf size of 2GB

Description

When I attempt to convert a TensorFlow saved model, the TrtGraphConverter.convert() log shows the following error:
[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/message_lite.cc:406] tensorflow.GraphDef exceeded maximum protobuf size of 2GB: 2697675801.
Subsequently, TrtGraphConverter.save() produces an empty ‘saved_model.pb’ file.

Environment

TensorRT Version: N/A
GPU Type: Tesla K80
Nvidia Driver Version: 450.80.02
CUDA Version: 11.0
CUDNN Version: V10.0.130
Operating System + Version: Amazon Linux AMI release 2018.03
Python Version (if applicable): 3.7.9
TensorFlow Version (if applicable): 1.15.5
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

tensorrt_compile.log (6.6 KB)
tensorrt_compile.py (816 Bytes)

Steps To Reproduce

Please see the attached files: the script that I ran (tensorrt_compile.py) and its output (tensorrt_compile.log).

Hi, could you please share the model and script so that we can try reproducing the issue at our end?

Also, we recommend checking the sample links below, as they might answer your concern.

Thanks!

Hi @andrei.voinov,

In TF1 the default is a static engine, which is created during conversion; the serialized engine is saved in the graph as a string, and this is what hits protobuf’s 2GB size limit. The solution is to use a dynamic engine, which is enabled by the is_dynamic_op=True converter argument.
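For reference, a minimal sketch of a TF1 dynamic-engine conversion (the saved-model paths and precision mode below are placeholders, not taken from the attached script):

# Minimal TF1 sketch: build TensorRT engines at runtime instead of
# serializing them into the GraphDef (which is what overflows 2GB).
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverter(
    input_saved_model_dir="/path/to/saved_model",    # placeholder path
    precision_mode="FP16",                           # placeholder precision
    is_dynamic_op=True)                              # dynamic engine
converter.convert()
converter.save("/path/to/trt_saved_model")           # placeholder path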

Also, please note that the TF1 converter sometimes has issues in dynamic mode. It is highly recommended to use TF2 and TrtGraphConverterV2.
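For comparison, the TF2 path looks roughly like this (again, the paths are placeholders; TrtGraphConverterV2 builds engines dynamically by default):

# Minimal TF2 sketch; assumes a TF 2.x build with TensorRT support.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="/path/to/saved_model")    # placeholder path
converter.convert()
converter.save("/path/to/trt_saved_model")           # placeholder path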

Thank you.

@NVES: the script is already attached. I don’t think I can share the model itself (it is proprietary). There was another helpful suggestion to try with TF v2, which I will try first. Thank you, @spolisetty!

Hi, I am using TF2 and TrtGraphConverterV2 to convert a 6GB transformer saved_model. After running converter.convert(), it works through the ‘optimizations’ phase while the system RAM used by the kernel grows to 19GB, before ending with an error:
ValueError: Message tensorflow.GraphDef exceeds maximum protobuf size of 2GB: 6234880301

Could you please advise?

I am facing a similar issue. May I ask if you have found a way to fix it? Thanks in advance.

Firstly, don’t beat yourself up; the documentation on all of this is TERRIBLE from both NVIDIA and Google. The direct TensorRT conversion will not work on models larger than 2GB (or at least it did not work eight months ago). Technically you are supposed to convert to ONNX format and then use that, if I remember correctly. The process is silly, and the documentation is confusing and dispersed over many websites.

I mainly wanted to use TensorFloat-32 (TF32) to speed up my inference, and I’m going to guess you’d like the same. After days of painful googling I discovered that there are three separate options you can consider for inference optimization: TensorRT, XLA, and JIT compilation. XLA and JIT are already baked into newer versions of TF2. This was not easy to decipher eight months ago, and considering NVIDIA’s arrogant docu-dump style and Google’s passive-aggressive culture, I seriously doubt it has changed.

In a nutshell, XLA and JIT can optimize your code by using TF32 and INT8, but they will not compress your saved model, so if you don’t need your model compressed, there is zero reason to use TensorRT. If you do absolutely need your model compressed (for deployment etc.), you should look into converting to ONNX format and then reading that documentation. If you just want TF32 or INT8 optimization for faster inference/training (no memory constraints), then use XLA or JIT. JIT technically CAN be more optimized than XLA, so you should go through your code and try it out to see if it gets better results (be careful that you can over-optimize, getting very quick, efficient garbage output).

I found that in most cases XLA does a great job of optimizing on its own, and it is the simplest solution as well: you just need to execute your code in the terminal with flags like these added in front: “CUDA_VISIBLE_DEVICES=1 TF_ENABLE_AUTO_MIXED_PRECISION=1 NVIDIA_ENABLE_TF32=0 XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/local/cuda-11.4” (the first forces GPU 1 to be used if you have multiple GPUs; the second enables AMP for automatic TF32; the third turns TF32 on/off, which should be automatic with AMP; the last specifies your CUDA directory for XLA, because the NVIDIA/Google collaboration has clearly broken down).

On a technical note: if your AI model can be compressed in size and still work well, then you could have trained a simpler model to begin with, so maybe you can save time by training a smaller model from the start. For TF32 and INT8 optimization to be really useful, it implies that some of your inference calculations cannot be converted to TF32 or INT8 without outputting garbage. In that case you can’t use compression on your model anyway, and all you need is XLA-AMP or JIT. So I am not sure what the purpose of TensorRT is …

AMP documentation: https://developer.nvidia.com/blog/automatic-mixed-precision-the-turbo-charging-feature-for-faster-ai/
JIT documentation: https://www.tensorflow.org/xla

ONNX conversion documentation: https://onnxruntime.ai/docs/tutorials/tf-get-started.html
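If you would rather set these knobs from Python than via environment variables, here is a rough TF 2.4+ sketch; the mapping to the flags above is my own assumption, not something from the posts:

# Rough Python-level equivalents of the command-line flags above; assumes TF 2.4+.
import tensorflow as tf

tf.config.optimizer.set_jit(True)                                # enable XLA auto-clustering
tf.config.experimental.enable_tensor_float_32_execution(False)   # turn TF32 off (it is on by default on Ampere GPUs)
tf.keras.mixed_precision.set_global_policy("mixed_float16")      # automatic mixed precision for Keras models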


Many thanks for your post and the information in it. I was going down the road of optimizing my model for inference with TensorRT, but as my model is 10GB, I had little to no success. Your post saved me tons of time. Even without setting the environment variables you mentioned, just by setting jit_compile=True in the tf.function, I was able to get around a 50% improvement in inference time. I would not recommend trying to get TensorRT running (especially for larger models) unless one absolutely has no other choice.
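For anyone landing here later, the jit_compile route mentioned above looks roughly like this (the model and input shape are placeholders, assuming TF 2.5+):

# Minimal sketch of XLA-compiling an inference step; the model and shapes are placeholders.
import tensorflow as tf

model = tf.keras.applications.MobileNetV2()     # stand-in for the actual model

@tf.function(jit_compile=True)                  # compile this function with XLA
def infer(batch):
    return model(batch, training=False)

# The first call traces and compiles; later calls reuse the compiled program.
outputs = infer(tf.random.uniform([1, 224, 224, 3]))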