Libprotobuf FATAL Error During TensorFlow Training

Description

A clear and concise description of the bug or issue.

Environment

TensorRT Version:
GPU Type: Tesla T4
Nvidia Driver Version: 450.80.02
CUDA Version: 11.0
CUDNN Version:
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.6
TensorFlow Version (if applicable): 2.4.1
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

ERROR

IVER_API, cbid)failed with error CUPTI could not be loaded or symbol could not be found.
2021-03-08 08:25:35.572927: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 2247026860 exceeds 10% of free system memory.
[libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/wire_format_lite.cc:523] CHECK failed: (value.size()) <= (kint32max):

I get this error when I train on a large dataset. Reducing the batch size did not help, but training works if I reduce the amount of training data.
Can you help me understand the cause of this error? Is there a limit on the amount of training data, or is something else going on?
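
One observation from the log, offered as a quick check rather than a diagnosis: the failed CHECK compares a serialized size against kint32max (2**31 - 1 bytes), and the allocation reported in the warning just before it is larger than that value. Whether that allocation is the same buffer protobuf was trying to serialize is an assumption; the sketch below only compares the two numbers taken from the log.

```python
# Limit that the failed protobuf CHECK compares against (kint32max).
KINT32MAX = 2**31 - 1

# Allocation size copied from the cpu_allocator_impl.cc warning in the log.
allocation_bytes = 2247026860

# Compare the two values; this does not prove they refer to the same buffer.
over_limit = allocation_bytes > KINT32MAX
excess_bytes = allocation_bytes - KINT32MAX

print(f"kint32max     = {KINT32MAX}")
print(f"allocation    = {allocation_bytes}")
print(f"exceeds limit = {over_limit} (by {excess_bytes} bytes)")
```

If the two numbers do describe the same data, that would be consistent with the error disappearing when the training set is made smaller and with batch size having no effect.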

My script uses MirroredStrategy with a distributed dataset to achieve data parallelism across the workers.
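
The script itself is not shown, so the following is only a guess at one possible cause. The tf.data guide notes that building a dataset directly from a large in-memory NumPy array embeds the array in the graph as a constant, which can run into the 2 GB limit of the GraphDef protocol buffer. If that is what happens here (an assumption), one workaround is to stream samples through a generator instead of passing the whole array at once. The sketch below shows the streaming part in plain Python; `iter_samples` and `batched` are hypothetical helper names, not part of TensorFlow.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]


def iter_samples(features: Sequence[Sequence[float]],
                 labels: Sequence[int]) -> Iterator[Sample]:
    """Yield one (feature, label) pair at a time.

    Hypothetical helper: the consumer only ever sees single samples,
    never the full array as one object.
    """
    for x, y in zip(features, labels):
        yield x, y


def batched(samples: Iterable[Sample], batch_size: int) -> Iterator[List[Sample]]:
    """Group a stream of samples into lists of at most batch_size items."""
    batch: List[Sample] = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


if __name__ == "__main__":
    feats = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
    labs = [0, 1, 0]
    for b in batched(iter_samples(feats, labs), batch_size=2):
        print(b)
```

In a TensorFlow pipeline, a generator like `iter_samples` could be wrapped with `tf.data.Dataset.from_generator` so that the data is fed at run time instead of being stored in the graph. Whether this resolves the error in this particular MirroredStrategy setup has not been verified here.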

Hi,
We recommend checking the sample links below, as they may address your concern.

If the issue persists, please share the model and script so that we can try to reproduce the issue on our end.
Thanks!

Hi @guneet.

Based on the above description, this doesn’t look like a TensorRT-related issue.
You may get better help here,

Thank you.

@guneet How were you able to solve this issue? I am getting the same error, and training also works for me if I reduce the dataset.