GPU Type: Tesla T4
Nvidia Driver Version: 450.80.02
CUDA Version: 11.0
Operating System + Version: Ubuntu + 18.04
Python Version (if applicable): 3.6
TensorFlow Version (if applicable): 2.4.1
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
IVER_API, cbid) failed with error CUPTI could not be loaded or symbol could not be found.
2021-03-08 08:25:35.572927: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 2247026860 exceeds 10% of free system memory.
[libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/wire_format_lite.cc:523] CHECK failed: (value.size()) <= (kint32max):
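The fatal CHECK in the log is protobuf's hard per-message size ceiling: `kint32max` is 2^31 − 1 bytes, and no single serialized protobuf message may exceed it. The allocation reported just above it (2,247,026,860 bytes) is already larger than that ceiling, so if a buffer of that size ends up inside one serialized message (for example, a tensor embedded in the graph), serialization must fail. A quick check of the arithmetic from the log:

```python
# protobuf refuses to serialize any single message larger than kint32max bytes
kint32max = 2**31 - 1    # 2147483647, protobuf's per-message ceiling
allocation = 2247026860  # the allocation size reported in the log above

# the reported allocation alone already exceeds the limit, so any single
# message wrapping that buffer cannot be serialized
print(allocation > kint32max)   # -> True
print(allocation - kint32max)   # -> 99543213 bytes over the limit
```

This is consistent with the observed behavior: shrinking the batch size does not shrink a dataset-sized constant, but shrinking the training data does.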
I hit this issue when I train with a large dataset. Reducing the batch size did not help, but if I reduce the amount of training data, it works.
Can you explain what is causing this error? Is there a limit on training-data size, or is something else the reason?
My script uses MirroredStrategy with a distributed dataset to achieve data parallelism across the workers.
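One common trigger of this failure (an assumption on my part; I can't see the input pipeline) is materializing the whole training set as a single in-memory array, which then gets serialized as one protobuf message when the graph or dataset is distributed to the replicas. Feeding the data in shards keeps every serialized chunk under the ~2 GB ceiling. A framework-agnostic sketch of the sharding idea (the name `iter_shards` is illustrative, not a TensorFlow API):

```python
def iter_shards(n_examples, shard_size):
    """Yield (start, stop) index ranges so that each shard of the dataset
    stays well under protobuf's ~2 GB per-message ceiling."""
    for start in range(0, n_examples, shard_size):
        yield start, min(start + shard_size, n_examples)

# e.g. 10 examples in shards of 4 -> [(0, 4), (4, 8), (8, 10)]
shards = list(iter_shards(10, 4))
print(shards)
```

In a TensorFlow pipeline the same effect is usually achieved by streaming from TFRecord files or a generator instead of passing one giant array into the dataset, so no single constant of dataset size is ever embedded in the graph.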