2.5GB of video memory missing in TensorFlow on both Linux and Windows [RTX 3080]

Description

I have a 10GB RTX 3080 GPU. nvidia-smi reports 10014MiB of memory, but TensorFlow reports:

Created device /job:localhost/replica:0/task:0/device:GPU:0 with 7591 MB memory
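To put a number on the gap between the two figures above, here is a small arithmetic sketch. It assumes the "MB" in TensorFlow's log line is really MiB (1024 * 1024 bytes), which is an assumption and not something the log states; the function name is made up for illustration.

```python
MIB = 1024 ** 2  # bytes per MiB


def memory_gap(total_mib, visible_mib):
    """Return the missing memory in MiB, in decimal GB, and the visible fraction."""
    missing_mib = total_mib - visible_mib
    missing_gb = missing_mib * MIB / 1e9  # decimal gigabytes
    visible_fraction = visible_mib / total_mib
    return missing_mib, missing_gb, visible_fraction


# Figures taken from the post: nvidia-smi total vs. TensorFlow's "Created device" line.
missing_mib, missing_gb, visible_fraction = memory_gap(10014, 7591)
print(f"missing: {missing_mib} MiB (~{missing_gb:.2f} GB)")
print(f"visible to TensorFlow: {visible_fraction:.1%}")
```

Under that assumption the gap works out to 2423 MiB, which is where the roughly 2.5GB in the title comes from.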

After initial research I was convinced that this was related to Windows 10 OS limitations, so I installed Ubuntu 20.04 in dual boot. It didn’t change anything. I also tried various versions of TensorFlow, CUDA, and cuDNN.
I tried using:

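import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of reserving it up front.
physical_devices = tf.config.list_physical_devices('GPU')
for gpu_instance in physical_devices:
    tf.config.experimental.set_memory_growth(gpu_instance, True)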

It didn’t fix the problem. Also, I tried:

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 1.0
session = InteractiveSession(config=config)

And indeed, TensorFlow started to report the full 10GB of memory in the ‘Created device’ message, so TF should see the memory properly. With this method I was able to push memory usage to around 8GB, and it even allowed me to run a slightly higher batch size. But if I specify a fraction of more than about 0.8 (it varies slightly from run to run), I get:

2021-09-26 12:48:26.691479: F tensorflow/core/util/cuda_solvers.cc:115] Check failed: cusolverDnCreate(&cusolver_dn_handle) == CUSOLVER_STATUS_SUCCESS Failed to create cuSolverDN instance.
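The fraction-based cap above goes through the TF1 compat session. As a rough sketch only, the same idea can be expressed with the TF2 logical-device API by giving TensorFlow an explicit limit in MiB. The helper names `fraction_to_limit_mib` and `cap_gpu_memory` are hypothetical, and the defaults (10014 MiB total, fraction 0.8) are just the figures from this post. This is not a confirmed fix for the cuSolver failure; it only leaves headroom for libraries that allocate outside TensorFlow's own pool.

```python
def fraction_to_limit_mib(total_mib, fraction):
    """Convert a fraction of total GPU memory into a whole-MiB cap."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return int(total_mib * fraction)


def cap_gpu_memory(total_mib=10014, fraction=0.8):
    """Cap TensorFlow's allocation on every visible GPU.

    Must be called before TensorFlow initializes the GPU, otherwise
    TensorFlow raises a RuntimeError.
    """
    import tensorflow as tf  # imported lazily so the helper above works without TF

    limit = fraction_to_limit_mib(total_mib, fraction)
    for gpu in tf.config.list_physical_devices("GPU"):
        tf.config.set_logical_device_configuration(
            gpu, [tf.config.LogicalDeviceConfiguration(memory_limit=limit)]
        )
    return limit
```

Calling `cap_gpu_memory()` at the very top of a training script, before any tensors are created, would apply the cap. The right fraction is something to find by experiment.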

One important thing to note is that while TensorFlow reports a device with ~7.5GB, nvidia-smi reports more than 9GB used by /usr/bin/python3! I am not running any other Python script in parallel.
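For anyone who wants to compare these two numbers programmatically, here is a minimal sketch that reads per-process GPU memory from nvidia-smi using its CSV query mode. The function names are made up for illustration, and the sample row in the usage note is hypothetical. On Windows in WDDM mode the per-process value may not be numeric, so the parser returns `None` in that case.

```python
import csv
import io
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into a list of dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, name, used = (field.strip() for field in fields)
        rows.append({
            "pid": int(pid),
            "name": name,
            # Non-numeric values (e.g. not available) are mapped to None.
            "used_mib": int(used) if used.isdigit() else None,
        })
    return rows


def query_compute_apps():
    """Run nvidia-smi and return per-process GPU memory usage in MiB."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

For example, `parse_compute_apps("12345, /usr/bin/python3, 9215\n")` yields one entry with `used_mib` equal to 9215, which can then be set against the figure TensorFlow prints in its ‘Created device’ line.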

So in reality the memory usage is reaching its limit while I am able to use only 7.5GB, which is even less than the known 81% limitation for Windows 10 users! Why am I being allocated almost 2GB extra on top that I can’t use?

I have been trying to fix this for a long time and really don’t have any idea what to do now. Other people’s problems with missing TF memory that I found on the Internet were related to the Windows OS; mine is not. Am I missing something? I would really appreciate any idea on what is going on.
Thank you in advance.

Environment

GPU Type: MSI RTX 3080 10GB
Nvidia Driver Version: 470.63.01
CUDA Version: 11.2, 11.4
CUDNN Version: 8.1.0, 8.2.4
Operating System + Version: Ubuntu 20.04, Windows 10
Python Version (if applicable): 3.8.10
TensorFlow Version (if applicable): 2.5.0, 2.5.1, 2.6.0

Steps To Reproduce

The problem arises both with my custom code and with sample TensorFlow scripts from the official tutorials, so it should not be code-dependent.

Hi,

This forum mainly covers updates and issues related to TensorRT. We recommend posting your concern on a TensorFlow-related platform to get better help.

Thank you.

I noticed that other people also have this problem with RTX 3000 series cards. I tried an RTX 2000 series card and didn’t see such a large memory allocation. Could the problem be related to CUDA/cuDNN?

I also raised an issue in the TensorFlow repository about this problem and will post updates here if I get any information.

Hi,

We suggest checking with the TensorFlow team first to see whether they have a workaround for this; we found similar issues on TensorFlow-related platforms. Based on their input, you can then reach out to NVIDIA.

Thank you.

I tested some networks on PyTorch 1.9.1 + CUDA 11.1 and am facing a similar issue. For example, on Windows, right before running the script I have 520MiB / 10240MiB allocated:

[image: GPU memory usage before running the script]

After I run the training I get:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.00 GiB total capacity; 7.39 GiB already allocated; 0 bytes free; 7.44 GiB reserved in total by PyTorch)

nvidia-smi shows that almost all available memory is allocated:

[image: nvidia-smi output after launching the script]

PyTorch info:

So again, a very similar issue. PyTorch allocated 7.39 GiB, while nvidia-smi shows memory usage increased by ~9.5GiB after I launched the script. An extra 2GB is taken for an unknown reason, as with TensorFlow.
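The breakdown can be read straight out of the error text. The sketch below parses the out-of-memory message quoted above and splits the card's capacity into what PyTorch holds and what lies outside its allocator. The regular expression and function name are illustrative and tied to this message format, which may differ in other PyTorch versions.

```python
import re

OOM_MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB "
    "(GPU 0; 10.00 GiB total capacity; 7.39 GiB already allocated; "
    "0 bytes free; 7.44 GiB reserved in total by PyTorch)"
)

# Conversion factors from each unit to GiB.
TO_GIB = {"bytes": 1 / 2**30, "KiB": 1 / 2**20, "MiB": 1 / 2**10, "GiB": 1.0}

PATTERN = re.compile(
    r"([\d.]+) (bytes|KiB|MiB|GiB) (total capacity|already allocated|free|reserved)"
)


def parse_oom(message):
    """Return the memory figures from a PyTorch OOM message, in GiB."""
    return {
        label: float(value) * TO_GIB[unit]
        for value, unit, label in PATTERN.findall(message)
    }


stats = parse_oom(OOM_MESSAGE)
# Memory cached by PyTorch's allocator but not backing live tensors.
cache_gib = stats["reserved"] - stats["already allocated"]
# Capacity that PyTorch's allocator does not hold, yet is reported as not free.
outside_gib = stats["total capacity"] - stats["reserved"]
print(f"allocator cache: {cache_gib:.2f} GiB, outside PyTorch: {outside_gib:.2f} GiB")
```

The "outside PyTorch" figure covers everything the caching allocator does not track, such as the roughly 520MiB already in use before launch and whatever the CUDA context and libraries take; the message alone cannot say how that remainder is divided.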

Hi,

This doesn’t look TensorRT-related. Which container image are you using?

Thank you.

Hi, I’m not using a container. I’m running locally on Windows and on a clean install of Ubuntu 20.04.

Hi,

This looks out of scope for TensorRT. We recommend posting your concern on the forum for the library with which you’re facing the issue.

Thank you.