I am using TensorFlow to run inference on a dataset on Ubuntu. Although it reports a CUDA out-of-memory error, nvidia-smi still shows that the GPU is in use, as shown below:
My code predicts one example at a time, so no batching is used. I am using GPU 0, so the first 47% in the nvidia-smi output is the one my code is using. The error message is below:
INFO:tensorflow:Restoring parameters from /plu/../../model-files/model.ckpt-2683000
2021-09-09 07:49:24.230623: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 15.75G (16914055168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-09-09 07:49:31.674556: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
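For context, TensorFlow normally tries to reserve close to the whole GPU at start-up, which would match the 15.75G allocation attempt in the log above. Below is a minimal sketch of how memory could instead be allocated on demand; it assumes the TF 2.x tf.config API (an older script might use the tf.compat.v1 session-config equivalent shown in the comments):

import tensorflow as tf

# Ask TensorFlow to grow GPU memory usage on demand instead of
# grabbing (almost) the entire 16G device up front.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# Scripts that still build a tf.compat.v1.Session can do the same via:
# config = tf.compat.v1.ConfigProto()
# config.gpu_options.allow_growth = True
# sess = tf.compat.v1.Session(config=config)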
My machine has a lot of memory, as shown below:
free -hm
              total        used        free      shared  buff/cache   available
Mem:           125G         16G        8.3G        1.1G        100G        107G
Swap:            0B          0B          0B
I have 2 questions:
1. Why does nvidia-smi still show the GPU being used normally even though a CUDA out-of-memory error occurred?
2. My machine seems to have plenty of memory. Does this mean that the 107G of available system RAM is not touched, and only the 16G of CUDA (GPU) memory is used, which is what caused the out-of-memory error?
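To make the second question more concrete, this is how I would query the GPU's own memory (as opposed to the system RAM that free reports); tf.config.experimental.get_memory_info only exists in newer TF 2.x releases (2.5+), so the exact call is an assumption about the version:

import tensorflow as tf

# The device the CUDA allocation failure refers to.
print(tf.config.list_physical_devices('GPU'))

# Current and peak device-memory use for GPU 0 (TF 2.5+ only),
# reported separately from the system RAM shown by `free`.
print(tf.config.experimental.get_memory_info('GPU:0'))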