I am using TensorFlow to run inference on a dataset on Ubuntu. Although it reports a CUDA out-of-memory error, nvidia-smi still shows that the GPU is in use, as shown below:
My code predicts one example at a time, so no batching is used. I am using GPU 0, so the first 47% in the nvidia-smi output is the one my code is using. The error message is below:
INFO:tensorflow:Restoring parameters from /plu/../../model-files/model.ckpt-2683000
2021-09-09 07:49:24.230623: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 15.75G (16914055168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-09-09 07:49:31.674556: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
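For context, TensorFlow normally tries to reserve close to the whole GPU at start-up, which would match the 15.75G allocation attempt in the log above. Below is a minimal sketch of how memory could instead be allocated on demand; it assumes the TF 2.x tf.config API (an older script might use the tf.compat.v1 session-config equivalent shown in the comments):

import tensorflow as tf

# Ask TensorFlow to grow GPU memory usage on demand instead of
# grabbing (almost) the entire 16G device up front.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# Scripts that still build a tf.compat.v1.Session can do the same via:
# config = tf.compat.v1.ConfigProto()
# config.gpu_options.allow_growth = True
# sess = tf.compat.v1.Session(config=config)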
My machine has a lot of memory, as shown below:
free -hm
              total        used        free      shared  buff/cache   available
Mem:           125G         16G        8.3G        1.1G        100G        107G
Swap:            0B          0B          0B
I have 2 questions:
1. Why does nvidia-smi still show the GPU being used normally even though a CUDA out-of-memory error occurred?
2. My machine seems to have plenty of memory. Does this mean that the 107G of available system RAM is not touched, and only the 16G of CUDA (GPU) memory is used, which is what caused the out-of-memory error?
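To make the second question more concrete, this is how I would query the GPU's own memory (as opposed to the system RAM that free reports); tf.config.experimental.get_memory_info only exists in newer TF 2.x releases (2.5+), so the exact call is an assumption about the version:

import tensorflow as tf

# The device the CUDA allocation failure refers to.
print(tf.config.list_physical_devices('GPU'))

# Current and peak device-memory use for GPU 0 (TF 2.5+ only),
# reported separately from the system RAM shown by `free`.
print(tf.config.experimental.get_memory_info('GPU:0'))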