I was training a model and interrupted it mid-run to modify the learning rate parameter. On the next run, it started throwing errors during model initialisation:
2023-07-28 19:04:01.272126: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-28 19:04:02.112177: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21676 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:02:00.0, compute capability: 8.6
2023-07-28 19:04:02.522393: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 21.17G (22729785344 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:02.744404: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 19.05G (20456806400 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:02.965392: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 17.15G (18411124736 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:03.174299: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 15.43G (16570011648 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:03.379019: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 13.89G (14913009664 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:03.594192: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 12.50G (13421708288 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:03.813215: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 11.25G (12079537152 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:04.028643: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 10.12G (10871582720 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:04.238775: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 9.11G (9784424448 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:04.451532: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 8.20G (8805982208 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:04.682381: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 7.38G (7925383680 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:04.901825: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 6.64G (7132845056 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:05.115339: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 5.98G (6419560448 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:05.327249: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 5.38G (5777604096 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:05.537412: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 4.84G (5199843328 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:05.749394: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 4.36G (4679858688 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:05.967369: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 3.92G (4211872768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
It keeps retrying the allocation with smaller sizes; sometimes it succeeds around ~8 GB, initialises the model, and training proceeds as usual. But the process is not stable.
I googled a bit and found a partial solution:
# allow growth
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)
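For completeness, the other workaround commonly suggested alongside memory growth is to cap TensorFlow at a fixed VRAM budget instead. I haven't tried it in exactly this form, so treat it as a sketch; the 8192 MB figure is an arbitrary choice matching the ~8 GB allocations that sometimes succeed:

```python
import tensorflow as tf

# Alternative to memory growth: give TensorFlow a hard memory limit so it
# never attempts the large up-front allocation that is failing.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=8192)])
    except RuntimeError as e:
        # Logical devices must be configured before the GPU is initialised
        print(e)
```

Memory growth itself can also be enabled without code changes by setting the environment variable `TF_FORCE_GPU_ALLOW_GROWTH=true` before launching the script.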
With this, it successfully initialises the model and compiles it, but crashes on the first epoch:
2023-07-28 19:02:28.537433: E tensorflow/stream_executor/cuda/cuda_blas.cc:218] failed to create cublas handle: cublas error
2023-07-28 19:02:28.549723: E tensorflow/stream_executor/cuda/cuda_blas.cc:220] Failure to initialize cublas may be due to OOM (cublas needs some free memory when you initialize it, and your deep-learning framework may have preallocated more than its fair share), or may be because this binary was not built with support for the GPU in your machine.
2023-07-28 19:02:28.576924: W tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at matmul_op_impl.h:625 : INTERNAL: Attempting to perform BLAS operation using StreamExecutor without BLAS support
Could not load library cudnn_ops_infer64_8.dll. Error code 1455
Please make sure cudnn_ops_infer64_8.dll is in your library path!
The file was still there, and it's strange that these errors started appearing at all. I reinstalled CUDA, cuDNN, and TensorFlow just in case.
To verify the CUDA + cuDNN + TensorFlow installation, I ran import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000]))), which imports TensorFlow and does some basic tensor manipulation:
2023-07-28 20:30:29.787408: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-28 20:30:30.403115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21676 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:02:00.0, compute capability: 8.6
2023-07-28 20:30:30.654068: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 21.17G (22729785344 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:30.881606: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 19.05G (20456806400 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:31.105718: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 17.15G (18411124736 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:31.335749: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 15.43G (16570011648 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:31.571129: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 13.89G (14913009664 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:31.807253: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 12.50G (13421708288 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:32.043077: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 11.25G (12079537152 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:32.286931: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 10.12G (10871582720 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:32.517767: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 9.11G (9784424448 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:32.786812: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 8.20G (8805982208 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
tf.Tensor(175.96857, shape=(), dtype=float32)
As you can see, it executes the function using CUDA, but throws the same allocation errors first.
I would suspect a VRAM issue, but the OCCT test passes (though it only checks 10 GB), MODS/MATS checks pass, and gaming is fine with no artifacts. I couldn't find any tool to stress-test CUDA by allocating large chunks of memory. I'll get a 2080 Ti soon and try it in this machine, to pin down whether this is a software or a hardware issue.
My specs:
- windows 10
- rtx 3090
- tensorflow <2.11