I was training a model and interrupted it mid-run to modify the learning rate parameter. On the next run, it started throwing errors during model initialisation:
2023-07-28 19:04:01.272126: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-28 19:04:02.112177: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21676 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:02:00.0, compute capability: 8.6
2023-07-28 19:04:02.522393: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 21.17G (22729785344 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:02.744404: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 19.05G (20456806400 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:02.965392: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 17.15G (18411124736 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:03.174299: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 15.43G (16570011648 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:03.379019: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 13.89G (14913009664 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:03.594192: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 12.50G (13421708288 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:03.813215: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 11.25G (12079537152 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:04.028643: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 10.12G (10871582720 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:04.238775: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 9.11G (9784424448 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:04.451532: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 8.20G (8805982208 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:04.682381: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 7.38G (7925383680 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:04.901825: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 6.64G (7132845056 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:05.115339: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 5.98G (6419560448 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:05.327249: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 5.38G (5777604096 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:05.537412: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 4.84G (5199843328 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:05.749394: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 4.36G (4679858688 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 19:04:05.967369: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 3.92G (4211872768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
It keeps retrying the allocation with smaller sizes; sometimes it succeeds around ~8 GB, initialises the model, and training proceeds as usual. But the process is not stable.
I googled a bit and found a partial solution:
# allow growth
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)
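For completeness, the other workaround commonly suggested alongside memory growth is to cap TensorFlow at a fixed VRAM budget instead. I haven't tried it in exactly this form, so treat it as a sketch; the 8192 MB figure is an arbitrary choice matching the ~8 GB allocations that sometimes succeed:

```python
import tensorflow as tf

# Alternative to memory growth: give TensorFlow a hard memory limit so it
# never attempts the large up-front allocation that is failing.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=8192)])
    except RuntimeError as e:
        # Logical devices must be configured before the GPU is initialised
        print(e)
```

Memory growth itself can also be enabled without code changes by setting the environment variable `TF_FORCE_GPU_ALLOW_GROWTH=true` before launching the script.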
With this, it successfully initialises the model and compiles it, but crashes on the first epoch:
2023-07-28 19:02:28.537433: E tensorflow/stream_executor/cuda/cuda_blas.cc:218] failed to create cublas handle: cublas error
2023-07-28 19:02:28.549723: E tensorflow/stream_executor/cuda/cuda_blas.cc:220] Failure to initialize cublas may be due to OOM (cublas needs some free memory when you initialize it, and your deep-learning framework may have preallocated more than its fair share), or may be because this binary was not built with support for the GPU in your machine.
2023-07-28 19:02:28.576924: W tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at matmul_op_impl.h:625 : INTERNAL: Attempting to perform BLAS operation using StreamExecutor without BLAS support
Could not load library cudnn_ops_infer64_8.dll. Error code 1455
Please make sure cudnn_ops_infer64_8.dll is in your library path!
The file was still there, and it's strange that these errors started appearing at all. I reinstalled CUDA, cuDNN, and TensorFlow just in case.
To verify the CUDA + cuDNN + TensorFlow installation, I ran import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000]))), which imports TensorFlow and does some basic tensor manipulation:
2023-07-28 20:30:29.787408: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-28 20:30:30.403115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21676 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:02:00.0, compute capability: 8.6
2023-07-28 20:30:30.654068: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 21.17G (22729785344 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:30.881606: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 19.05G (20456806400 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:31.105718: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 17.15G (18411124736 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:31.335749: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 15.43G (16570011648 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:31.571129: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 13.89G (14913009664 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:31.807253: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 12.50G (13421708288 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:32.043077: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 11.25G (12079537152 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:32.286931: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 10.12G (10871582720 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:32.517767: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 9.11G (9784424448 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-07-28 20:30:32.786812: I tensorflow/stream_executor/cuda/cuda_driver.cc:733] failed to allocate 8.20G (8805982208 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
tf.Tensor(175.96857, shape=(), dtype=float32)
As you can see, it executes the function using CUDA, but throws the same allocation errors first.
I would suspect a VRAM issue, but the OCCT test passes (though it only checks 10 GB), MODS/MATS checks pass, and gaming is fine with no artifacts. I couldn't find any tool to stress-test CUDA by allocating large chunks of memory. I'll get a 2080 Ti soon and try it in this machine, to pin down whether this is a software or a hardware issue.
My specs:
- windows 10
- rtx 3090
- tensorflow <2.11