torch.OutOfMemoryError: CUDA out of memory when training model

I was training a GPT2 language model (on a Tesla T4 with 16 GB of memory) and would occasionally run into this error:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.00 MiB. GPU 0 has a total capacity of 14.57 GiB of which 8.75 MiB is free. Including non-PyTorch memory, this process has 14.56 GiB memory in use. Of the allocated memory 14.41 GiB is allocated by PyTorch, and 14.42 MiB is reserved by PyTorch but unallocated.

I found that the error only came up when the batch size or the seq_length was too large (originally seq_length 128 and batch size 128), so lowering these solved the issue (I lowered the batch size to 16).
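For reference, this is roughly how I measure peak memory per step to see how it scales with batch size and seq_length (a simplified sketch rather than my actual training loop, and it assumes the HuggingFace transformers GPT2 implementation):

```python
import torch
from transformers import GPT2LMHeadModel

device = torch.device("cuda")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

batch_size, seq_length = 16, 128  # the values that currently fit on the T4
input_ids = torch.randint(0, model.config.vocab_size,
                          (batch_size, seq_length), device=device)

torch.cuda.reset_peak_memory_stats()
loss = model(input_ids, labels=input_ids).loss  # dummy batch, labels = inputs
loss.backward()
optimizer.step()
optimizer.zero_grad()

print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 2**30:.2f} GiB")
```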

I understand that PyTorch tries to allocate as much GPU memory as it needs for training, since even torch.cuda.set_per_process_memory_fraction(0.9) or lower caused the error to reappear. Would a GPU with more memory therefore avoid this issue and be able to handle larger batch sizes/seq_lengths?
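For context, this is how I've been setting the cap and checking what CUDA actually reports as free (device 0 assumed):

```python
import torch

torch.cuda.set_per_process_memory_fraction(0.9, device=0)  # cap this process at ~90% of GPU 0
free_bytes, total_bytes = torch.cuda.mem_get_info(0)        # free/total memory as seen by CUDA
print(f"free: {free_bytes / 2**30:.2f} GiB of {total_bytes / 2**30:.2f} GiB")
```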

I was also wondering if anyone knows the exact equations/math needed to calculate memory usage from the model and training configs, so that I can set these values correctly ahead of time instead of running into the error, and optimize the training.
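Here's my rough back-of-the-envelope attempt (assuming fp32 training with Adam, and ignoring the CUDA context, temporary buffers, and fragmentation), but I'm not confident the activation term is right, which is why I'm asking for the exact math:

```python
# Very rough estimate of training memory for fp32 + Adam; the activation
# factor of 12 is a guess on my part, not something I've verified.
def estimate_training_memory_gib(n_params, n_layers, hidden_size,
                                 batch_size, seq_length, bytes_per_value=4):
    weights = n_params * bytes_per_value           # model parameters
    gradients = n_params * bytes_per_value         # one gradient per parameter
    optimizer = 2 * n_params * bytes_per_value     # Adam states: exp_avg + exp_avg_sq
    # Activations saved for backward scale with batch_size * seq_length * hidden_size
    # per layer; the constant depends on the implementation (attention, MLP, dropout...).
    activations = batch_size * seq_length * hidden_size * n_layers * bytes_per_value * 12
    return (weights + gradients + optimizer + activations) / 2**30

# GPT-2 small: ~124M params, 12 layers, hidden size 768
print(estimate_training_memory_gib(124_000_000, 12, 768, batch_size=128, seq_length=128))
print(estimate_training_memory_gib(124_000_000, 12, 768, batch_size=16, seq_length=128))
```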