torch.OutOfMemoryError: CUDA out of memory when training model

I was training a GPT2 language model (on a Tesla T4 with 16 GB of memory) and would occasionally run into this error:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.00 MiB. GPU 0 has a total capacity of 14.57 GiB of which 8.75 MiB is free. Including non-PyTorch memory, this process has 14.56 GiB memory in use. Of the allocated memory 14.41 GiB is allocated by PyTorch, and 14.42 MiB is reserved by PyTorch but unallocated.

I found that the error only came up when the batch size or the seq_length was too large (originally seq_length 128 and batch size 128), so lowering these solved the issue (I lowered the batch size to 16).
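For reference, this is roughly how I measure peak memory per step to see how it scales with batch size and seq_length (a simplified sketch rather than my actual training loop, and it assumes the HuggingFace transformers GPT2 implementation):

```python
import torch
from transformers import GPT2LMHeadModel

device = torch.device("cuda")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

batch_size, seq_length = 16, 128  # the values that currently fit on the T4
input_ids = torch.randint(0, model.config.vocab_size,
                          (batch_size, seq_length), device=device)

torch.cuda.reset_peak_memory_stats()
loss = model(input_ids, labels=input_ids).loss  # dummy batch, labels = inputs
loss.backward()
optimizer.step()
optimizer.zero_grad()

print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 2**30:.2f} GiB")
```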

I understand that PyTorch tries to allocate as much GPU memory as it needs for training, since even torch.cuda.set_per_process_memory_fraction(0.9) or lower caused the error to reappear. Would a GPU with more memory therefore avoid this issue and be able to handle larger batch sizes/seq_lengths?
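For context, this is how I've been setting the cap and checking what CUDA actually reports as free (device 0 assumed):

```python
import torch

torch.cuda.set_per_process_memory_fraction(0.9, device=0)  # cap this process at ~90% of GPU 0
free_bytes, total_bytes = torch.cuda.mem_get_info(0)        # free/total memory as seen by CUDA
print(f"free: {free_bytes / 2**30:.2f} GiB of {total_bytes / 2**30:.2f} GiB")
```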

I was also wondering if anyone knows the exact equations/math needed to calculate memory usage from the model and training configs, so that I can set these values correctly ahead of time instead of running into the error, and optimize the training.
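Here's my rough back-of-the-envelope attempt (assuming fp32 training with Adam, and ignoring the CUDA context, temporary buffers, and fragmentation), but I'm not confident the activation term is right, which is why I'm asking for the exact math:

```python
# Very rough estimate of training memory for fp32 + Adam; the activation
# factor of 12 is a guess on my part, not something I've verified.
def estimate_training_memory_gib(n_params, n_layers, hidden_size,
                                 batch_size, seq_length, bytes_per_value=4):
    weights = n_params * bytes_per_value           # model parameters
    gradients = n_params * bytes_per_value         # one gradient per parameter
    optimizer = 2 * n_params * bytes_per_value     # Adam states: exp_avg + exp_avg_sq
    # Activations saved for backward scale with batch_size * seq_length * hidden_size
    # per layer; the constant depends on the implementation (attention, MLP, dropout...).
    activations = batch_size * seq_length * hidden_size * n_layers * bytes_per_value * 12
    return (weights + gradients + optimizer + activations) / 2**30

# GPT-2 small: ~124M params, 12 layers, hidden size 768
print(estimate_training_memory_gib(124_000_000, 12, 768, batch_size=128, seq_length=128))
print(estimate_training_memory_gib(124_000_000, 12, 768, batch_size=16, seq_length=128))
```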