Different orders of training same data causes OOM?

ud_heller1989 · January 25, 2024, 3:01am

Hi Folks,
I am trying to debug a CUDA out-of-memory (OOM) issue when training on a V100 16 GB GPU on an Ubuntu 22.04 machine. I created a very small dataset of only 10 examples by subsampling from a larger one, and I am training a text classification model. The architecture is a transformer encoder + a multi-head attention layer + a classification head.

I noticed something very interesting: when I feed the data to the GPU sequentially (using the PyTorch sequential sampler), I get an OOM error, but when I feed it in random order (using the PyTorch random sampler), I get no OOM error. Here “feed” means running a forward/backward pass on one example: I use a batch size of 1, limit the input to fewer than 6500 tokens, and train on a single GPU.

Does anyone know what could be the reason behind this? And is there a way to avoid the OOM errors? Thanks!
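For a rough sense of scale, here is a minimal back-of-the-envelope sketch of how much memory one attention score tensor could take at the 6500-token limit mentioned above. It assumes the model materializes the full (batch, heads, seq_len, seq_len) score matrix in fp32, which fused attention kernels avoid. The head count of 12 is a hypothetical value, since the post does not give the model configuration, and `attention_scores_bytes` is an illustrative helper, not part of any library.

```python
def attention_scores_bytes(seq_len, num_heads, batch_size=1, bytes_per_element=4):
    """Bytes needed for one (batch, heads, seq_len, seq_len) attention score tensor."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


# Values from the post: batch size 1, inputs capped at 6500 tokens.
SEQ_LEN = 6500
BATCH_SIZE = 1

# Assumptions (not stated in the post): 12 heads, fp32 activations.
NUM_HEADS = 12
BYTES_FP32 = 4

one_layer = attention_scores_bytes(SEQ_LEN, NUM_HEADS, BATCH_SIZE, BYTES_FP32)
print(f"score tensor per attention layer: {one_layer / 2**30:.2f} GiB")

# The cost is quadratic in sequence length: half the tokens, a quarter of the memory.
half_len = attention_scores_bytes(SEQ_LEN // 2, NUM_HEADS, BATCH_SIZE, BYTES_FP32)
print(f"same tensor at {SEQ_LEN // 2} tokens: {half_len / 2**30:.2f} GiB")
```

Under these assumptions a single long example can account for a large share of a 16 GB card once several layers and the backward pass are included, so a run may sit close enough to the limit that small differences in allocator state decide whether it fits.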

MarkusHoHo · January 25, 2024, 11:24am

Hi there,

I just had a quick look at your other posts and I wonder if this is all still related to the different drivers/GPUs?

For one, there is some difference between the PCIe and the SXM versions of the GPU since the bus connection is different, meaning there can be differences in memory saturation depending on bandwidth. The driver definitely makes a difference, although a 20% increase in performance would be surprising.

Coming back to the post here, I am not equipped to answer this. Our AI category (where you posted your original question) would be better suited, so I suggest following up with AakankshaS with some more detail. Even if you are not using TensorRT, they might have some insight on the effect of batch sizes, layering, etc.

Asking for “the reason behind it” in the context of deep-learning algorithms where you change the training parameters is rather difficult to answer unless someone looks at your code and setup and spends a considerable amount of time experimenting.

ud_heller1989 · January 29, 2024, 3:01am

Hi Markus!
Sorry for the delay and thanks for the prompt response! Yeah, this question is related to my previous one, but it is somewhat different. In my other question, I was basically asking why the token limit that triggers OOM differs so much between GPUs of the same type. This second question is about why, for the same data, the order in which it is fed to the GPU can make a difference: with one order it causes OOM, but with another order it doesn’t. Sorry, I cannot post the entire code here as it is owned by a commercial company. I was just curious if anyone else has seen anything similar or knows if this is possible. Thanks!
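One way to narrow this down without sharing the model is to make every feeding order explicit and reproducible, then replay the order that fails. The sketch below is only an illustration: the token counts are made up, and the helper names (`sequential_order`, `random_order`, `longest_first_order`) are hypothetical. In PyTorch, a fixed list of indices like these can typically be passed to `DataLoader` through its `sampler` argument, and `RandomSampler` accepts a seeded `generator`.

```python
import random


def sequential_order(n):
    """Indices in dataset order, as a sequential sampler would yield them."""
    return list(range(n))


def random_order(n, seed):
    """A shuffled order that is identical on every run for a given seed."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order


def longest_first_order(token_counts):
    """Indices sorted so the example with the most tokens is processed first."""
    return sorted(range(len(token_counts)), key=lambda i: token_counts[i], reverse=True)


# Hypothetical token counts for the 10 examples (all under the 6500-token cap).
token_counts = [6400, 1200, 300, 5100, 2500, 6499, 800, 4000, 150, 3300]

orders = {
    "sequential": sequential_order(len(token_counts)),
    "random(seed=0)": random_order(len(token_counts), seed=0),
    "longest-first": longest_first_order(token_counts),
}

# Print each order next to the lengths it produces, so a failing step
# can be matched to the example (and sequence length) that triggered it.
for name, order in orders.items():
    print(name, [(i, token_counts[i]) for i in order])
```

Running the same training loop over each fixed order and noting the step at which OOM occurs would show whether the failure follows a particular example or a particular sequence of lengths. Longest-first is included only as an experiment: if the largest example fits on a fresh GPU but fails later in another order, that points at allocator state rather than the example itself.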
