Different orders of training on the same data cause OOM?

Hi Folks,
I am trying to debug a CUDA out-of-memory (OOM) error when training on a V100 16 GB GPU on an Ubuntu 22.04 machine. I created a very small dataset of only 10 examples by subsampling from a larger dataset, and I am training a text classification model. The model architecture is a transformer encoder + multi-head attention + a classification head. I noticed something very interesting: when I feed the data to the GPU sequentially (using the PyTorch sequential sampler), I get an OOM error. But when I feed the data in random order (using the PyTorch random sampler), I get no OOM error. Here “feed” means doing a forward/backward pass on one example (I used a batch size of 1, limited the number of input tokens to fewer than 6500, and am training on only 1 GPU). Does anyone know what could be the reason behind this? And is there a way to solve this (to have no OOM errors)? Thanks!

Hi there,

I just had a quick look at your other posts and I wonder if this is all still related to the different drivers/GPUs?

For one, there is some difference between the PCIe and the SXM versions of the GPU, since the bus connection is different, meaning there can be differences in memory saturation depending on bandwidth. The driver definitely makes a difference, although a 20% increase in performance would be surprising.

Coming back to the post here, I am not equipped to answer this. Our AI category (where you posted your original question) would be better suited, so I suggest following up with AakankshaS with some more detail. Even if you are not using TensorRT, they might have some insight on the effect of batch sizes, layering, etc.

Asking for “the reason behind it” in the context of deep-learning algorithms where you change the training parameters is rather difficult to answer unless someone looks at your code and setup and spends a considerable amount of time experimenting.

Hi Markus!
Sorry for the delay and thanks for the prompt response! Yeah, this question is related to my previous question, but it is somewhat different. In my other question, I was basically asking why there is a big difference in the token limit that causes OOM on the same type of GPU. My second question is more about how, for the same data, the order in which it is fed to the GPU could make a difference in causing OOM errors. I.e., with one order it could cause OOM, but with another order it won’t. Sorry, I cannot post the entire code here, as it is owned by a commercial company. I was just curious if anyone else has seen anything similar or knows if this is possible. Thanks!
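
On the "is this possible" part, here is a minimal sketch of how processing order alone can change whether memory runs out. It assumes a deliberately simplified caching allocator that reuses freed blocks whole and never splits, merges, or returns them to the device. This is a toy model, not PyTorch's actual allocator, and the names `ToyCachingAllocator` and `run` are hypothetical. It only illustrates the principle and does not diagnose the case above.

```python
from typing import List


class ToyCachingAllocator:
    """Toy model: freed blocks are cached and reused whole.

    Assumption (not PyTorch behavior): blocks are never split,
    merged, or handed back to the device.
    """

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity          # total device memory, arbitrary units
        self.reserved = 0                 # memory obtained from the device so far
        self.free_blocks: List[int] = []  # sizes of cached, currently unused blocks

    def alloc(self, size: int) -> int:
        # Reuse the smallest cached block that is large enough (best fit).
        fitting = [b for b in self.free_blocks if b >= size]
        if fitting:
            block = min(fitting)
            self.free_blocks.remove(block)
            return block
        # No cached block fits, so a new block must come from the device.
        if self.reserved + size > self.capacity:
            raise MemoryError(
                f"toy OOM: need {size}, reserved {self.reserved}/{self.capacity}"
            )
        self.reserved += size
        return size

    def free(self, block: int) -> None:
        # Keep the block in the cache instead of returning it to the device.
        self.free_blocks.append(block)


def run(sizes: List[int], capacity: int) -> int:
    """Process one 'example' at a time; return total reserved memory."""
    allocator = ToyCachingAllocator(capacity)
    for size in sizes:
        block = allocator.alloc(size)  # stand-in for one forward/backward pass
        allocator.free(block)          # buffers are released after the step
    return allocator.reserved
```

In this toy model, `run([6, 5, 4, 3], 10)` finishes with 6 units reserved because the first block is reused for every later example, while `run([3, 4, 5, 6], 10)` raises `MemoryError` because no cached block is ever large enough for the next example. A real allocator is more sophisticated, so the actual cause in a given training run would still need to be confirmed by measuring memory per step.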