GPU memory requirements during training

For TLTv3 and TAO, I need to understand how much GPU memory is required to train the various available models and how that scales with batch size. I see that the logs contain a description of the model as well as a summary of the trainable and non-trainable parameters, so, given assumptions about the data type used to represent the parameters, we can calculate the size of the model. However, it’s unclear how much GPU memory TLT/TAO needs for performing optimization and how much remains for loading training data. How do I find this information?
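The parameter-size calculation described above can be sketched as follows. This is a minimal illustration, not something TLT/TAO provides: the function name, the dtype table, and the example counts are hypothetical, and the data type actually used during training is an assumption you would need to confirm.

```python
# Bytes per parameter for common data types.
# Which dtype TLT/TAO actually uses for a given model is an assumption here.
BYTES_PER_DTYPE = {"fp32": 4, "fp16": 2, "int8": 1}


def param_memory_bytes(trainable, non_trainable, dtype="fp32"):
    """Return the memory needed to store all parameters once, in bytes."""
    return (trainable + non_trainable) * BYTES_PER_DTYPE[dtype]


def to_mib(num_bytes):
    """Convert bytes to MiB."""
    return num_bytes / (1024 ** 2)


# Hypothetical counts, standing in for the values printed in the training log.
size = param_memory_bytes(trainable=11_000_000, non_trainable=50_000, dtype="fp32")
print(f"Parameters alone: {to_mib(size):.1f} MiB")
```

This only covers a single copy of the weights. Gradients, optimizer state, activations, and the input batch are all additional.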

Refer to TAO Toolkit Quick Start Guide — TAO Toolkit 3.22.02 documentation and
IVA Getting Started Guide :: Metropolis Documentation

Hi Morganh. It appears those links point to the first page of the user manuals. Were you intending to link to a specific section of the documentation?

Hi Morganh. I’m realizing now that you were intending to link to the minimum and recommended hardware requirements. That’s not the information I need. I am trying to understand how much memory is required for the model parameters and the optimization process.

The larger question I’m trying to answer is, how many RTX 2080 GPUs with 11GB of RAM would be required to match the training performance of 4 A6000 GPUs with 48GB of RAM or 8 A100 GPUs with 40GB of RAM?

It is hard to say exactly how much GPU memory is required for the model parameters and the optimization process. TAO provides the minimum and recommended GPU RAM to avoid OOM (out-of-memory) issues during training.

Are there tools in TLT/TAO to analyze memory usage?

Inside TLT/TAO, there is no specific memory usage tool. Just use nvidia-smi to monitor.
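One way to monitor with nvidia-smi from a script is to call its CSV query mode and parse the result. The sketch below assumes nvidia-smi is on the PATH; the helper names are hypothetical, and the values are reported in MiB.

```python
import subprocess

# Query used and total memory per GPU as plain CSV (values in MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse one 'used, total' line per GPU into (used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(value) for value in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Run nvidia-smi once and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling query_gpu_memory() before training and then periodically during training gives a per-GPU delta. Some frameworks reserve GPU memory in advance, so the reported figure can exceed what the model strictly needs.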

In addition, you can compare “NvMapMemUsed” before and during the run by running the following command:
$ cat /proc/meminfo

NvMapMemUsed: 75020 KB -> (Before running)
NvMapMemUsed: 392788 KB -> (Running)

The above is just an example. It means the run consumes 392788 KB - 75020 KB = 317768 KB of GPU memory.
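The before/during comparison can be automated with a small script. This is a sketch that assumes the platform exposes an NvMapMemUsed field in /proc/meminfo, as in the example; the function names are hypothetical.

```python
def meminfo_kb(meminfo_text, field="NvMapMemUsed"):
    """Return the value (in KB) of one field from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            return int(rest.split()[0])
    raise KeyError(field)


def read_meminfo(path="/proc/meminfo"):
    """Read the raw contents of /proc/meminfo."""
    with open(path) as handle:
        return handle.read()


# Values taken from the example above.
before = meminfo_kb("NvMapMemUsed: 75020 KB")
running = meminfo_kb("NvMapMemUsed: 392788 KB")
print(f"GPU memory consumed: {running - before} KB")
```

In practice, call meminfo_kb(read_meminfo()) once before starting training and again while it is running, then subtract.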

Ultimately I need a break down identifying how much GPU memory is used for:

  1. Model parameters
  2. Optimization process parameters
  3. Training samples

It doesn’t appear that this last option provides that.

There is no such tool to break down the above items as of now.
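In the absence of such a tool, a rough back-of-envelope estimate of the three items can be computed by hand. The sketch below is not how TLT/TAO accounts for memory. It assumes fp32 values, one gradient per parameter, and an optimizer that keeps two extra values per parameter (as Adam does; SGD with momentum keeps one). It ignores activations, workspace buffers, and framework overhead, which are often the largest part, so treat the result as a lower bound. All names and example numbers are hypothetical.

```python
def estimate_training_memory_bytes(num_params, batch_size, sample_shape,
                                   param_bytes=4, optimizer_slots=2,
                                   sample_bytes=4):
    """Lower-bound estimate of GPU memory by category, in bytes.

    Excludes activations, workspace memory, and framework overhead.
    """
    weights = num_params * param_bytes
    gradients = num_params * param_bytes
    optimizer_state = num_params * param_bytes * optimizer_slots

    # Number of values in one training sample, e.g. channels * height * width.
    elements_per_sample = 1
    for dim in sample_shape:
        elements_per_sample *= dim
    batch = batch_size * elements_per_sample * sample_bytes

    return {
        "model_parameters": weights,
        "optimization": gradients + optimizer_state,
        "training_samples": batch,
        "total": weights + gradients + optimizer_state + batch,
    }


# Hypothetical model: 1M parameters, batch of 8 RGB images at 224x224.
estimate = estimate_training_memory_bytes(1_000_000, 8, (3, 224, 224))
for name, value in estimate.items():
    print(f"{name}: {value / 1024 ** 2:.1f} MiB")
```

Comparing this estimate with the delta measured by nvidia-smi gives a sense of how much memory goes to activations and overhead, and how that remainder grows with batch size.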