GPU memory requirements during training

Dustin.Webb · April 28, 2022, 9:07pm

For TLTv3 and TAO, I need to understand how much GPU memory is required to train the various available models and how that scales with batch size. I see that the logs contain a description of the model as well as a summary of the trainable and non-trainable parameters, so, given assumptions about the data type used to represent the parameters, we can calculate the size of the model. However, it’s unclear how much GPU memory TLT/TAO needs for performing optimization and how much remains for loading training data. How do I find this information?
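To make the calculation described above concrete, here is a minimal sketch: weight memory is the parameter count multiplied by the bytes per parameter. The counts used below are hypothetical placeholders, not values from any TAO log, and the function name is made up for illustration; substitute the trainable and non-trainable totals that the training log prints.

```python
# Bytes per element for common parameter data types.
BYTES_PER_DTYPE = {"float32": 4, "float16": 2, "int8": 1}


def parameter_bytes(trainable, non_trainable, dtype="float32"):
    """Return the memory needed to store the model weights, in bytes."""
    return (trainable + non_trainable) * BYTES_PER_DTYPE[dtype]


# Hypothetical counts for illustration only; read the real ones from the log.
trainable, non_trainable = 11_000_000, 50_000
size_mib = parameter_bytes(trainable, non_trainable) / 2**20
print(f"weights: {size_mib:.1f} MiB")
```

This covers only the stored weights. Gradients, optimizer state, and activations come on top of it.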

Morganh · April 29, 2022, 1:49am

Refer to TAO Toolkit Quick Start Guide — TAO Toolkit 3.22.05 documentation and
IVA Getting Started Guide :: Metropolis Documentation

Dustin.Webb · April 29, 2022, 2:57am

Hi Morganh. It appears those links point to the first page of the user manuals. Were you intending to link to a specific section of the documentation?

Dustin.Webb · April 29, 2022, 3:08am

Hi Morganh. I’m realizing now that you were intending to link to the minimum and recommended hardware requirements. That’s not the information I need. I am trying to understand how much memory is required for the model parameters and the optimization process.

The larger question I’m trying to answer is, how many RTX 2080 GPUs with 11GB of RAM would be required to match the training performance of 4 A6000 GPUs with 48GB of RAM or 8 A100 GPUs with 40GB of RAM?

Morganh · April 29, 2022, 6:53am

It is hard to say exactly how much GPU memory is required for the model parameters and the optimization process. TAO provides the minimum and recommended GPU RAM needed to avoid out-of-memory (OOM) issues during training.

Dustin.Webb · April 29, 2022, 9:22pm

Are there tools in TLT/TAO to analyze memory usage?

Morganh · May 9, 2022, 2:11am

Inside TLT/TAO, there is no specific memory usage tool. Just use nvidia-smi to monitor GPU memory while training runs.
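One possible way to script the nvidia-smi monitoring is sketched below. It assumes nvidia-smi is on the PATH and uses its CSV query mode; the function names `parse_gpu_memory` and `poll_peak` are made up for this example and are not part of TLT/TAO.

```python
import subprocess
import time

# nvidia-smi query mode: one CSV line per GPU, values in MiB, no header.
QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse nvidia-smi CSV output into {gpu_index: (used_mib, total_mib)}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        usage[int(index)] = (int(used), int(total))
    return usage


def poll_peak(seconds=60, interval=1.0):
    """Sample nvidia-smi repeatedly and return the peak used MiB per GPU."""
    peak = {}
    deadline = time.time() + seconds
    while time.time() < deadline:
        out = subprocess.run(QUERY, capture_output=True, text=True,
                             check=True).stdout
        for index, (used, _total) in parse_gpu_memory(out).items():
            peak[index] = max(peak.get(index, 0), used)
        time.sleep(interval)
    return peak
```

Calling `poll_peak(seconds=300)` in a second terminal while training runs would report the highest reading seen per GPU. Note that if the framework pre-allocates GPU memory (TensorFlow does this by default), the figure reflects the reservation rather than what the model strictly needs.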

Morganh · May 9, 2022, 3:53am

In addition, you can compare the “NvMapMemUsed” value before and during the run with the following command:
$ cat /proc/meminfo

NvMapMemUsed: 75020 KB —> (Before running)
NvMapMemUsed: 392788 KB —> (Running)

The above is just an example. It means the run consumes 392788 KB - 75020 KB = 317768 KB of GPU memory.
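The before/during comparison can be scripted along these lines. This is only a sketch: the helper names are hypothetical, and it assumes the NvMapMemUsed field is present in /proc/meminfo, which as far as I know is the case on Jetson-class devices but may not be on a desktop GPU system (the helper returns None there).

```python
import re


def nvmap_used_kb(meminfo_text):
    """Return the NvMapMemUsed value in KB, or None if the field is absent."""
    match = re.search(r"^NvMapMemUsed:\s+(\d+)\s*kB", meminfo_text,
                      re.MULTILINE | re.IGNORECASE)
    return int(match.group(1)) if match else None


def read_nvmap_used_kb(path="/proc/meminfo"):
    """Read the current NvMapMemUsed value from the live system."""
    with open(path) as f:
        return nvmap_used_kb(f.read())


# The two readings quoted above, fed in as text.
before = nvmap_used_kb("NvMapMemUsed:      75020 kB\n")
during = nvmap_used_kb("NvMapMemUsed:     392788 kB\n")
print(f"delta: {during - before} KB")  # prints "delta: 317768 KB"
```

In practice you would call `read_nvmap_used_kb()` once before starting the run and again while it is running, then subtract.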

Dustin.Webb · May 10, 2022, 4:01pm

Ultimately I need a breakdown identifying how much GPU memory is used for:

Model parameters
Optimization process parameters
Training samples

It doesn’t appear that the /proc/meminfo approach provides that.
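In the absence of a tool, a back-of-the-envelope estimate of those three buckets might look like the sketch below. Everything here is an assumption for illustration: the function name, the parameter count, the input shape, and the optimizer slot counts (2 extra tensors per parameter for Adam, 1 for SGD with momentum, 0 for plain SGD). Gradients are counted in the optimizer bucket. Intermediate activations are not included, and they often dominate, so treat the result as a lower bound rather than a measurement.

```python
def training_memory_estimate(num_params, batch_size, sample_shape,
                             bytes_per_value=4, optimizer_slots=2):
    """Rough lower bound, in bytes, for parameters, optimizer, and samples.

    optimizer_slots is the number of extra per-parameter tensors the
    optimizer keeps (assumed: 2 for Adam, 1 for SGD with momentum).
    Activations are NOT modeled here.
    """
    sample_values = 1
    for dim in sample_shape:
        sample_values *= dim
    param_bytes = num_params * bytes_per_value
    return {
        "parameters": param_bytes,
        # One gradient tensor plus the optimizer's own slots per parameter.
        "optimizer": param_bytes * (1 + optimizer_slots),
        "samples": batch_size * sample_values * bytes_per_value,
    }


# Hypothetical model size and input resolution, for illustration only.
estimate = training_memory_estimate(11_000_000, batch_size=16,
                                    sample_shape=(3, 544, 960))
for name, size in estimate.items():
    print(f"{name:>10}: {size / 2**20:8.1f} MiB")
```

Only the samples bucket scales with batch size in this sketch. The activation memory that is left out also scales with batch size, which is why measured usage grows faster than this estimate suggests.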

Morganh · May 11, 2022, 7:08am

There is no tool that breaks down the above items as of now.

yingliu · July 6, 2022, 6:20am

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

system · July 20, 2022, 6:21am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.