Training a TLT model with multiple computers

I have two computers, each of which has multiple GPUs. Is it possible to train one TLT model using all the GPUs in both computers simultaneously?

It is possible. This falls under the area of "multi-node training".

Is there documentation that explains how to do so?

I will check if it is available. Currently, the TLT documentation does not cover it.

Please search for "multinode" on Google. One reference blog: Validating Distributed Multi-Node Autonomous Vehicle AI Training with NVIDIA DGX Systems on OpenShift with DXC Robotic Drive | NVIDIA Developer Blog

@Morganh, are you suggesting that the article you linked to explains how to do multi-node training with TLT? I don’t see anything helpful in it. I have tried various searches using terms like “multinode”, “multi node”, “TLT”, and “Transfer Learning Toolkit” but have found nothing outlining how to do multi-node training with TLT.

No, that article does not explain how to do multi-node training with TLT.
Officially, TLT does not announce multi-node training as a feature, so end users will not find any guide for it.
But multi-node is really just an environment for training. What I shared above are references on how to set up that environment. You can also search for “ngc mpirun”. Then check whether you can set up multi-node and run a training job that is not TLT; see the sketch after the links. For example,
https://assets.ext.hpe.com/is/content/hpedam/a50000191enw
https://docs.abci.ai/en/ngc/
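
To make the suggestion concrete, here is a minimal multi-node, data-parallel training sketch using Horovod with TensorFlow, one common way to run MPI-launched training across machines. This is my own illustration, not a TLT workflow or anything from the linked guides; the hostnames (node1, node2), GPU counts, and network interface in the launch command are placeholders you would replace with your own.

```python
# Minimal multi-node data-parallel training sketch with Horovod + TensorFlow.
# NOT a TLT workflow -- just a way to confirm that MPI training spans both machines.
# Assumed launch from one node (node1/node2 and eth0 are placeholders):
#   mpirun -np 4 -H node1:2,node2:2 -bind-to none -map-by slot \
#          -x NCCL_SOCKET_IFNAME=eth0 python train_multinode.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU across both nodes

# Pin each process to a single local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate by world size and wrap the optimizer so
# gradients are averaged across all processes every step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

# Broadcast initial weights from rank 0 so every process starts identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

model.fit(x_train, y_train, batch_size=64, epochs=1,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
```

If this runs and you see all four ranks training, the MPI/NCCL environment spanning both computers is working, which is the prerequisite regardless of which framework you train with.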

I’m unclear as to what you mean when you say that I can check whether I can set up multi-node to run training that is not TLT. Can you please clarify?

I mean, please set up the environment for multi-node training first. Once it is ready, you can run training across your two computers, no matter what the training is: TensorFlow, PyTorch, or others. A quick way to verify the environment is sketched below.
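
For example, here is a minimal sanity check (my sketch, not part of any official guide) that verifies MPI processes actually launch on both machines. It assumes mpi4py is installed on both nodes and that the nodes can reach each other over passwordless SSH; the hostnames in the launch command are placeholders.

```python
# check_cluster.py -- prints one line per MPI process so you can confirm
# that ranks are spread across both machines before attempting real training.
# Assumed launch (node1/node2 are placeholder hostnames):
#   mpirun -np 4 -H node1:2,node2:2 python check_cluster.py
import socket
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()} running on {socket.gethostname()}")
```

If the output shows ranks reporting both hostnames, the multi-node environment is ready for an actual training job.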
