Transfer learning with multiple GPUs

I’ve got a program that runs on multiple GPUs, and I would like to use the saved model for transfer learning on a geometrically similar case (different enough that I believe the parametric approach won’t be accurate).

When running the modified geometry I include the saved model “flow_network.0.pth” in an otherwise empty outputs folder. When examining the output, I can see that the model is loaded:

Success loading model: outputs/FreeStream/flow_network.0.pth

However, I can also see that it is not loading the model for other GPUs:

model flow_network.1.pth not found

When including the model, is the best practice simply to copy the saved model once for each GPU you plan to use? Meaning, ‘flow_network.0.pth’ gets copied to ‘flow_network.1.pth’, ‘flow_network.2.pth’, …, ‘flow_network.n.pth’.
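As a concrete example of what I mean, a rough sketch of that copy step (assuming my outputs path above and an 8-GPU run):

```python
# Duplicate the rank-0 checkpoint so every process finds a file with its own
# rank suffix. Path and GPU count match my setup; adjust as needed.
import shutil

n_gpus = 8
src = "outputs/FreeStream/flow_network.0.pth"
for rank in range(1, n_gpus):
    shutil.copyfile(src, f"outputs/FreeStream/flow_network.{rank}.pth")
```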

I’ve been operating on the assumption that the model weights are combined at the end of training, and that this is why only one model is saved. If that is not the case, is there a reason I only see one final model when running in parallel? At the moment I run on 8 GPUs, but only ‘flow_network.0.pth’ is saved at the end.

Hi @patterson

Late response, but putting this here for future reference.

Modulus only saves the root process (rank 0) checkpoint, since the weights are identical across processes; hence only flow_network.0.pth is written, which saves some storage. On load you are right: by default, Modulus does try to load a checkpoint for each GPU.

However, PyTorch DDP only requires you to load the weights on process 0 because it will automatically broadcast process 0’s model weights to all other processes at the beginning of training. Quoted from the user guide:

The DDP constructor takes a reference to the local module, and broadcasts state_dict() from the process with rank 0 to all other processes in the group to make sure that all model replicas start from the exact same state.
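As a rough illustration of what that means in plain PyTorch (this is not Modulus code; the small network is a stand-in for the flow network, and the checkpoint path is the one from your post), loading the checkpoint on rank 0 alone is enough because the DDP constructor broadcasts those weights to every other rank:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Assumes launch via torchrun, which sets RANK / LOCAL_RANK / WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in architecture; the real flow network would go here.
    model = nn.Sequential(nn.Linear(3, 128), nn.SiLU(), nn.Linear(128, 1)).cuda()

    # Load the saved weights on rank 0 only; the other ranks keep their
    # freshly initialized weights for now.
    ckpt = "outputs/FreeStream/flow_network.0.pth"
    if rank == 0 and os.path.isfile(ckpt):
        model.load_state_dict(torch.load(ckpt, map_location="cuda"))

    # The DDP constructor broadcasts rank 0's state_dict to all other ranks,
    # so every replica starts training from the loaded checkpoint.
    ddp_model = DDP(model, device_ids=[local_rank])

    # ... transfer-learning training loop goes here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```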

Manually creating the other checkpoint files as you suggest works fine but should not be required.

Great, that makes it easier for me. Thanks for the info.
