I have a question about using multiple optimizers when training my neural network: is it possible to train the model for 15,000 steps with Adam and then continue training for 10,000 steps with BFGS, for instance?
How can this be implemented in the config file or the Python file?
Please see the following thread for more information. The short answer is that you will need to stop training and then start again with the other optimizer / a different config. This should be cleanly achievable using the Hydra config.
We didn't anticipate that many people would be interested in trying multiple optimizers, so the checkpointed optimizer state is always loaded if present. Try renaming that file; there is more information later in that thread (the relevant part is towards the end of the post).
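For example, after the first (Adam) run finishes, you could rename the saved optimizer state (e.g. optim_checkpoint.pth) in the output directory and then launch a second run whose config selects L-BFGS. Below is a minimal sketch of such a second-stage config, assuming the usual Modulus Sym Hydra layout; the architecture, scheduler, and step count are placeholders, not values from this thread.

```yaml
# Hypothetical second-stage config (illustrative values only).
# Run 1 used "optimizer: adam"; after renaming optim_checkpoint.pth so the
# old optimizer state is not restored, run 2 selects lbfgs instead.
defaults:
  - modulus_default
  - arch:
      - fully_connected
  - scheduler: tf_exponential_lr
  - optimizer: lbfgs        # was "adam" in the first run
  - loss: sum
  - _self_

training:
  max_steps: 10000          # training budget for the second stage
```

As the logs below show, the model weights (the *.0.pth files) are still restored from the same output directory; only the optimizer state is skipped.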
I tried renaming the optim_checkpoint.pth file and also set max_steps to 15,000 (Adam had previously run for 50,000 steps); however, this was the output:
[00:52:09] - JIT using the NVFuser TorchScript backend
[00:52:09] - JitManager: {'_enabled': True, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[00:52:09] - GraphManager: {'_func_arch': False, '_debug': False, '_func_arch_allow_partial_hessian': True}
[00:52:50] - Installed PyTorch version 1.13.0+cu117 is not TorchScript supported in Modulus. Version 1.13.0a0+d321be6 is officially supported.
[00:52:50] - attempting to restore from: outputs/wave_2d_homo_CFC_FN
[00:52:50] - optimizer checkpoint not found
[00:52:50] - Success loading model: outputs/wave_2d_homo_CFC_FN/source_network.0.pth
[00:52:50] - Success loading model: outputs/wave_2d_homo_CFC_FN/wave_network.0.pth
[00:53:03] - lbfgs optimizer selected. Setting max_steps to 0
[00:53:11] - [step: 0] lbfgs optimization in running
[00:57:02] - lbfgs optimization completed after 1000 steps
[00:57:02] - [step: 0] record constraint batch time: 1.420e-01s
[00:57:18] - [step: 0] record validators time: 1.599e+01s
[00:57:27] - [step: 0] record inferencers time: 8.413e+00s
[00:57:33] - [step: 0] saved checkpoint to outputs/wave_2d_homo_CFC_FN
[00:57:33] - [step: 0] loss: 1.784e-02
[00:57:33] - [step: 0] reached maximum training steps, finished training!
Why does it say that the max number of steps is 0 while it was set to 15,000 in the config file?
I also tried creating two different config files with different names. The output was still the same.
This is because L-BFGS does not run more than one training iteration. Rather, within that single training iteration, L-BFGS performs multiple optimization iterations. You can control how many optimization iterations L-BFGS uses via the Hydra configs, or by looking at the PyTorch API.
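For instance, in the Hydra config you can select the lbfgs optimizer group and override its max_iter value, which is passed through to torch.optim.LBFGS. A hedged sketch, assuming the usual Modulus Sym config conventions (the values here are illustrative):

```yaml
# Hypothetical config.yaml fragment (illustrative only).
defaults:
  - modulus_default
  - optimizer: lbfgs   # L-BFGS runs as a single training step in Modulus
  - _self_             # keep _self_ last so the override below takes effect

optimizer:
  max_iter: 15000      # inner L-BFGS optimization iterations within that one step
```

Note that torch.optim.LBFGS can also stop before max_iter is reached when its internal stopping criteria (tolerance_grad / tolerance_change) are satisfied, so the reported iteration count may be lower than the configured maximum.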
In my model, I'm trying to use Adam first and then switch to BFGS for the optimizer. I trained my model with Adam for 30,000 steps and renamed my optim_checkpoint.pth file as you mentioned. Then I changed max_iter = 15000 in the config.yaml, but it only ran 957 iterations and then finished. May I know where the problem is?
Here is the output:
[11:06:38] - JitManager: {'_enabled': False, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[11:06:38] - GraphManager: {'_func_arch': True, '_debug': False, '_func_arch_allow_partial_hessian': True}
[11:09:46] - Arch Node: flow_network has been converted to a FuncArch node.
[12:11:04] - Arch Node: flow_network has been converted to a FuncArch node.
integral continuity implemented.
[12:11:05] - Arch Node: flow_network has been converted to a FuncArch node.
[12:11:05] - Arch Node: flow_network has been converted to a FuncArch node.
[12:11:19] - Arch Node: flow_network has been converted to a FuncArch node.
[12:11:33] - Arch Node: flow_network has been converted to a FuncArch node.
[12:11:47] - Arch Node: flow_network has been converted to a FuncArch node.
[12:11:47] - attempting to restore from: outputs/FX63-180_2_element_airfoil_varyAoA_flow
[12:11:47] - optimizer checkpoint not found
[12:11:47] - Success loading model: outputs/FX63-180_2_element_airfoil_varyAoA_flow/flow_network.0.pth
[12:11:49] - lbfgs optimizer selected. Setting max_steps to 0
[12:11:51] - [step: 0] lbfgs optimization in running
[12:16:20] - lbfgs optimization completed after 957 steps
[12:16:20] - [step: 0] record constraint batch time: 5.098e-01s
[12:16:21] - [step: 0] record validators time: 7.620e-01s
Default plotter can only handle <=2 input dimensions, passing
Default plotter can only handle <=2 input dimensions, passing
[12:16:22] - [step: 0] record inferencers time: 5.885e-01s
[12:16:22] - [step: 0] record monitor time: 2.029e-01s
[12:16:36] - [step: 0] saved checkpoint to outputs/FX63-180_2_element_airfoil_varyAoA_flow
[12:16:36] - [step: 0] loss: 5.590e-03
[12:16:36] - [step: 0] reached maximum training steps, finished training!