Applying Multiple Optimizers in Modulus

cpe.sk · March 29, 2023, 7:51pm

I have a question about using multiple optimizers when training my neural network: Is it possible to train the model for 15,000 steps with adam and then continue training for 10,000 steps with BFGS for instance?
How can this be implemented in the config file or the python file?

Please assist.

ngeneva · March 29, 2023, 9:36pm

Hi @cpe.sk

Please see the following thread for information on this. The short answer is you will need to stop training, then start again with the other optimizer / different config. Should be cleanly achievable using Hydra config.

cpe.sk · March 29, 2023, 11:01pm

Thanks for the prompt reply and great support!
I tried training my model for 50,000 steps using adam:

defaults :
  - modulus_default
  - arch:
      - fully_connected
      - conv_fully_connected
      - multiscale_fourier
      - multiplicative_fourier
      - highway_fourier
      - fourier
      - modified_fourier
  - scheduler: exponential_lr
  - optimizer: adam
  - loss: sum
  - _self_

arch:
  conv_fully_connected:
    adaptive_activations: True
  fourier:
    adaptive_activations: True
    frequencies: ("gaussian", 0.25, 5)
  modified_fourier:
    adaptive_activations: True
    frequencies: ("gaussian", 0.25, 4)
  multiscale_fourier:
    adaptive_activations: True
    frequencies: ("gaussian", 0.25, 4),
  highway_fourier:
      adaptive_activations: True
      frequencies: ("axis", [i for i in range(12)])


save_filetypes : "vtk,npz"


scheduler:
  gamma: 0.99998718


optimizer:
  lr: 2.e-4

training:
  rec_results_freq: 1000
  max_steps : 50000

batch_size:
  IC: 1024
  BC: 1024
  interior: 4096

Then I changed the file to:

defaults :
  - modulus_default
  - arch:
      - fully_connected
      - conv_fully_connected
      - multiscale_fourier
      - multiplicative_fourier
      - highway_fourier
      - fourier
      - modified_fourier
  - scheduler: exponential_lr
  - optimizer: bfgs
  - loss: sum
  - _self_

arch:
  conv_fully_connected:
    adaptive_activations: True
  fourier:
    adaptive_activations: True
    frequencies: ("gaussian", 0.25, 5)
  modified_fourier:
    adaptive_activations: True
    frequencies: ("gaussian", 0.25, 4)
  multiscale_fourier:
    adaptive_activations: True
    frequencies: ("gaussian", 0.25, 4),
  highway_fourier:
      adaptive_activations: True
      frequencies: ("axis", [i for i in range(12)])


save_filetypes : "vtk,npz"


scheduler:
  gamma: 0.99998718
optimizer:
  lr: 2.e-4

training:
  rec_results_freq: 1000
  max_steps : 15000

batch_size:
  IC: 1024
  BC: 1024
  interior: 4096

however, no additional steps of training were performed. Running the code only restores adam training and then prints out that the training is done.

Please advise if I’m doing something wrong.

ngeneva · March 30, 2023, 2:19am

Hi @cpe.sk

We didn’t anticipate many people would be interested in trying multiple optimizers so the checkpointed optimizer state is always loaded if present. Try renaming that file, some info later in that thread (towards the end of the post is the relevant part):

cpe.sk · March 30, 2023, 5:04am

Thanks a lot, @ngeneva!

I tried renaming the optim_checkpoint.pth file, I also set max_steps to 15,000 (adam performed 50,000 before) however, this was the output:

[00:52:09] - JIT using the NVFuser TorchScript backend
[00:52:09] - JitManager: {'_enabled': True, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[00:52:09] - GraphManager: {'_func_arch': False, '_debug': False, '_func_arch_allow_partial_hessian': True}
[00:52:50] - Installed PyTorch version 1.13.0+cu117 is not TorchScript supported in Modulus. Version 1.13.0a0+d321be6 is officially supported.
[00:52:50] - attempting to restore from: outputs/wave_2d_homo_CFC_FN
[00:52:50] - optimizer checkpoint not found
[00:52:50] - Success loading model: outputs/wave_2d_homo_CFC_FN/source_network.0.pth
[00:52:50] - Success loading model: outputs/wave_2d_homo_CFC_FN/wave_network.0.pth
[00:53:03] - lbfgs optimizer selected. Setting max_steps to 0
[00:53:11] - [step:          0] lbfgs optimization in running

[00:57:02] - lbfgs optimization completed after 1000 steps
[00:57:02] - [step:          0] record constraint batch time:  1.420e-01s
[00:57:18] - [step:          0] record validators time:  1.599e+01s
[00:57:27] - [step:          0] record inferencers time:  8.413e+00s
[00:57:33] - [step:          0] saved checkpoint to outputs/wave_2d_homo_CFC_FN
[00:57:33] - [step:          0] loss:  1.784e-02
[00:57:33] - [step:          0] reached maximum training steps, finished training!

Why does it say that the max number of steps is 0 while it was set to 15,000 in the config file?
I also tried creating two different config files with different names. The output was still the same.

Please help me set up this problem.

ngeneva · March 30, 2023, 3:46pm

Hi @cpe.sk

This is because L-BFGS does not run more than 1 training iteration. Rather a single training iteration, L-BFGS has multiple optimization iterations. You can control how many optimization iterations BFGS uses via the Hydra configs / looking at the PyTorch API.

cpe.sk · March 30, 2023, 5:37pm

Thank you @ngeneva for the suggestion. I tried changing max_iter by adding this to the config file:

optimizer:
  lr: 2.e-4
  max_iter: 15000


training:
  rec_results_freq: 1000
  max_steps : 15000

but it had the same result as finishing the training after one step.

[13:40:28] - JIT using the NVFuser TorchScript backend
[13:40:28] - JitManager: {'_enabled': True, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[13:40:28] - GraphManager: {'_func_arch': False, '_debug': False, '_func_arch_allow_partial_hessian': True}
[13:40:51] - Installed PyTorch version 1.13.0 is not TorchScript supported in Modulus. Version 1.13.0a0+d321be6 is officially supported.
[13:40:51] - attempting to restore from: outputs\wave_2d_homo_CFC_FN
[13:40:51] - optimizer checkpoint not found
[13:40:52] - Success loading model: outputs\wave_2d_homo_CFC_FN\source_network.0.pth
[13:40:52] - Success loading model: outputs\wave_2d_homo_CFC_FN\wave_network.0.pth
[13:41:04] - lbfgs optimizer selected. Setting max_steps to 0
[13:41:08] - [step:          0] lbfgs optimization in running
[13:49:06] - lbfgs optimization completed after 2551 steps
[13:49:06] - [step:          0] record constraint batch time:  1.012e-01s
[13:49:23] - [step:          0] record validators time:  1.711e+01s
[13:49:31] - [step:          0] record inferencers time:  7.817e+00s
[13:49:33] - [step:          0] saved checkpoint to outputs\wave_2d_homo_CFC_FN
[13:49:33] - [step:          0] loss:  2.125e-02
[13:49:33] - [step:          0] reached maximum training steps, finished training!

I tried another optimizer like sgd, and it worked and continued the training steps. Unfortunately, it did not for bfgs.
Please assist.

lowrylhr · April 1, 2023, 8:47am

Hi, @ngeneva

I met the same problem.

In my model, I’m trying to use ADAM 1st and then switch to BFGS for the optimizer. I tried to train my model using ADAM for 30000 steps. I renamed my optim_checkpoint.pth file as you mentioned. Then I tried to change max_iter = 15000 in the config.yaml. But it seems that it only run 957 iterations and then finished. May I know where the problem is.
Here is the config:

defaults :
  - modulus_default
  #- arch:
  #    - fully_connected
  - scheduler: tf_exponential_lr
  - optimizer: bfgs
  #- loss: grad_norm
  - loss: sum
  - _self_

optimizer:
  max_iter: 15000

scheduler:
  decay_rate: 0.95
  decay_steps: 2000

training:
  rec_results_freq : 1000 
  rec_constraint_freq: 10000
  max_steps : 30000

Here is the log record:

[11:06:38] - JitManager: {'_enabled': False, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[11:06:38] - GraphManager: {'_func_arch': True, '_debug': False, '_func_arch_allow_partial_hessian': True}
[11:09:46] - Arch Node: flow_network has been converted to a FuncArch node.
[12:11:04] - Arch Node: flow_network has been converted to a FuncArch node.
integral continuity implemented.
[12:11:05] - Arch Node: flow_network has been converted to a FuncArch node.
[12:11:05] - Arch Node: flow_network has been converted to a FuncArch node.
[12:11:19] - Arch Node: flow_network has been converted to a FuncArch node.
[12:11:33] - Arch Node: flow_network has been converted to a FuncArch node.
[12:11:47] - Arch Node: flow_network has been converted to a FuncArch node.
[12:11:47] - attempting to restore from: outputs/FX63-180_2_element_airfoil_varyAoA_flow
[12:11:47] - optimizer checkpoint not found
[12:11:47] - e[32mSuccess loading model: e[0moutputs/FX63-180_2_element_airfoil_varyAoA_flow/flow_network.0.pth[12:11:49] - lbfgs optimizer selected. Setting max_steps to 0
[12:11:51] - [step:          0] lbfgs optimization in running
[12:16:20] - lbfgs optimization completed after 957 steps
[12:16:20] - [step:          0] record constraint batch time:  5.098e-01s
[12:16:21] - [step:          0] record validators time:  7.620e-01s
Default plotter can only handle <=2 input dimensions, passing
Default plotter can only handle <=2 input dimensions, passing
[12:16:22] - [step:          0] record inferencers time:  5.885e-01s
[12:16:22] - [step:          0] record monitor time:  2.029e-01s
[12:16:36] - [step:          0] saved checkpoint to outputs/FX63-180_2_element_airfoil_varyAoA_flow
[12:16:36] - [step:          0] loss:  5.590e-03
[12:16:36] - [step:          0] reached maximum training steps, finished training!

Thank you!