Stop the training when training loss reaches a tolerance value

Currently, only these stopping criteria are supported:

 stop_criterion:
     - metric: 'l2_relative_error_u'
     - min_delta: 0.1
     - patience: 5000
     - mode: 'min'
     - freq: 2000
     - strict: true

I want to stop the training when the training loss goes below a certain limit. In simple terms:

while training:
    if training_loss < tol:
        break

I am using a bare-metal installation of NVIDIA Modulus, so I can edit the source code if a slight modification is needed to achieve this. I can see that in trainer.py a simple break is used to stop the training when the stopping criterion is met or when the maximum number of training iterations is reached. https://gitlab.com/nvidia/modulus/modulus/-/blob/release_22.09/modulus/trainer.py#L669

I want to add the if condition here at the start of each iteration. https://gitlab.com/nvidia/modulus/modulus/-/blob/release_22.09/modulus/trainer.py#L496
How do I access the training loss? Is it a part of the dictionary losses?

I also need to save the iteration number at which the training loss meets this criterion.


Hi @prakhar_sharma

Yes, losses is a dictionary of loss values computed here. So you can add any logic involving your losses after that point to exit the training loop (for example, set an exit flag, or set step = self.max_steps + 1 to break the loop).

(For details on what that loss dictionary contains, you can see the trainer iterating over the losses for logging here)
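
As a rough illustration of that idea, here is a minimal standalone sketch, not the actual Modulus trainer code. It assumes the losses dictionary maps loss names to scalar values and that the training loss is their sum; the names total_loss, train, compute_losses and fake_losses are hypothetical and only stand in for the real trainer internals.

```python
from typing import Callable, Dict, Optional


def total_loss(losses: Dict[str, float]) -> float:
    """Sum the per-term losses; float() also accepts 0-d tensors."""
    return sum(float(v) for v in losses.values())


def train(
    max_steps: int,
    tol: float,
    compute_losses: Callable[[int], Dict[str, float]],
) -> Optional[int]:
    """Run a toy loop; return the step where the loss fell below tol, else None."""
    stop_step = None
    for step in range(max_steps + 1):
        # Hypothetical stand-in for the trainer's loss computation.
        losses = compute_losses(step)
        if total_loss(losses) < tol:
            stop_step = step  # remember the iteration where tol was reached
            break
    return stop_step


def fake_losses(step: int) -> Dict[str, float]:
    """Toy losses that decay with the step count, purely for illustration."""
    return {"pde": 1.0 / (step + 1), "bc": 0.5 / (step + 1)}


stop_step = train(max_steps=1000, tol=0.011, compute_losses=fake_losses)
print(stop_step)
```

In the real trainer, the same check would go right after the losses are computed, and the recorded step could be logged or written to a file before breaking out of the loop.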
