Stop the training when training loss reaches a tolerance value

prakhar_sharma · February 18, 2023, 2:38am

Currently, only these stopping criteria are supported:

 stop_criterion:
     metric: 'l2_relative_error_u'
     min_delta: 0.1
     patience: 5000
     mode: 'min'
     freq: 2000
     strict: true

I want to stop the training when the training loss drops below a certain tolerance. In simple terms:

while training:
    if training_loss < tol:
        break

I am running a bare-metal installation of NVIDIA Modulus, so I can edit the source code if a small modification is needed to achieve this. In trainer.py, I can see that a simple break stops the training when the stopping criterion is met or when the maximum number of training iterations is reached: https://gitlab.com/nvidia/modulus/modulus/-/blob/release_22.09/modulus/trainer.py#L669

I want to add the if condition here, at the start of each iteration: https://gitlab.com/nvidia/modulus/modulus/-/blob/release_22.09/modulus/trainer.py#L496
How do I access the training loss? Is it part of the losses dictionary?

I also need to save the iteration number at which the training loss meets this criterion.

ngeneva · February 25, 2023, 3:00am

Hi @prakhar_sharma

Yes, losses is a dictionary of loss values computed here. You can add any logic involving your losses after that point to exit the training loop (for example, set an exit flag, or set step = self.max_steps + 1 to break the loop).

(For more information about the contents of that loss dictionary, you can see the trainer iterating over the losses for logging here.)
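
As a rough illustration of the suggestion above, here is a minimal standalone sketch of a tolerance check on a losses dictionary. It does not use the Modulus API: train_until_tolerance, compute_losses, and fake_losses are hypothetical names, and it assumes the total training loss is the sum of the dictionary values (in the real trainer these would likely be tensors, so something like float(v) or .item() may be needed). It also records the iteration at which the tolerance was first met.

```python
from typing import Callable, Dict, Optional, Tuple


def train_until_tolerance(
    compute_losses: Callable[[int], Dict[str, float]],  # hypothetical stand-in for the trainer's loss computation
    max_steps: int,
    tol: float,
) -> Tuple[Optional[int], float]:
    """Run up to max_steps iterations; stop once the summed loss drops below tol.

    Returns the step at which the tolerance was met (None if never met)
    and the last total loss that was computed.
    """
    stop_step = None
    total = float("inf")
    for step in range(max_steps + 1):
        losses = compute_losses(step)
        # Assumption: the training loss is the sum of all entries in the dictionary.
        total = sum(float(v) for v in losses.values())
        if total < tol:
            stop_step = step  # remember where the criterion was met
            break
    return stop_step, total


def fake_losses(step: int) -> Dict[str, float]:
    """Toy losses that halve every step, used only to exercise the loop."""
    return {"pde": 0.5 ** step, "bc": 0.5 ** (step + 1)}


stop_step, final_loss = train_until_tolerance(fake_losses, max_steps=1000, tol=0.01)
print(f"stopped at step {stop_step} with total loss {final_loss}")
```

Inside trainer.py the same check would go right after the losses are computed; instead of break, setting step = self.max_steps + 1 as suggested above should have a similar effect, and stop_step could be written to the log or a file.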
