"Loss went to nans" when training lowRe model

Hello,

I tried to implement the lowRe model for a turbulent 2D channel case. When I start the training process I get a “loss went to nans” error on the first iteration. I tried to determine the equation that causes this error and got the following output:

continuity tensor(78.7679, device='cuda:0', grad_fn=<DivBackward0>)
momentum_y tensor(238.5318, device='cuda:0', grad_fn=<DivBackward0>)
momentum_x tensor(198.9194, device='cuda:0', grad_fn=<DivBackward0>)
ep_equation tensor(nan, device='cuda:0', grad_fn=<DivBackward0>)
k_equation tensor(49.2178, device='cuda:0', grad_fn=<DivBackward0>)
p tensor(199.8294, device='cuda:0', grad_fn=<DivBackward0>)
p_init tensor(0.0016, device='cuda:0', grad_fn=<DivBackward0>)
u_init tensor(2770.5410, device='cuda:0', grad_fn=<DivBackward0>)
k_init tensor(5.1750, device='cuda:0', grad_fn=<DivBackward0>)
ep_init tensor(86.1649, device='cuda:0', grad_fn=<DivBackward0>)
v_init tensor(0.0023, device='cuda:0', grad_fn=<DivBackward0>)
[09:32:47] - loss went to Nans

So the error appears only in the epsilon equation. I also attach some of my scripts that show how the PDEs are implemented:

code.zip (3.6 KB)

Thanks in advance!

Unfortunately I cannot help, but I would like to ask you: I am also plagued with “loss went to nans”; how did you check which equation gives rise to the NaN? I would like to do that in my case.

Sorry, and thanks in advance.

PS: I have noticed that ep_network is the only one defined via FourierNetArch instead of instantiate_arch; I don’t know if this may be somehow involved.

Hi @alessandro.bombini.fi ,

To get this output you have to modify the source code. Go to modulus/sym/trainer.py; around line 562 there is something like if torch.isnan(loss):. Inside this if condition you can print the elements of the losses dictionary (it contains all the losses in your domain; the keys are the loss names and the values are the corresponding torch tensors). I have something like this:

                # check for nans in loss
                if torch.isnan(loss):
                    for k, v in losses.items():
                        print(k, v)
                    self.log.error("loss went to Nans")
                    break

Actually, I fixed my error: the problem was caused by the sympy.simplify() function. So if you have an equation that produces NaN values, try removing the simplify() call from that equation.
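For illustration, a minimal sketch of what that change looks like when assembling a custom PDE’s equations dictionary. The symbols and the residual below are placeholders, not the actual lowRe k-epsilon terms from code.zip:

    from sympy import Function, Symbol, simplify

    x, y = Symbol("x"), Symbol("y")
    k = Function("k")(x, y)
    ep = Function("ep")(x, y)

    # Placeholder residual built from the model variables
    residual = ep.diff(x) + ep ** 2 / k

    # NaN-prone: simplify() may rearrange the expression into a
    # numerically less stable form (e.g. different intermediate divisions)
    equations = {"ep_equation": simplify(residual)}

    # Safer: store the expression as written
    equations = {"ep_equation": residual}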

Good luck!


That is a really helpful insight. Thank you very much.

Do you think it would be better to use the logger instead of the built-in print function?
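For what it’s worth, a minimal sketch of the same debug block using the trainer’s logger instead of print, assuming self.log is the logger already used for the “loss went to Nans” message in trainer.py:

    # check for nans in loss, logging each loss term instead of printing it
    if torch.isnan(loss):
        for key, value in losses.items():
            self.log.error(f"{key}: {value.detach().item()}")
        self.log.error("loss went to Nans")
        break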

I have run into a similar issue, but my setup was way more complicated: a 3D transient k-epsilon model. I first tried to remove simplify, but that didn’t help much (crashes occurred less often, but they still happened). Further investigation brought me to the conclusion that it is an exploding gradient problem. Decreasing the learning rate helped me here; so far no crash (600k steps). This is what I added to config.yaml:

optimizer:
  lr: 5e-5

Depending on how complex the training is, different learning rates might work better.
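If you want to confirm the exploding-gradient diagnosis before (or in addition to) lowering the learning rate, you can log the global gradient norm after the backward pass. A rough sketch in plain PyTorch; model here is just a placeholder for whichever torch.nn.Module the trainer is optimizing:

    # Hypothetical gradient-norm check; run after loss.backward().
    # "model" is a placeholder for the network being trained.
    total_norm = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total_norm += p.grad.detach().norm(2).item() ** 2
    total_norm = total_norm ** 0.5
    print(f"global grad norm: {total_norm:.3e}")

A norm that grows by orders of magnitude right before the NaN is a good hint that the learning rate is the culprit.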