Hello,

I tried to implement the lowRe model for a turbulent 2D channel case. When I start the training process I get a “loss went to Nans” error on the first iteration. I tried to determine the equation that causes this error and got the following output:

```
continuity tensor(78.7679, device='cuda:0', grad_fn=<DivBackward0>)
momentum_y tensor(238.5318, device='cuda:0', grad_fn=<DivBackward0>)
momentum_x tensor(198.9194, device='cuda:0', grad_fn=<DivBackward0>)
ep_equation tensor(nan, device='cuda:0', grad_fn=<DivBackward0>)
k_equation tensor(49.2178, device='cuda:0', grad_fn=<DivBackward0>)
p tensor(199.8294, device='cuda:0', grad_fn=<DivBackward0>)
p_init tensor(0.0016, device='cuda:0', grad_fn=<DivBackward0>)
u_init tensor(2770.5410, device='cuda:0', grad_fn=<DivBackward0>)
k_init tensor(5.1750, device='cuda:0', grad_fn=<DivBackward0>)
ep_init tensor(86.1649, device='cuda:0', grad_fn=<DivBackward0>)
v_init tensor(0.0023, device='cuda:0', grad_fn=<DivBackward0>)
[09:32:47] - loss went to Nans
```

So the error appears only in the epsilon equation. I also attach some of my scripts that describe the implementation of the PDEs:

code.zip (3.6 KB)

Thanks in advance!

Unfortunately I cannot help, but I would ask you: I am also plagued with “loss went to Nans”; how did you check which equation gives rise to the NaN? I would like to do that in my case.

Sorry, and thanks in advance.

PS: I have noticed that the ep_network is the only one defined via FourierNetArch instead of instantiate_arch; I don’t know if this may somehow be involved.

Hi @alessandro.bombini.fi ,

To get such output you have to modify the source code. In modulus/sym/trainer.py, around line 562, there is a check like `if torch.isnan(loss):`. Inside this if condition you can print the elements of the `losses` dictionary (this dictionary contains all the losses in your domain; keys are the loss names, values are the corresponding torch tensors). I have something like this:

```
# check for nans in loss
if torch.isnan(loss):
    for k, v in losses.items():
        print(k, v)
    self.log.error("loss went to Nans")
    break
```

Actually, I fixed my error: the problem was caused by the `sympy.simplify()` function. So if you have an equation that produces NaN values, try removing the simplify call from that equation.
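To illustrate the fix, here is a minimal sketch of the change in a Modulus-style PDE class. The symbols and the `ep_equation` residual below are hypothetical stand-ins, not the thread author's actual equations; the point is simply to store the raw sympy expression instead of passing it through `sympy.simplify()`:

```python
import sympy as sp

x, y = sp.symbols("x y")
u = sp.Function("u")(x, y)

# Hypothetical residual standing in for the real epsilon-equation terms
residual = u.diff(x) + u.diff(y)

equations = {}
# Problematic version (simplify may rewrite the expression into a
# numerically unstable form at evaluation time):
# equations["ep_equation"] = sp.simplify(residual)

# Fix suggested in the thread: keep the raw expression
equations["ep_equation"] = residual
```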

Good luck!


That is a really helpful insight. Thank you very much.

Do you think it would be better to use the logger instead of the built-in `print` function?
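Using the logger would at least keep the per-term output in the same log as the "loss went to Nans" message. A small sketch, assuming the losses dict looks like the one printed above (here with plain floats standing in for the torch tensors, and a hypothetical `report_nan_losses` helper):

```python
import io
import logging
import math

# Stand-in for the trainer's losses dict (names taken from the output
# in this thread; values would be torch tensors in Modulus)
losses = {"continuity": 78.77, "ep_equation": math.nan, "k_equation": 49.22}

def report_nan_losses(losses, log):
    # Log every term at ERROR level so the culprit appears in the run
    # log right next to the "loss went to Nans" message
    for name, value in losses.items():
        log.error("%s = %s", name, value)
    log.error("loss went to Nans")

# Route the logger to a string buffer just to demonstrate the output
stream = io.StringIO()
log = logging.getLogger("nan_demo")
log.addHandler(logging.StreamHandler(stream))

report_nan_losses(losses, log)
```

In the trainer itself you would call this with `self.log` instead of building a new logger.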

I have run into a similar issue, but my setup was way more complicated: a 3D transient k-epsilon model. I first tried to remove simplify, but that didn’t help much (crashes occurred less often, but they still happened). Further investigation brought me to the conclusion that it is an exploding gradient problem. Decreasing the learning rate helped me here; so far no crash (600k steps). This is what I added to config.yaml:

```
optimizer:
  lr: 5e-5
```

Different learning rates might work for more or less complex trainings.
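One way to confirm an exploding-gradient diagnosis before tuning the learning rate is to track the total gradient norm between steps. A minimal PyTorch sketch (the `grad_norm` helper and the tiny demo model are illustrative, not part of Modulus):

```python
import torch

def grad_norm(model):
    # Total L2 norm over all parameter gradients; if this value grows
    # rapidly from step to step, gradients are exploding and a smaller
    # learning rate (or gradient clipping) is worth trying
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5

# Tiny demo model; in practice you would pass the trainer's networks
model = torch.nn.Linear(2, 1)
loss = model(torch.ones(4, 2)).pow(2).mean()
loss.backward()
print(grad_norm(model))
```

Logging this value every N steps makes it easy to see whether the crash is preceded by a blow-up in the gradients or appears out of nowhere.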