"Loss went to nans" when training lowRe model

Hello,

I tried to implement the lowRe model for a turbulent 2D channel case. When I start the training process I get a “loss went to nans” error on the first iteration. I tried to determine the equation that causes this error and got the following output:

continuity tensor(78.7679, device='cuda:0', grad_fn=<DivBackward0>)
momentum_y tensor(238.5318, device='cuda:0', grad_fn=<DivBackward0>)
momentum_x tensor(198.9194, device='cuda:0', grad_fn=<DivBackward0>)
ep_equation tensor(nan, device='cuda:0', grad_fn=<DivBackward0>)
k_equation tensor(49.2178, device='cuda:0', grad_fn=<DivBackward0>)
p tensor(199.8294, device='cuda:0', grad_fn=<DivBackward0>)
p_init tensor(0.0016, device='cuda:0', grad_fn=<DivBackward0>)
u_init tensor(2770.5410, device='cuda:0', grad_fn=<DivBackward0>)
k_init tensor(5.1750, device='cuda:0', grad_fn=<DivBackward0>)
ep_init tensor(86.1649, device='cuda:0', grad_fn=<DivBackward0>)
v_init tensor(0.0023, device='cuda:0', grad_fn=<DivBackward0>)
[09:32:47] - loss went to Nans

So the error appears only in the epsilon equation. I also attach some of my scripts that show how the PDEs are implemented:

code.zip (3.6 KB)

Thanks in advance!

Unfortunately I cannot help, but I would like to ask you: I am also plagued with “loss went to nans”; how did you check which equation gives rise to the NaN? I would like to do that in my case.

Sorry, and thanks in advance.

PS: I have noticed that ep_network is the only one defined via FourierNetArch instead of instantiate_arch; I don’t know if this may be somehow involved.

Hi @alessandro.bombini.fi ,

To get this output you have to modify the source code. Go to modulus/sym/trainer.py; around line 562 there is something like if torch.isnan(loss):. Inside this if condition you can print the elements of the losses dictionary (it contains all the losses in your domain; the keys are the loss names and the values are the corresponding torch tensors). I have something like this:

                # check for nans in loss
                if torch.isnan(loss):
                    for k, v in losses.items():
                        print(k, v)
                    self.log.error("loss went to Nans")
                    break

Actually, I fixed my error: the problem was caused by the sympy.simplify() function. So if you have an equation that produces NaN values, try removing the simplify() call from that equation.
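For illustration, a minimal sketch of what that change looks like when assembling a custom PDE’s equations dictionary. The symbols and the residual below are placeholders, not the actual lowRe k-epsilon terms from code.zip:

    from sympy import Function, Symbol, simplify

    x, y = Symbol("x"), Symbol("y")
    k = Function("k")(x, y)
    ep = Function("ep")(x, y)

    # Placeholder residual built from the model variables
    residual = ep.diff(x) + ep ** 2 / k

    # NaN-prone: simplify() may rearrange the expression into a
    # numerically less stable form (e.g. different intermediate divisions)
    equations = {"ep_equation": simplify(residual)}

    # Safer: store the expression as written
    equations = {"ep_equation": residual}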

Good luck!


That is a really helpful insight. Thank you very much.

Do you think it would be better to use the logger instead of the built-in print function?
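For what it’s worth, a minimal sketch of the same debug block using the trainer’s logger instead of print, assuming self.log is the logger already used for the “loss went to Nans” message in trainer.py:

    # check for nans in loss, logging each loss term instead of printing it
    if torch.isnan(loss):
        for key, value in losses.items():
            self.log.error(f"{key}: {value.detach().item()}")
        self.log.error("loss went to Nans")
        break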

I have run into a similar issue, but my setup was way more complicated: a 3D transient k-epsilon model. I first tried to remove simplify, but that didn’t help much (crashes occurred less often, but they still happened). Further investigation brought me to the conclusion that it is an exploding gradient problem. Decreasing the learning rate helped me here; so far no crash (600k steps). This is what I added to config.yaml:

optimizer:
  lr: 5e-5

Depending on how complex the training is, different learning rates might work better.
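If you want to confirm the exploding-gradient diagnosis before (or in addition to) lowering the learning rate, you can log the global gradient norm after the backward pass. A rough sketch in plain PyTorch; model here is just a placeholder for whichever torch.nn.Module the trainer is optimizing:

    # Hypothetical gradient-norm check; run after loss.backward().
    # "model" is a placeholder for the network being trained.
    total_norm = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total_norm += p.grad.detach().norm(2).item() ** 2
    total_norm = total_norm ** 0.5
    print(f"global grad norm: {total_norm:.3e}")

A norm that grows by orders of magnitude right before the NaN is a good hint that the learning rate is the culprit.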