The results of ldc_2d.py are not the same as those in SimNet_v21.06_User_Guide

yao1 · June 23, 2021, 4:08pm

I ran the first example, ldc_2d.py, without any modification, but the results I got were not the same as those shown in SimNet_v21.06_User_Guide. The simulation seems to stop with a large error: the solution did not converge, and the run stopped with a total loss of 0.00039. I ran the example with the bare-metal version. More precisely, I installed the environment on the container service of the TWCC service at our center. The container used is the NGC TensorFlow Release 20.11 described in
[TensorFlow Release Notes :: NVIDIA Deep Learning Frameworks Documentation]

ohennigh · June 23, 2021, 8:29pm

Hello, I believe that it's working, but you will need to train it for longer to recreate the loss curves seen in the user guide. This problem is set up to run for 400K iterations, but the plots you have are only for 2.5K. As a test, though, could you run the helmholtz example? It converges much faster, in 20K iterations, and you can plot the results to see if they match the given validation data.

yao1 · June 24, 2021, 1:09am

How can I train it for longer? I used the same settings as in the original ldc_2d.py file. I ran it and stopped at about 2.5K. I have already tried the helmholtz example, and the results look good: it converges to the validation results. The output for all steps shown by the ldc_2d.py example is listed here:
total_loss: 0.00879224
time: 0.07409589767456054
total_loss: 0.0059654717
time: 0.013278253078460693
total_loss: 0.004125621
time: 0.013525300025939942
total_loss: 0.0030642985
time: 0.013485288619995118
total_loss: 0.0033810143
time: 0.013271934986114502
total_loss: 0.002713479
time: 0.013546154499053956
total_loss: 0.0017080866
time: 0.013386952877044677
total_loss: 0.0017460646
time: 0.013371756076812744
total_loss: 0.0013436687
time: 0.013651156425476074
total_loss: 0.001052431
time: 0.021090617179870607
saved to ./network_checkpoint_ldc_2d/
total_loss: 0.0009754368
time: 0.07807020187377929
total_loss: 0.00073354295
time: 0.013308625221252441
total_loss: 0.0010020003
time: 0.013653841018676758
total_loss: 0.0006775806
time: 0.013463542461395264
total_loss: 0.00049137336
time: 0.013459904193878174
total_loss: 0.0006492392
time: 0.013576748371124268
total_loss: 0.00039436677
time: 0.013347928524017333
total_loss: 0.00047523764
time: 0.013303215503692628
total_loss: 0.0005249865
time: 0.013294055461883544
2021-06-24 13:21:55.283895: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1343] No whitelist ops found, nothing to do
total_loss: 0.00035619925
time: 0.04161646842956543
saved to ./network_checkpoint_ldc_2d/
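
To see how far a run like the one above actually got, the `total_loss:` lines can be collected with a short script. This is only a rough sketch: the function names `parse_losses` and `summarize` are made up for illustration, and `print_every=100` is an assumed logging interval that is not stated anywhere in the log, so the step estimate is only as good as that guess.

```python
import re

# Matches lines such as "total_loss: 0.00879224" (also scientific notation).
LOSS_RE = re.compile(r"^total_loss:\s*([0-9.eE+-]+)$")


def parse_losses(log_text):
    """Return the total_loss values in the order they appear in the log."""
    losses = []
    for line in log_text.splitlines():
        match = LOSS_RE.match(line.strip())
        if match:
            losses.append(float(match.group(1)))
    return losses


def summarize(losses, print_every=100):
    """Summarize a loss history.

    print_every is an ASSUMED number of iterations between printed losses;
    it is not read from the log and must be adjusted to the real setting.
    """
    if not losses:
        raise ValueError("no total_loss lines found in the log")
    return {
        "entries": len(losses),
        "approx_steps": len(losses) * print_every,
        "first": losses[0],
        "last": losses[-1],
        "minimum": min(losses),
    }


# A few lines copied from the output above; other lines are ignored.
sample = """total_loss: 0.00879224
time: 0.07409589767456054
total_loss: 0.0059654717
saved to ./network_checkpoint_ldc_2d/
total_loss: 0.00035619925
"""

summary = summarize(parse_losses(sample))
print(summary)
```

Comparing `approx_steps` with the 400K iterations mentioned earlier in the thread gives a quick sense of how much of the intended training has been completed.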

yao1 · June 29, 2021, 2:51am

After stopping the job, I reran python ldc_2d.py and got the following output, including one failure message:
For more information regarding mixed precision training, including how to make automatic mixed precision aware of a custom op type, please see the documentation available here:

2021-06-29 10:48:26.334006: W tensorflow/core/common_runtime/process_function_library_runtime.cc:688] Ignoring multi-device function optimization failure: Invalid argument: Node ‘_arg_continuity_0_3_0_arg’: Node name contains invalid characters
total_loss: 0.00034051613
time: 0.13434922218322753
total_loss: 0.00038129173
time: 0.01298861026763916
2021-06-29 10:48:40.876196: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1343] No whitelist ops found, nothing to do
loss went to Nans

mnabian · June 29, 2021, 6:44pm

The error message suggests that perhaps you are using automatic mixed precision, which is not currently supported in SimNet. Can you please do

export TF_ENABLE_AUTO_MIXED_PRECISION=0

and run this example again?
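
If the example is launched from Python instead of a shell, the same setting can be passed to the child process through its environment. The sketch below assumes the script is called ldc_2d.py and sits in the current directory; the helper names `env_without_amp` and `run_example` are hypothetical, and only the variable name and value come from the suggestion above.

```python
import os
import subprocess
import sys


def env_without_amp(base_env=None):
    """Return a copy of the environment with automatic mixed precision off."""
    env = dict(os.environ if base_env is None else base_env)
    # Same effect as `export TF_ENABLE_AUTO_MIXED_PRECISION=0` in the shell.
    env["TF_ENABLE_AUTO_MIXED_PRECISION"] = "0"
    return env


def run_example(script="ldc_2d.py"):
    """Run the example in a child process that inherits the modified environment.

    The script path is an assumption; adjust it to where the example lives.
    """
    return subprocess.run([sys.executable, script], env=env_without_amp(), check=False)


# Show the value the child process would see; call run_example() to launch it.
print(env_without_amp({})["TF_ENABLE_AUTO_MIXED_PRECISION"])
```

Building a copy of the environment leaves the parent process untouched, so other jobs started from the same session keep their own settings.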

yao1 · July 1, 2021, 2:13am

It works fine now. Thanks. However, it still shows the following message, although it keeps running. Is it OK to ignore this message?

W tensorflow/core/common_runtime/process_function_library_runtime.cc:688] Ignoring multi-device function optimization failure: Invalid argument: Node ‘_arg_continuity_0_3_0_arg’: Node name contains invalid characters

mnabian · July 1, 2021, 3:12pm

Glad to hear it is working now. Yes, you can ignore this warning message.
