The results of ldc_2d.py are not the same as those in SimNet_v21.06_User_Guide

I ran the first example, ldc_2d.py, without any modification, but the results I got did not match those shown in SimNet_v21.06_User_Guide. The simulation seems to stop while the error is still large: the solution had not converged when it stopped, with a total loss of 0.00039. I ran the example with the bare metal version. More precisely, I installed the environment in a container on the TWCC service at our center. The container is NGC TensorFlow Release 20.11, described in
[TensorFlow Release Notes :: NVIDIA Deep Learning Frameworks Documentation]

Hello, I believe that it's working, but you will need to train it for longer to reproduce the loss curves in the user guide. This problem is set up to run for 400K iterations, but your plots cover only 2.5K. As a test, could you run the helmholtz example? It converges much faster, in 20K iterations, and you can plot the results to see whether they match the given validation data.

How can I train it for longer? I used the same settings as in the original ldc_2d.py file. I ran it and stopped at about 2.5K iterations. I have already tried the helmholtz example and the results look good; they converge to the validation results. The output for all steps of the ldc_2d.py example is listed here:
total_loss: 0.00879224
time: 0.07409589767456054
total_loss: 0.0059654717
time: 0.013278253078460693
total_loss: 0.004125621
time: 0.013525300025939942
total_loss: 0.0030642985
time: 0.013485288619995118
total_loss: 0.0033810143
time: 0.013271934986114502
total_loss: 0.002713479
time: 0.013546154499053956
total_loss: 0.0017080866
time: 0.013386952877044677
total_loss: 0.0017460646
time: 0.013371756076812744
total_loss: 0.0013436687
time: 0.013651156425476074
total_loss: 0.001052431
time: 0.021090617179870607
saved to ./network_checkpoint_ldc_2d/
total_loss: 0.0009754368
time: 0.07807020187377929
total_loss: 0.00073354295
time: 0.013308625221252441
total_loss: 0.0010020003
time: 0.013653841018676758
total_loss: 0.0006775806
time: 0.013463542461395264
total_loss: 0.00049137336
time: 0.013459904193878174
total_loss: 0.0006492392
time: 0.013576748371124268
total_loss: 0.00039436677
time: 0.013347928524017333
total_loss: 0.00047523764
time: 0.013303215503692628
total_loss: 0.0005249865
time: 0.013294055461883544
2021-06-24 13:21:55.283895: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1343] No whitelist ops found, nothing to do
total_loss: 0.00035619925
time: 0.04161646842956543
saved to ./network_checkpoint_ldc_2d/
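
To see how far the loss has dropped without reading every line, console output in the format above can be summarized with a short script. This is only a sketch: it assumes the log has been copied into a string (`sample_log` here is an abbreviated copy of the lines above), and `parse_losses` is a hypothetical helper, not part of SimNet.

```python
import re

# Matches lines such as "total_loss: 0.00879224" in the console output.
LOSS_RE = re.compile(r"^total_loss:\s*(\S+)")


def parse_losses(log_text):
    """Return the total_loss values found in a console log, in order."""
    losses = []
    for line in log_text.splitlines():
        match = LOSS_RE.match(line.strip())
        if match:
            losses.append(float(match.group(1)))
    return losses


# Abbreviated copy of the output above; replace with the full captured log.
sample_log = """\
total_loss: 0.00879224
time: 0.07409589767456054
total_loss: 0.0059654717
time: 0.013278253078460693
saved to ./network_checkpoint_ldc_2d/
total_loss: 0.00035619925
time: 0.04161646842956543
"""

losses = parse_losses(sample_log)
print(f"{len(losses)} values, first={losses[0]:.3g}, "
      f"last={losses[-1]:.3g}, min={min(losses):.3g}")
```

Lines that do not start with `total_loss:` (timings, checkpoint messages, TensorFlow log lines) are skipped, so the full terminal capture can be pasted in unchanged.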

After stopping the job, I reran python ldc_2d.py and got the following output, including one failure message:
For more information regarding mixed precision training, including how to make automatic mixed precision aware of a custom op type, please see the documentation available here:

2021-06-29 10:48:26.334006: W tensorflow/core/common_runtime/process_function_library_runtime.cc:688] Ignoring multi-device function optimization failure: Invalid argument: Node ‘_arg_continuity_0_3_0_arg’: Node name contains invalid characters
total_loss: 0.00034051613
time: 0.13434922218322753
total_loss: 0.00038129173
time: 0.01298861026763916
2021-06-29 10:48:40.876196: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1343] No whitelist ops found, nothing to do
loss went to Nans

The error message suggests that perhaps you are using automatic mixed precision, which is not currently supported in SimNet. Can you please do

export TF_ENABLE_AUTO_MIXED_PRECISION=0

and run this example again?
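
If changing the shell environment is inconvenient, for example inside a container job script, the same setting could be applied from Python. This is a minimal sketch, assuming the variable is read from the process environment when TensorFlow starts up, which is why it is set before any `import tensorflow` line. Placing it at the top of ldc_2d.py is a suggestion, not something the user guide prescribes.

```python
import os

# Assumption: TensorFlow reads this variable from the process environment,
# so it must be set before TensorFlow (or SimNet) is imported.
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "0"

# Confirm the value the training process will see.
amp_setting = os.environ.get("TF_ENABLE_AUTO_MIXED_PRECISION")
print(f"TF_ENABLE_AUTO_MIXED_PRECISION={amp_setting}")

# The SimNet / TensorFlow imports of the example script would follow here.
```

A variable set this way affects only the current Python process and its children, so it does not need to be unset afterwards.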

It works fine now. Thanks. However, it still shows the following message, although it keeps running. Is it OK to ignore this message?

W tensorflow/core/common_runtime/process_function_library_runtime.cc:688] Ignoring multi-device function optimization failure: Invalid argument: Node ‘_arg_continuity_0_3_0_arg’: Node name contains invalid characters

Glad to hear it is working now. Yes, you can ignore this warning message.