Model diverged with loss = nan

Please provide the following information when requesting support.

• Hardware (A30)
• Network Type (MaskRCNN with Resnet34)
• TLT Version (tao-toolkit-tf:v3.22.05-tf1.15.5-py3)
• Training spec file(
maskrcnn_train_resnet34.txt (2.4 KB)
)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
I get the following error:

ERROR:tensorflow:Model diverged with loss = NaN.
ERROR:tensorflow:Model diverged with loss = NaN.
ERROR:tensorflow:Model diverged with loss = NaN.
ERROR:tensorflow:Model diverged with loss = NaN

The full traceback is:

File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1426, in run
    run_metadata=run_metadata))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/basic_session_run_hooks.py", line 761, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[18199,1],3]
  Exit code:    1
--------------------------------------------------------------------------

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

For NaN loss, we suggest setting a lower learning rate.
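As a sketch, the learning-rate fields in the `train_config` section of the MaskRCNN training spec could be lowered along these lines (the values below are illustrative, not tuned for your dataset; keep the rest of your maskrcnn_train_resnet34.txt unchanged):

```
train_config {
    # Reduce the peak learning rate, e.g. from 0.01 to 0.002,
    # and keep a small warmup rate to avoid early divergence.
    init_learning_rate: 0.002
    warmup_learning_rate: 0.0001
    warmup_steps: 1000
    ...
}
```

If the loss still diverges, halve `init_learning_rate` again and/or extend `warmup_steps` before suspecting the data.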

Also, please double-check the annotation JSON file.
Refer to MaskRCNN - NVIDIA Docs and The first class is always not detected in inference - #25 by Morganh

The `id` under `categories` in the annotation file should start from 1.
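A quick way to verify this is a small script that loads the COCO-style annotation file and flags any category whose `id` is below 1 (a minimal sketch; the function name is mine, not part of TAO):

```python
import json

def check_category_ids(annotation_path):
    """Return True if every `id` under `categories` is >= 1.

    TAO MaskRCNN reserves id 0 for the background class, so category
    ids in the COCO annotation file must start from 1.
    """
    with open(annotation_path) as f:
        coco = json.load(f)
    bad = [c for c in coco.get("categories", []) if c["id"] < 1]
    for cat in bad:
        print(f"category {cat.get('name')!r} has invalid id {cat['id']} (must be >= 1)")
    return len(bad) == 0
```

Run it against your training and validation JSON files before converting to TFRecords; if it reports any invalid ids, remap the categories (and the matching `category_id` fields under `annotations`) so numbering starts at 1.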

You can also run the Mask_rcnn notebook for reference.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.