Invalid Loss

Morganh · June 23, 2022, 6:17am

Please update to the new version of the wheel, which has already been released to PyPI.
$ pip3 install nvidia-tao==0.1.24

rishika.v · June 23, 2022, 6:18am

Is that the reason for the NaN loss and the container stopping?

Morganh · June 23, 2022, 6:30am

For the NaN loss, I am afraid it is still related to the training parameters or the training images.

Experiment 1: Can you run the default Jupyter notebook successfully? The default notebook trains against the public KITTI dataset.
Experiment 2: Can you train on a small subset of your training images to check whether the issue still reproduces?
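
For Experiment 2, a small subset can be copied into a separate folder so the original dataset stays untouched. The sketch below is only an illustration: the folder layout (one image folder, one label folder with a same-named .txt file per image), the function name sample_subset, and the example paths are all assumptions, not something taken from your setup or from TAO.

```python
import random
import shutil
from pathlib import Path

# Assumed image extensions; adjust to whatever the dataset actually uses.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def sample_subset(image_dir, label_dir, out_dir, count, seed=0):
    """Copy `count` random images, plus same-stem .txt labels if present, into out_dir."""
    image_dir, label_dir, out_dir = Path(image_dir), Path(label_dir), Path(out_dir)
    images = sorted(p for p in image_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES)

    # A fixed seed makes the subset reproducible between runs.
    random.Random(seed).shuffle(images)
    chosen = images[:count]

    (out_dir / "images").mkdir(parents=True, exist_ok=True)
    (out_dir / "labels").mkdir(parents=True, exist_ok=True)
    for image in chosen:
        shutil.copy2(image, out_dir / "images" / image.name)
        label = label_dir / (image.stem + ".txt")
        if label.is_file():
            shutil.copy2(label, out_dir / "labels" / label.name)
    return [p.name for p in chosen]


# Hypothetical usage (paths are placeholders):
# sample_subset("data/images", "data/labels", "data_small", count=200)
```

If training on the subset runs without the invalid loss, growing the subset step by step can help locate the images or labels that trigger it.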

rishika.v · June 23, 2022, 7:09am

Hi, even after the update I am facing the same issue with multiple GPUs.
d1313b9e72a3:89 [0] NCCL INFO Launch mode Parallel
12/1649 […] - ETA: 10:22:39 - loss: 20093.9831Batch 11: Invalid loss, terminating training

Experiment 1: I have run the default notebook, and it completes successfully.
Experiment 2: I trained on this same dataset earlier and generated a decent model. But now, when I use the same data again, the training fails.
Something that was working fine earlier with this dataset is now not working, and I have changed nothing.

Morganh · June 23, 2022, 7:11am

Could you please check whether the default notebook still works now with multiple GPUs?

rishika.v · June 23, 2022, 7:41am

So you are telling me to use a new notebook, and not the one in which I have modified the parameters?

What difference does it make?

Morganh · June 23, 2022, 7:47am

This is just to narrow things down. You mentioned that multi-GPU training worked fine earlier with your dataset but no longer does, and last week there was a blocking issue (see Chmod: cannot access '/opt/ngccli/ngc': No such file or directory - #2 by Morganh). I am not sure whether that issue is causing your current NaN loss.
So, if possible, please run the default Jupyter notebook (against the KITTI dataset) again to check whether it still works.
If it works, the NaN loss is not related to the above-mentioned issue, and your training dataset or parameters need a closer look.
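
As one example of checking the dataset, degenerate bounding boxes are a common thing to rule out. The sketch below assumes KITTI-format label files (fields 5 to 8 of each row are the box corners x1 y1 x2 y2), as in the default notebook's dataset; nothing in this thread confirms that your labels use that format, and find_bad_boxes is a hypothetical helper, not part of TAO.

```python
import math


def find_bad_boxes(label_text):
    """Return (line_number, reason) pairs for suspicious rows of one KITTI-format label file."""
    problems = []
    for number, line in enumerate(label_text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 8:
            problems.append((number, "fewer than 8 fields"))
            continue
        try:
            # KITTI rows: type truncated occluded alpha x1 y1 x2 y2 ...
            x1, y1, x2, y2 = (float(value) for value in fields[4:8])
        except ValueError:
            problems.append((number, "non-numeric box coordinate"))
            continue
        if not all(math.isfinite(v) for v in (x1, y1, x2, y2)):
            problems.append((number, "NaN or infinite box coordinate"))
        elif x2 <= x1 or y2 <= y1:
            problems.append((number, "zero or negative box size"))
    return problems


# Hypothetical usage on one label file:
# print(find_bad_boxes(open("labels/000001.txt").read()))
```

An empty result only rules out these particular problems; it does not prove that the labels are fine.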

rishika.v · June 23, 2022, 8:21am

I will try, but again, I have generated a decent model with the same dataset and parameters.

rishika.v · June 23, 2022, 9:28am

For more information about alternatives visit: (‘Overview — Numba 0.50.1 documentation’, ‘#cudatoolkit-lookup’)
warnings.warn(errors.NumbaWarning(msg))
2022-06-23 09:26:41,758 [INFO] iva.common.export.keras_exporter: Using input nodes: ['input_1']
2022-06-23 09:26:41,759 [INFO] iva.common.export.keras_exporter: Using output nodes: ['predictions/Softmax']
NOTE: UFF has been tested with TensorFlow 1.14.0.
WARNING: The version of TensorFlow installed on this system is not guaranteed to work with UFF.
DEBUG: convert reshape to flatten node
DEBUG [/usr/local/lib/python3.6/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['predictions/Softmax'] as outputs
2022-06-23 09:26:46,204 [INFO] iva.common.export.keras_exporter: Calibration takes time especially if number of batches is large.
terminate called after throwing an instance of 'pybind11::error_already_set'
what(): ValueError: Batch size yielded from data source 8 < requested batch size from calibrator 16

At:
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/tensorfile_calibrator.py(79): get_data_from_source
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/tensorfile_calibrator.py(95): get_batch
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py(537): __init__
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py(696): __init__
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/keras_exporter.py(445): export
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py(250): run_export
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/export.py(42): main
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/export.py(46):

I get this error when I run the export command. Do you know how to fix it?
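
Read literally, the ValueError says the calibration data source handed over a batch of 8 samples while the calibrator asked for 16. The sketch below only illustrates that condition with plain arithmetic; the function names and the idea of a fixed-size data source are assumptions made for illustration and do not reflect how the exporter is implemented.

```python
def batch_sizes(num_samples, source_batch_size):
    """Sizes of the batches a fixed-size data source would yield, including a short last batch."""
    full, remainder = divmod(num_samples, source_batch_size)
    return [source_batch_size] * full + ([remainder] if remainder else [])


def first_undersized_batch(num_samples, source_batch_size, requested_batch_size):
    """Return (batch_index, size) of the first batch smaller than requested, or None."""
    for index, size in enumerate(batch_sizes(num_samples, source_batch_size)):
        if size < requested_batch_size:
            return index, size
    return None


# The data source yields 8 per batch, but 16 are requested: the very first batch fails.
print(first_undersized_batch(80, 8, 16))   # (0, 8)
# Matching batch sizes, but too few samples for the last batch: it comes up short.
print(first_undersized_batch(40, 16, 16))  # (2, 8)
# Matching batch sizes and enough samples: no undersized batch.
print(first_undersized_batch(80, 16, 16))  # None
```

Either case would produce a batch of 8 against a request of 16, so the full export command and log are needed to tell which one applies here.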

Morganh · June 23, 2022, 9:38am

I believe this is a new topic. Please create a new topic and share the full command and log. Thanks.

yingliu · July 11, 2022, 5:33am

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

system · July 25, 2022, 5:34am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.