Invalid Loss

Please update to a new version of the wheel which has already been released to PyPI.
$ pip3 install nvidia-tao==0.1.24

Is that the reason for the NaN loss and the container stopping?

For the NaN loss, I am afraid it is still related to the training parameters or the training images.

  • Experiment 1: Can you run the default Jupyter notebook successfully? The default notebook trains against the public KITTI dataset.
  • Experiment 2: Can you try a small subset of your training images to check whether the issue still reproduces?
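For Experiment 2, one way to build a small subset is sketched below. It is only a sketch under assumptions: it assumes one label file per image with the same file stem and a `.txt` extension (as in KITTI-style datasets), and the function name and directory layout are hypothetical, not part of TAO.

```python
import random
import shutil
from pathlib import Path


def sample_subset(image_dir, label_dir, out_dir, n, seed=0):
    """Copy n random image/label pairs into out_dir/images and out_dir/labels.

    Assumption: each image has a label file with the same stem and a .txt
    extension. Images without a matching label are skipped.
    """
    image_dir, label_dir, out_dir = Path(image_dir), Path(label_dir), Path(out_dir)

    # Keep only images that have a label file with the same stem.
    pairs = [
        (img, label_dir / (img.stem + ".txt"))
        for img in sorted(image_dir.iterdir())
        if img.is_file() and (label_dir / (img.stem + ".txt")).is_file()
    ]

    # A fixed seed keeps the subset reproducible between runs.
    random.Random(seed).shuffle(pairs)
    chosen = pairs[:n]

    (out_dir / "images").mkdir(parents=True, exist_ok=True)
    (out_dir / "labels").mkdir(parents=True, exist_ok=True)
    for img, lbl in chosen:
        shutil.copy2(img, out_dir / "images" / img.name)
        shutil.copy2(lbl, out_dir / "labels" / lbl.name)

    # Return the copied image names so the subset can be logged.
    return [img.name for img, _ in chosen]
```

Calling it with your own image and label folders and, say, `n=100` gives a small dataset to point the training spec at. If the NaN loss disappears on the subset, that would suggest looking at the remaining images or labels.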

Hi, even after the update I am facing the same issue with multiple GPUs.
d1313b9e72a3:89 [0] NCCL INFO Launch mode Parallel
12/1649 […] - ETA: 10:22:39 - loss: 20093.9831Batch 11: Invalid loss, terminating training

Experiment 1: I have done this and am able to run the default notebook successfully.
Experiment 2: I trained on the same dataset earlier and generated a decent model. But now, when I use the same data again… the training fails.
Something that was working fine earlier with a particular dataset is NOW not working, and I have changed nothing.

Could you please check whether the default notebook still works now with multiple GPUs?

So you are telling me to use a new notebook, and not the one in which I have modified the parameters?

What difference does it make?

Just in order to narrow it down. You mentioned that multi-GPU training was working fine earlier with your dataset but is not working now, and last week there was a blocking issue (see Chmod: cannot access '/opt/ngccli/ngc': No such file or directory - #2 by Morganh). I am not sure whether that issue is causing your current NaN loss.
So, if possible, please just run the default Jupyter notebook (which runs against the KITTI dataset) again to check whether it still works.
If it works, the NaN loss is not related to the above-mentioned issue, and we need to look further into your training dataset or parameters.
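As a starting point for checking the dataset, here is a rough sketch that scans label files for bounding boxes that look suspicious. It assumes KITTI-format text labels, where fields 5 to 8 of each line are xmin, ymin, xmax, ymax. The function name is hypothetical, and whether such labels are actually behind the invalid loss in this case is not confirmed.

```python
import math
from pathlib import Path


def find_bad_boxes(label_dir):
    """Return (file name, line number, reason) for every suspicious label line.

    Assumption: KITTI-style labels, one object per line, with the bounding
    box stored as xmin, ymin, xmax, ymax in fields 5-8 (indices 4-7).
    """
    problems = []
    for path in sorted(Path(label_dir).glob("*.txt")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            if len(fields) < 8:
                problems.append((path.name, lineno, "too few fields"))
                continue
            try:
                xmin, ymin, xmax, ymax = (float(v) for v in fields[4:8])
            except ValueError:
                problems.append((path.name, lineno, "non-numeric bbox"))
                continue
            if not all(math.isfinite(v) for v in (xmin, ymin, xmax, ymax)):
                problems.append((path.name, lineno, "non-finite bbox"))
            elif min(xmin, ymin) < 0:
                problems.append((path.name, lineno, "negative coordinate"))
            elif xmax <= xmin or ymax <= ymin:
                problems.append((path.name, lineno, "empty or inverted bbox"))
    return problems
```

An empty result does not rule out a data problem (corrupt images, for example, are not covered), but any hits are worth fixing before looking at the training parameters.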

I will try, but again… I have generated a decent model with the same dataset and parameters.

For more information about alternatives visit: (‘Overview — Numba 0.50.1 documentation’, ‘#cudatoolkit-lookup’)
warnings.warn(errors.NumbaWarning(msg))
2022-06-23 09:26:41,758 [INFO] iva.common.export.keras_exporter: Using input nodes: ['input_1']
2022-06-23 09:26:41,759 [INFO] iva.common.export.keras_exporter: Using output nodes: ['predictions/Softmax']
NOTE: UFF has been tested with TensorFlow 1.14.0.
WARNING: The version of TensorFlow installed on this system is not guaranteed to work with UFF.
DEBUG: convert reshape to flatten node
DEBUG [/usr/local/lib/python3.6/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['predictions/Softmax'] as outputs
2022-06-23 09:26:46,204 [INFO] iva.common.export.keras_exporter: Calibration takes time especially if number of batches is large.
terminate called after throwing an instance of 'pybind11::error_already_set'
what(): ValueError: Batch size yielded from data source 8 < requested batch size from calibrator 16

At:
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/tensorfile_calibrator.py(79): get_data_from_source
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/tensorfile_calibrator.py(95): get_batch
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py(537): __init__
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py(696): __init__
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/keras_exporter.py(445): export
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py(250): run_export
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/export.py(42): main
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/export.py(46): <module>

I get this error when I run the export command. Do you know how to fix it?

I believe this is a new topic. Please create a new topic and share the full command and log. Thanks.
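In the meantime, reading the ValueError literally: the calibration data source yields batches of 8, while the calibrator is asking for batches of 16. The snippet below is not the TAO code, only an illustration of that comparison with a hypothetical function name, to show which two numbers have to agree.

```python
def check_calibration_batch(source_batch_size, requested_batch_size):
    """Illustrative check only (hypothetical helper, not the TAO implementation).

    The data source must deliver at least as many samples per batch as the
    calibrator requests, otherwise calibration cannot proceed.
    """
    if source_batch_size < requested_batch_size:
        raise ValueError(
            f"Batch size yielded from data source {source_batch_size} "
            f"< requested batch size from calibrator {requested_batch_size}"
        )
    return True
```

So it is worth including, in the new topic, both the batch size used when the calibration data was generated and the batch size passed to the export command.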