Invalid Loss

I found one culprit in your spec.
For YOLOv4-tiny, if you use the cspdarknet_tiny arch, only big_grid_xy_extend and mid_grid_xy_extend should be provided to align with the anchor shapes; if you use the cspdarknet_tiny_3l arch, all three should be provided.

So, please add small_grid_xy_extend and retry.
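
For reference, a minimal sketch of the relevant yolov4_config fields for the cspdarknet_tiny_3l arch (the anchor shapes and extend values below are placeholders, not tuned recommendations):

yolov4_config {
  big_anchor_shape: "[(114, 61), (149, 127), (255, 229)]"
  mid_anchor_shape: "[(43, 95), (68, 34), (95, 73)]"
  small_anchor_shape: "[(12, 17), (21, 41), (39, 21)]"
  arch: "cspdarknet_tiny_3l"
  # all three *_grid_xy_extend fields are expected for the 3l arch
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.1
  small_grid_xy_extend: 0.2
  ...
}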

For the sequence data format, refer to the YOLOv4-tiny — TAO Toolkit 3.22.05 documentation.
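
In that format, the training images and KITTI-style label files are referenced directly in dataset_config rather than through TFRecords; a minimal sketch, with placeholder paths and class names:

dataset_config {
  data_sources {
    label_directory_path: "/workspace/tao-experiments/data/training/label_2"
    image_directory_path: "/workspace/tao-experiments/data/training/image_2"
  }
  target_class_mapping {
    key: "car"
    value: "car"
  }
  ...
}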

Hi, the sequence data format gives me the following error:
INFO: Validation dataset specified by validation_fold requires the training label format to be TFRecords.
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 145, in <module>
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 707, in return_func
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 695, in return_func
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 141, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 126, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 63, in run_experiment
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/utils/spec_loader.py", line 57, in load_experiment_spec
AssertionError: Validation dataset specified by validation_fold requires the training label format to be TFRecords.

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[19457,1],1]
Exit code: 1

2022-06-21 14:21:15,559 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Adding small_grid_xy_extend to the .txt spec file does nothing; it still throws the same error.

Now I am facing the same issue even for classification.
edcda39c77a8:158:253 [1] NCCL INFO comm 0x7faf387dbc80 rank 1 nranks 2 cudaDev 1 busId a000 - Init COMPLETE
edcda39c77a8:157:256 [0] NCCL INFO comm 0x7fd7c07e5810 rank 0 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE
edcda39c77a8:157:256 [0] NCCL INFO Launch mode Parallel
1/3633 […] - ETA: 4:38:20 - loss: 2.9416 - acc: 0.0000e+00WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

2022-06-21 08:18:43,918 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

204/3633 [>…] - ETA: 3:17 - loss: nan - acc: 0.0539^C

According to the above logs, these are not the same issue.

Please share the spec file you are using with the sequence data format.
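
As a side note on the assertion in the log above: validation_fold only applies when the training labels are in TFRecords format. With the sequence data format, the validation set is usually pointed to with a validation_data_sources block instead; a minimal sketch, with placeholder paths:

dataset_config {
  ...
  validation_data_sources {
    label_directory_path: "/workspace/tao-experiments/data/val/label_2"
    image_directory_path: "/workspace/tao-experiments/data/val/image_2"
  }
}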

Hi, I am facing this issue while using multiple GPUs.
Do you know what is causing this? Earlier I trained on multiple GPUs and never faced this issue.

So, do you mean

  • you get NaN loss while training with multiple GPUs?
  • you could train yolov4_tiny with multiple GPUs earlier but hit this issue this week?

Yes, you got it right.

I don't know what happened all of a sudden. Also, when I run the Jupyter notebook, the container just stops once I run any command in the notebook.
Is there a fix for this?

Please update to the new version of the wheel, which has already been released to PyPI:
$ pip3 install nvidia-tao==0.1.24
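
To confirm the upgrade took effect in the environment you launch the notebook from, something like the following should report version 0.1.24:

$ pip3 show nvidia-tao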

Is that the reason for the NaN loss and the container stopping issue?

For the NaN loss, I am afraid it is still related to the training parameters or training images.

  • Experiment 1: Can you run the default Jupyter notebook successfully? The default notebook trains against the public KITTI dataset.
  • Experiment 2: Can you try a small subset of your training images to check whether the issue is still reproducible?

Hi, even after the update I am facing the same issue with multiple GPUs.
d1313b9e72a3:89 [0] NCCL INFO Launch mode Parallel
12/1649 […] - ETA: 10:22:39 - loss: 20093.9831Batch 11: Invalid loss, terminating training

Experiment 1: I am able to run the default notebook successfully.
Experiment 2: I trained with the same dataset earlier and generated a decent model, but now, when I use the same data again, the training fails.
Something that was working fine earlier with a particular dataset is now not working, and I have changed nothing.

Could you please check whether the default notebook still works with multiple GPUs?

So you are telling me to use a new notebook, and not the one in which I have modified the parameters?

What difference does it make?

Just in order to narrow this down. Since you mentioned that multi-GPU training was working fine earlier with your dataset but is not working now, and last week there was a blocking issue (see Chmod: cannot access '/opt/ngccli/ngc': No such file or directory - #2 by Morganh), I am not sure whether that issue is causing your current NaN loss.
So, if possible, please run the default Jupyter notebook (against the KITTI dataset) again to check whether it still works.
If it works, that means the problem is not related to the above-mentioned issue, and we need to look more closely at your training dataset or parameters.

I will try, but again, I have already generated a decent model with the same dataset and parameters.

For more information about alternatives visit: ('Overview — Numba 0.50.1 documentation', '#cudatoolkit-lookup')
warnings.warn(errors.NumbaWarning(msg))
2022-06-23 09:26:41,758 [INFO] iva.common.export.keras_exporter: Using input nodes: ['input_1']
2022-06-23 09:26:41,759 [INFO] iva.common.export.keras_exporter: Using output nodes: ['predictions/Softmax']
NOTE: UFF has been tested with TensorFlow 1.14.0.
WARNING: The version of TensorFlow installed on this system is not guaranteed to work with UFF.
DEBUG: convert reshape to flatten node
DEBUG [/usr/local/lib/python3.6/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['predictions/Softmax'] as outputs
2022-06-23 09:26:46,204 [INFO] iva.common.export.keras_exporter: Calibration takes time especially if number of batches is large.
terminate called after throwing an instance of 'pybind11::error_already_set'
what(): ValueError: Batch size yielded from data source 8 < requested batch size from calibrator 16

At:
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/tensorfile_calibrator.py(79): get_data_from_source
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/tensorfile_calibrator.py(95): get_batch
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py(537): __init__
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py(696): __init__
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/keras_exporter.py(445): export
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py(250): run_export
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/export.py(42): main
/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/export.py(46): <module>

This happens when I run the export command. Do you know how to fix it?
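
For what it is worth, the assertion suggests the INT8 calibration data source yields batches of 8 while the calibrator requested 16, so the --batch_size passed to export likely needs to match the batch size the calibration tensorfile was generated with. A hypothetical invocation (model path, key, and calibration file paths are placeholders; the exact flags should be checked against the classification export help):

tao classification export -m /workspace/output/weights/final_model.tlt \
                          -k $KEY \
                          -o /workspace/output/final_model.etlt \
                          --data_type int8 \
                          --cal_data_file /workspace/output/calibration.tensorfile \
                          --cal_cache_file /workspace/output/cal.bin \
                          --batch_size 8 \
                          --batches 10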

I believe this is a new topic. Please create a new topic and share the full command and log. Thanks.

There has been no update from you for a while, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks
