6abdae4a2479:150:606 [3] NCCL INFO Call to connect returned Connection refused, retrying

Because we provide an 18-point JSON file for training, pose_config_path is updated before training.
See

You can download the Jupyter notebook for reference.
https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_quick_start_guide.html#download-jupyter-notebook
or TAO Toolkit Computer Vision Sample Workflows | NVIDIA NGC

I am using a custom dataset and get this error:
ValueError: last step must be > 0. It is 0

180 training sets
20 test sets
This is my data_val.json:
keypoints_val.json (29.4 KB)

Generate TFRecords for training dataset

!tao bpnet dataset_convert \
-m 'train' \
-o $DATA_DIR/train \
--generate_masks \
--dataset_spec $DATA_POSE_SPECS_DIR/coco_spec.json

OUT:
2022-01-19 16:07:32,338 [INFO] root: Registry: [‘nvcr.io’]
2022-01-19 16:07:32,537 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2022-01-19 16:07:33,716 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the “user”:“UID:GID” in the
DockerOptions portion of the “/home/chenhongzhao/.tao_mounts.json” file. You can obtain your
users UID and GID by using the “id -u” and “id -g” commands on the
terminal.
2022-01-19 08:07:35.909337: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

2022-01-19 08:07:40,332 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

2022-01-19 08:07:45,366 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

loading annotations into memory…
Done (t=0.01s)
creating index…
index created!
100%|██████████████████████████████████████| 171/171 [00:00<00:00, 21069.47it/s]
loading annotations into memory…
Done (t=0.00s)
creating index…
index created!
100%|████████████████████████████████████████| 19/19 [00:00<00:00, 21144.01it/s]
INFO 2022-01-19 08:07:46,300 | driveix.bpnet.dataio.coco_dataset: Mask Generation: 0/171
100%|█████████████████████████████████████| 171/171 [00:00<00:00, 396915.32it/s]
INFO 2022-01-19 08:07:47,078 | driveix.bpnet.dataio.dataset_converter_lib: Writing partition 0, shard 0
0it [00:00, ?it/s]
INFO 2022-01-19 08:07:47,082 | driveix.bpnet.dataio.dataset_converter_lib: Wrote the following numbers of objects: 0

INFO 2022-01-19 08:07:47,082 | driveix.bpnet.dataio.dataset_converter_lib: Wrote the following numbers of objects: 0

2022-01-19 16:07:48,841 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Is it a “segmentation” problem?
I only set a quadrilateral for it.

Hi Morganh,
Does image resolution affect training? My images are 1920x1080:
"images": [{"height": 1080, "width": 1920}]

I set the “area” field, and that solved the problem.
But now I have a new problem.
This is my dataset, made according to the notebook:
keypoints_train.json (317.0 KB)
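
For anyone hitting the same "Wrote the following numbers of objects: 0" output: the fix reported above was to set "area" on the annotations. The thread does not say which value was used, so the following is only a minimal sketch that fills a missing or non-positive "area" with the bounding-box area, assuming the standard COCO bbox layout [x, y, width, height]. The function name and the file paths in the comments are hypothetical.

```python
import json


def fill_missing_area(coco: dict) -> int:
    """Set annotation["area"] from its bbox when it is missing or not positive.

    Returns the number of annotations that were changed.
    Assumption: bbox follows the COCO convention [x, y, width, height].
    """
    changed = 0
    for ann in coco.get("annotations", []):
        area = ann.get("area")
        if isinstance(area, (int, float)) and area > 0:
            continue  # already has a usable area
        bbox = ann.get("bbox")
        if not bbox or len(bbox) != 4:
            continue  # nothing to derive the area from
        _, _, width, height = bbox
        ann["area"] = float(width) * float(height)
        changed += 1
    return changed


# Small in-memory example: one annotation without "area", one with it.
example = {
    "annotations": [
        {"id": 1, "bbox": [10, 20, 30, 40]},
        {"id": 2, "bbox": [0, 0, 5, 5], "area": 25.0},
    ]
}
print("annotations updated:", fill_missing_area(example))

# To apply it to a file (paths are placeholders):
# with open("keypoints_train.json") as f:
#     data = json.load(f)
# fill_missing_area(data)
# with open("keypoints_train_fixed.json", "w") as f:
#     json.dump(data, f)
```

The bounding-box area is an approximation; in COCO, "area" normally refers to the area of the segmentation, so use that value instead if it is available.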

I encountered an error during training (batch_size=1):
2022-01-20 10:59:07,277 [INFO] root: Registry: [‘nvcr.io’]
2022-01-20 10:59:07,471 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2022-01-20 10:59:08,543 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the “user”:“UID:GID” in the
DockerOptions portion of the “/home/chenhongzhao/.tao_mounts.json” file. You can obtain your
users UID and GID by using the “id -u” and “id -g” commands on the
terminal.
2022-01-20 02:59:10.763322: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

2022-01-20 02:59:15,408 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

2022-01-20 02:59:21,195 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/train.py:91: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING 2022-01-20 02:59:21,196| tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/train.py:91: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/train.py:91: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

WARNING 2022-01-20 02:59:21,196| tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/train.py:91: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

/usr/local/lib/python3.6/dist-packages/driveix/bpnet/scripts/train.py:110: YAMLLoadWarning: calling yaml.load() without Loader=… is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
/workspace/tao-experiments/bpnet/models/exp_m1_unpruned
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/dataloaders/bpnet_dataloader.py:484: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

WARNING 2022-01-20 02:59:22,746| tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/dataloaders/bpnet_dataloader.py:484: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

INFO 2022-01-20 02:59:22,757| main: done
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING 2022-01-20 02:59:22,757| tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

/workspace/tao-experiments/bpnet/data/train-fold-000-of-001: 178
Total Samples: 178
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/dataloaders/bpnet_dataloader.py:319: The name tf.matrix_inverse is deprecated. Please use tf.linalg.inv instead.

WARNING 2022-01-20 02:59:22,848| tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/dataloaders/bpnet_dataloader.py:319: The name tf.matrix_inverse is deprecated. Please use tf.linalg.inv instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/dataloaders/bpnet_dataloader.py:224: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

WARNING 2022-01-20 02:59:22,878| tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/dataloaders/bpnet_dataloader.py:224: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

INFO 2022-01-20 02:59:22,893| driveix.bpnet.trainers.bpnet_trainer: Building model graph from model defintion …
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING 2022-01-20 02:59:22,895| tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING 2022-01-20 02:59:22,918| tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4115: The name tf.random_normal is deprecated. Please use tf.random.normal instead.

WARNING 2022-01-20 02:59:23,250| tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4115: The name tf.random_normal is deprecated. Please use tf.random.normal instead.

INFO 2022-01-20 02:59:23,643| driveix.bpnet.trainers.bpnet_trainer: Not first run and not finetuning experiment → Loading from latest checkpoint…
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/losses/bpnet_loss.py:120: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

WARNING 2022-01-20 02:59:23,654| tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/losses/bpnet_loss.py:120: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

INFO 2022-01-20 02:59:26,849| main: training
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:59: The name tf.train.LoggingTensorHook is deprecated. Please use tf.estimator.LoggingTensorHook instead.

WARNING 2022-01-20 02:59:26,853| tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:59: The name tf.train.LoggingTensorHook is deprecated. Please use tf.estimator.LoggingTensorHook instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:60: The name tf.train.StopAtStepHook is deprecated. Please use tf.estimator.StopAtStepHook instead.

WARNING 2022-01-20 02:59:26,853| tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:60: The name tf.train.StopAtStepHook is deprecated. Please use tf.estimator.StopAtStepHook instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:73: The name tf.train.StepCounterHook is deprecated. Please use tf.estimator.StepCounterHook instead.

WARNING 2022-01-20 02:59:26,854| tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:73: The name tf.train.StepCounterHook is deprecated. Please use tf.estimator.StepCounterHook instead.

INFO:tensorflow:Create CheckpointSaverHook.
INFO 2022-01-20 02:59:26,854| tensorflow: Create CheckpointSaverHook.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:99: The name tf.train.SummarySaverHook is deprecated. Please use tf.estimator.SummarySaverHook instead.

WARNING 2022-01-20 02:59:26,854| tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:99: The name tf.train.SummarySaverHook is deprecated. Please use tf.estimator.SummarySaverHook instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/trainers/bpnet_trainer.py:300: The name tf.train.NanTensorHook is deprecated. Please use tf.estimator.NanTensorHook instead.

WARNING 2022-01-20 02:59:26,854| tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/trainers/bpnet_trainer.py:300: The name tf.train.NanTensorHook is deprecated. Please use tf.estimator.NanTensorHook instead.

INFO:tensorflow:Graph was finalized.
INFO 2022-01-20 02:59:36,833| tensorflow: Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpmdw_ct22/model.ckpt-115240
INFO 2022-01-20 02:59:37,243| tensorflow: Restoring parameters from /tmp/tmpmdw_ct22/model.ckpt-115240
INFO:tensorflow:Running local_init_op.
INFO 2022-01-20 02:59:38,394| tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO 2022-01-20 02:59:38,531| tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-115240.
INFO 2022-01-20 03:00:21,579| tensorflow: Saving checkpoints for step-115240.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING 2022-01-20 03:00:33,133| tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: assertion failed: [32.3707886]
[[{{node Optimizer/Assert/AssertGuard/Assert}}]]
(1) Invalid argument: assertion failed: [32.3707886]
[[{{node Optimizer/Assert/AssertGuard/Assert}}]]
[[Model/block_3d_bn_1/AssignMovingAvg_1/_2705]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/train.py”, line 146, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/train.py”, line 137, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/trainers/bpnet_trainer.py”, line 316, in train
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/trainers/trainer.py”, line 119, in run_training_loop
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 754, in run
run_metadata=run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1360, in run
raise six.reraise(*original_exc_info)
File “/usr/local/lib/python3.6/dist-packages/six.py”, line 696, in reraise
raise value
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1345, in run
return self._sess.run(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1418, in run
run_metadata=run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1176, in run
return self._sess.run(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: assertion failed: [32.3707886]
[[node Optimizer/Assert/AssertGuard/Assert (defined at usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Invalid argument: assertion failed: [32.3707886]
[[node Optimizer/Assert/AssertGuard/Assert (defined at usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[Model/block_3d_bn_1/AssignMovingAvg_1/_2705]]
0 successful operations.
0 derived errors ignored.

Original stack trace for ‘Optimizer/Assert/AssertGuard/Assert’:
File “root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/train.py”, line 146, in
File “root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/train.py”, line 134, in main
File “root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/trainers/bpnet_trainer.py”, line 255, in build
File “root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/trainers/bpnet_trainer.py”, line 245, in _build_distributed
File “root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/optimizers/weighted_momentum_optimizer.py”, line 55, in build
File “root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/learning_rate_schedules/softstart_annealing_schedule.py”, line 114, in get_tensor
File “root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/hooks/utils.py”, line 40, in get_softstart_annealing_learning_rate
File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/tf_should_use.py”, line 198, in wrapped
return _add_should_use_warning(fn(*args, **kwargs))
File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py”, line 173, in Assert
guarded_assert = cond(condition, no_op, true_assert, name=“AssertGuard”)
File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 513, in new_func
return func(*args, **kwargs)
File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py”, line 1235, in cond
orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py”, line 1061, in BuildCondBranch
original_result = fn()
File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py”, line 171, in true_assert
condition, data, summarize, name=“Assert”)
File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_logging_ops.py”, line 74, in _assert
name=name)
File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 513, in new_func
return func(*args, **kwargs)
File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()

Traceback (most recent call last):
File “/usr/local/bin/bpnet”, line 8, in
sys.exit(main())
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/entrypoint/bpnet.py”, line 12, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/common/entrypoint/entrypoint.py”, line 300, in launch_job
AssertionError: Process run failed.
2022-01-20 11:00:47,788 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
--------------------------------
If I set batch_size=2, the error message changes:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: assertion failed: [64.7415771]
[[{{node Optimizer/Assert/AssertGuard/Assert}}]]
(1) Invalid argument: assertion failed: [64.7415771]
[[{{node Optimizer/Assert/AssertGuard/Assert}}]]
[[Optimizer/Exp/_2677]]

--------------------------------
This is too hard for me. Please help.
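
A note on the two assertion values in this post: they look consistent with the learning-rate schedule asserting that training progress (restored step divided by total steps) does not exceed 1. The log restores model.ckpt-115240 and reports "Total Samples: 178". If the run is configured for 20 epochs (an assumption; the training spec is not shown in this thread), the ratio comes out at about 32.37 for batch_size=1 and about 64.74 for batch_size=2, which matches both messages. A minimal sketch of that arithmetic, with hypothetical names:

```python
import math

RESTORED_STEP = 115240  # from "Restoring parameters from ... model.ckpt-115240"
NUM_SAMPLES = 178       # from "Total Samples: 178"
ASSUMED_EPOCHS = 20     # assumption: not stated anywhere in the thread


def progress(batch_size: int, epochs: int = ASSUMED_EPOCHS) -> float:
    """Restored step divided by the total number of steps in the new schedule."""
    steps_per_epoch = math.ceil(NUM_SAMPLES / batch_size)
    return RESTORED_STEP / (steps_per_epoch * epochs)


for bs in (1, 2):
    print(f"batch_size={bs}: progress = {progress(bs):.4f}")
```

If this reading is right, the checkpoint being resumed comes from an earlier, much longer run, so the restored step is already far past the end of the new schedule. Training into an empty results directory, or removing the old checkpoints, would be the first thing to try. This is an inference from the numbers, not something confirmed in the thread.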

Please create a new topic.
I think we have already fixed the original issue, along with several newer ones.

ok
