BodyPoseNet training not converging

neuroSparK · July 13, 2021, 12:38pm


• Hardware: T4
• Network Type: BodyPoseNet
• TLT Version: docker_tag: v3.0-py3
• Training spec file: Same as example
• How to reproduce the issue : https://developer.nvidia.com/blog/training-optimizing-2d-pose-estimation-model-with-tlt-part-1

I have been trying to train a BodyPoseNet (bpnet) model with TLT, following the tutorial linked above. After 35 epochs of training, the loss is still above 300, which does not look right to me. I followed the tutorial step by step with the COCO dataset. Here is the training output:

2021-07-13 12:40:11,187 [INFO] root: Registry: ['nvcr.io']
2021-07-13 12:40:11,252 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2021-07-13 12:40:12.338733: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/train.py:91: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING 2021-07-13 12:40:19,167| tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/train.py:91: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/train.py:91: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

WARNING 2021-07-13 12:40:19,167| tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/train.py:91: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

/workspace/tlt-experiments/bpnet/models/exp_m1_unpruned
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING 2021-07-13 12:40:19,194| tensorflow: From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING 2021-07-13 12:40:19,194| tensorflow: From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/dataloaders/bpnet_dataloader.py:484: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

WARNING 2021-07-13 12:40:19,731| tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/dataloaders/bpnet_dataloader.py:484: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

INFO    2021-07-13 12:40:19,740| __main__: done
/workspace/tlt-experiments/bpnet/data/train-fold-000-of-001: 115254
Total Samples: 115254
WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/dataloaders/bpnet_dataloader.py:319: The name tf.matrix_inverse is deprecated. Please use tf.linalg.inv instead.

WARNING 2021-07-13 12:40:20,117| tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/dataloaders/bpnet_dataloader.py:319: The name tf.matrix_inverse is deprecated. Please use tf.linalg.inv instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/dataloaders/bpnet_dataloader.py:224: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

WARNING 2021-07-13 12:40:20,133| tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/dataloaders/bpnet_dataloader.py:224: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

INFO    2021-07-13 12:40:20,686| driveix.bpnet.trainers.bpnet_trainer: Building model graph from model defintion ...
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING 2021-07-13 12:40:20,688| tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING 2021-07-13 12:40:20,706| tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4115: The name tf.random_normal is deprecated. Please use tf.random.normal instead.

WARNING 2021-07-13 12:40:20,944| tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4115: The name tf.random_normal is deprecated. Please use tf.random.normal instead.

INFO    2021-07-13 12:40:21,233| driveix.bpnet.trainers.bpnet_trainer: Not first run and not finetuning experiment ->                         Loading from latest checkpoint...
WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/losses/bpnet_loss.py:120: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

WARNING 2021-07-13 12:40:21,241| tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/losses/bpnet_loss.py:120: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

INFO    2021-07-13 12:40:22,966| __main__: training
WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:59: The name tf.train.LoggingTensorHook is deprecated. Please use tf.estimator.LoggingTensorHook instead.

WARNING 2021-07-13 12:40:22,969| tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:59: The name tf.train.LoggingTensorHook is deprecated. Please use tf.estimator.LoggingTensorHook instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:60: The name tf.train.StopAtStepHook is deprecated. Please use tf.estimator.StopAtStepHook instead.

WARNING 2021-07-13 12:40:22,970| tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:60: The name tf.train.StopAtStepHook is deprecated. Please use tf.estimator.StopAtStepHook instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:73: The name tf.train.StepCounterHook is deprecated. Please use tf.estimator.StepCounterHook instead.

WARNING 2021-07-13 12:40:22,970| tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:73: The name tf.train.StepCounterHook is deprecated. Please use tf.estimator.StepCounterHook instead.

INFO:tensorflow:Create CheckpointSaverHook.
INFO    2021-07-13 12:40:22,970| tensorflow: Create CheckpointSaverHook.
WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:99: The name tf.train.SummarySaverHook is deprecated. Please use tf.estimator.SummarySaverHook instead.

WARNING 2021-07-13 12:40:22,970| tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:99: The name tf.train.SummarySaverHook is deprecated. Please use tf.estimator.SummarySaverHook instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/trainers/bpnet_trainer.py:300: The name tf.train.NanTensorHook is deprecated. Please use tf.estimator.NanTensorHook instead.

WARNING 2021-07-13 12:40:22,970| tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/trainers/bpnet_trainer.py:300: The name tf.train.NanTensorHook is deprecated. Please use tf.estimator.NanTensorHook instead.

INFO:tensorflow:Graph was finalized.
INFO    2021-07-13 12:40:30,584| tensorflow: Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpl2px3e15/model.ckpt-403375
INFO    2021-07-13 12:40:30,830| tensorflow: Restoring parameters from /tmp/tmpl2px3e15/model.ckpt-403375
INFO:tensorflow:Running local_init_op.
INFO    2021-07-13 12:40:31,610| tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO    2021-07-13 12:40:31,714| tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-403375.
INFO    2021-07-13 12:41:03,860| tensorflow: Saving checkpoints for step-403375.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING 2021-07-13 12:41:17,505| tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

INFO:tensorflow:epoch = 35.0, loss = 478.4702, step = 403375
INFO    2021-07-13 12:41:32,499| tensorflow: epoch = 35.0, loss = 478.4702, step = 403375
WARNING 2021-07-13 12:41:34,912| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-13 12:41:34,917| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 1.56667
INFO    2021-07-13 12:41:45,267| tensorflow: global_step/sec: 1.56667
INFO:tensorflow:global_step/sec: 2.16212
INFO    2021-07-13 12:41:54,517| tensorflow: global_step/sec: 2.16212
INFO:tensorflow:epoch = 35.00511930585683, loss = 463.2732, step = 403434 (32.497 sec)
INFO    2021-07-13 12:42:04,997| tensorflow: epoch = 35.00511930585683, loss = 463.2732, step = 403434 (32.497 sec)
INFO:tensorflow:global_step/sec: 1.82919
INFO    2021-07-13 12:42:05,451| tensorflow: global_step/sec: 1.82919
INFO:tensorflow:global_step/sec: 2.12381
INFO    2021-07-13 12:42:14,868| tensorflow: global_step/sec: 2.12381
INFO:tensorflow:global_step/sec: 2.14787
INFO    2021-07-13 12:42:24,179| tensorflow: global_step/sec: 2.14787
INFO:tensorflow:global_step/sec: 2.15127
INFO    2021-07-13 12:42:33,476| tensorflow: global_step/sec: 2.15127
INFO:tensorflow:epoch = 35.01084598698482, loss = 347.65714, step = 403500 (30.881 sec)
INFO    2021-07-13 12:42:35,877| tensorflow: epoch = 35.01084598698482, loss = 347.65714, step = 403500 (30.881 sec)
WARNING 2021-07-13 12:42:35,991| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-13 12:42:35,997| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.10133
INFO    2021-07-13 12:42:42,994| tensorflow: global_step/sec: 2.10133
WARNING 2021-07-13 12:42:47,360| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-13 12:42:47,367| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.10425
INFO    2021-07-13 12:42:52,499| tensorflow: global_step/sec: 2.10425
INFO:tensorflow:global_step/sec: 2.11084
INFO    2021-07-13 12:43:01,974| tensorflow: global_step/sec: 2.11084
INFO:tensorflow:epoch = 35.01648590021692, loss = 343.23376, step = 403565 (30.972 sec)
INFO    2021-07-13 12:43:06,849| tensorflow: epoch = 35.01648590021692, loss = 343.23376, step = 403565 (30.972 sec)
INFO:tensorflow:global_step/sec: 2.06971
INFO    2021-07-13 12:43:11,637| tensorflow: global_step/sec: 2.06971
INFO:tensorflow:global_step/sec: 2.01006
INFO    2021-07-13 12:43:21,587| tensorflow: global_step/sec: 2.01006
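
For context, the epoch and step counters in the log above are enough to work out the number of steps per epoch, and with the sample count, roughly how many samples go into each step. A rough sketch of that arithmetic follows. The per-step sample count is inferred from the log, not read from the spec file, so treat it as an assumption:

```python
# Values copied from the training log above.
total_samples = 115254
step_a, epoch_a = 403375, 35.0
step_b, epoch_b = 403434, 35.00511930585683

# Steps per epoch, from the change in step count versus the change in epoch.
steps_per_epoch = round((step_b - step_a) / (epoch_b - epoch_a))

# Inferred samples per step (assumption: the spec file itself is not shown here).
samples_per_step = total_samples / steps_per_epoch

print(steps_per_epoch)
print(round(samples_per_step, 2))
```

If that inference holds, each logged loss value is presumably computed on a single small batch, so some step-to-step swing in the raw number is to be expected.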

Morganh · July 14, 2021, 5:26am

The loss still seems to be decreasing. What does the latest result look like?
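
One way to judge the trend, rather than individual values, is to average the logged loss over a window of entries. Below is a minimal sketch, assuming the console output is available as text. `parse_losses` and `windowed_means` are hypothetical helper names for illustration, not part of TLT, and the sample lines are copied from the log posted above:

```python
import re
from statistics import mean

# Matches the "epoch = ..., loss = ..., step = ..." lines printed during training.
LOSS_LINE = re.compile(r"epoch = ([\d.]+), loss = ([\d.]+), step = (\d+)")

def parse_losses(log_text):
    """Return sorted (step, loss) pairs; duplicate lines for a step collapse to one."""
    by_step = {}
    for match in LOSS_LINE.finditer(log_text):
        by_step[int(match.group(3))] = float(match.group(2))
    return sorted(by_step.items())

def windowed_means(pairs, window):
    """Average the loss over consecutive, non-overlapping windows of log entries."""
    losses = [loss for _, loss in pairs]
    return [mean(losses[i:i + window]) for i in range(0, len(losses), window)]

sample = """
INFO:tensorflow:epoch = 35.0, loss = 478.4702, step = 403375
INFO    2021-07-13 12:41:32,499| tensorflow: epoch = 35.0, loss = 478.4702, step = 403375
INFO:tensorflow:epoch = 35.00511930585683, loss = 463.2732, step = 403434 (32.497 sec)
INFO:tensorflow:epoch = 35.01084598698482, loss = 347.65714, step = 403500 (30.881 sec)
INFO:tensorflow:epoch = 35.01648590021692, loss = 343.23376, step = 403565 (30.972 sec)
"""

pairs = parse_losses(sample)
print(windowed_means(pairs, window=2))
```

With a larger window over a longer stretch of the log, the averaged values give a clearer picture of whether the loss is still going down or has flattened out.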

neuroSparK · July 14, 2021, 12:17pm

It is not decreasing steadily. It is still fluctuating around 300, even after about 50 epochs:

INFO:tensorflow:global_step/sec: 2.02646
INFO    2021-07-14 11:46:43,420| tensorflow: global_step/sec: 2.02646
WARNING 2021-07-14 11:46:46,421| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.02331
INFO    2021-07-14 11:46:53,304| tensorflow: global_step/sec: 2.02331
INFO:tensorflow:global_step/sec: 2.01462
INFO    2021-07-14 11:47:03,232| tensorflow: global_step/sec: 2.01462
INFO:tensorflow:epoch = 49.67479392624729, loss = 377.7373, step = 572502 (30.775 sec)
INFO    2021-07-14 11:47:06,752| tensorflow: epoch = 49.67479392624729, loss = 377.7373, step = 572502 (30.775 sec)
INFO:tensorflow:global_step/sec: 2.01858
INFO    2021-07-14 11:47:13,140| tensorflow: global_step/sec: 2.01858
INFO:tensorflow:global_step/sec: 2.01692
INFO    2021-07-14 11:47:23,056| tensorflow: global_step/sec: 2.01692
WARNING 2021-07-14 11:47:29,977| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 11:47:29,977| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.04329
INFO    2021-07-14 11:47:32,844| tensorflow: global_step/sec: 2.04329
INFO:tensorflow:epoch = 49.68017353579176, loss = 243.96678, step = 572564 (30.601 sec)
INFO    2021-07-14 11:47:37,353| tensorflow: epoch = 49.68017353579176, loss = 243.96678, step = 572564 (30.601 sec)
WARNING 2021-07-14 11:47:39,012| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 11:47:39,017| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.01361
INFO    2021-07-14 11:47:42,776| tensorflow: global_step/sec: 2.01361
INFO:tensorflow:global_step/sec: 2.02996
INFO    2021-07-14 11:47:52,629| tensorflow: global_step/sec: 2.02996
INFO:tensorflow:global_step/sec: 2.0181
INFO    2021-07-14 11:48:02,539| tensorflow: global_step/sec: 2.0181
INFO:tensorflow:epoch = 49.68555314533623, loss = 373.45892, step = 572626 (30.785 sec)
INFO    2021-07-14 11:48:08,138| tensorflow: epoch = 49.68555314533623, loss = 373.45892, step = 572626 (30.785 sec)
INFO:tensorflow:global_step/sec: 1.99067
INFO    2021-07-14 11:48:12,586| tensorflow: global_step/sec: 1.99067
INFO:tensorflow:global_step/sec: 2.03931
INFO    2021-07-14 11:48:22,393| tensorflow: global_step/sec: 2.03931
INFO:tensorflow:global_step/sec: 2.00862
INFO    2021-07-14 11:48:32,350| tensorflow: global_step/sec: 2.00862
INFO:tensorflow:epoch = 49.6909327548807, loss = 308.17642, step = 572688 (30.518 sec)
INFO    2021-07-14 11:48:38,656| tensorflow: epoch = 49.6909327548807, loss = 308.17642, step = 572688 (30.518 sec)
INFO:tensorflow:global_step/sec: 2.04974
INFO    2021-07-14 11:48:42,108| tensorflow: global_step/sec: 2.04974
INFO:tensorflow:global_step/sec: 2.03353
INFO    2021-07-14 11:48:51,943| tensorflow: global_step/sec: 2.03353
INFO:tensorflow:global_step/sec: 1.98599
INFO    2021-07-14 11:49:02,013| tensorflow: global_step/sec: 1.98599
INFO:tensorflow:epoch = 49.69631236442516, loss = 275.7372, step = 572750 (30.802 sec)
INFO    2021-07-14 11:49:09,458| tensorflow: epoch = 49.69631236442516, loss = 275.7372, step = 572750 (30.802 sec)
WARNING 2021-07-14 11:49:11,635| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.00839
INFO    2021-07-14 11:49:11,972| tensorflow: global_step/sec: 2.00839
INFO:tensorflow:global_step/sec: 1.99487
INFO    2021-07-14 11:49:21,997| tensorflow: global_step/sec: 1.99487
INFO:tensorflow:global_step/sec: 2.02469
INFO    2021-07-14 11:49:31,875| tensorflow: global_step/sec: 2.02469
INFO:tensorflow:epoch = 49.70169197396963, loss = 422.6612, step = 572812 (30.945 sec)
INFO    2021-07-14 11:49:40,402| tensorflow: epoch = 49.70169197396963, loss = 422.6612, step = 572812 (30.945 sec)
INFO:tensorflow:global_step/sec: 1.98905
INFO    2021-07-14 11:49:41,930| tensorflow: global_step/sec: 1.98905
WARNING 2021-07-14 11:49:43,124| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 11:49:43,128| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.04168
INFO    2021-07-14 11:49:51,726| tensorflow: global_step/sec: 2.04168
INFO:tensorflow:global_step/sec: 2.01556
INFO    2021-07-14 11:50:01,649| tensorflow: global_step/sec: 2.01556
INFO:tensorflow:epoch = 49.7070715835141, loss = 425.76947, step = 572874 (30.704 sec)
INFO    2021-07-14 11:50:11,106| tensorflow: epoch = 49.7070715835141, loss = 425.76947, step = 572874 (30.704 sec)
INFO:tensorflow:global_step/sec: 2.00597
INFO    2021-07-14 11:50:11,619| tensorflow: global_step/sec: 2.00597
INFO:tensorflow:global_step/sec: 2.0347
INFO    2021-07-14 11:50:21,449| tensorflow: global_step/sec: 2.0347
INFO:tensorflow:global_step/sec: 2.01901
INFO    2021-07-14 11:50:31,355| tensorflow: global_step/sec: 2.01901
WARNING 2021-07-14 11:50:32,531| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 11:50:32,547| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.00478
INFO    2021-07-14 11:50:41,331| tensorflow: global_step/sec: 2.00478
INFO:tensorflow:epoch = 49.71245119305857, loss = 427.537, step = 572936 (30.735 sec)
INFO    2021-07-14 11:50:41,842| tensorflow: epoch = 49.71245119305857, loss = 427.537, step = 572936 (30.735 sec)
INFO:tensorflow:global_step/sec: 2.04479
INFO    2021-07-14 11:50:51,112| tensorflow: global_step/sec: 2.04479
INFO:tensorflow:global_step/sec: 2.00303
INFO    2021-07-14 11:51:01,097| tensorflow: global_step/sec: 2.00303
INFO:tensorflow:global_step/sec: 2.03575
INFO    2021-07-14 11:51:10,921| tensorflow: global_step/sec: 2.03575
INFO:tensorflow:epoch = 49.71783080260304, loss = 260.56018, step = 572998 (30.548 sec)
INFO    2021-07-14 11:51:12,390| tensorflow: epoch = 49.71783080260304, loss = 260.56018, step = 572998 (30.548 sec)
INFO:tensorflow:global_step/sec: 2.00694
INFO    2021-07-14 11:51:20,886| tensorflow: global_step/sec: 2.00694
INFO:tensorflow:global_step/sec: 2.03784
INFO    2021-07-14 11:51:30,701| tensorflow: global_step/sec: 2.03784
INFO:tensorflow:global_step/sec: 2.02651
INFO    2021-07-14 11:51:40,570| tensorflow: global_step/sec: 2.02651
INFO:tensorflow:epoch = 49.72321041214751, loss = 346.5287, step = 573060 (30.639 sec)
INFO    2021-07-14 11:51:43,029| tensorflow: epoch = 49.72321041214751, loss = 346.5287, step = 573060 (30.639 sec)
INFO:tensorflow:global_step/sec: 2.01729
INFO    2021-07-14 11:51:50,484| tensorflow: global_step/sec: 2.01729
INFO:tensorflow:global_step/sec: 2.00778
INFO    2021-07-14 11:52:00,445| tensorflow: global_step/sec: 2.00778
INFO:tensorflow:global_step/sec: 2.00152
INFO    2021-07-14 11:52:10,438| tensorflow: global_step/sec: 2.00152
INFO:tensorflow:epoch = 49.72859002169197, loss = 248.50562, step = 573122 (30.896 sec)
INFO    2021-07-14 11:52:13,925| tensorflow: epoch = 49.72859002169197, loss = 248.50562, step = 573122 (30.896 sec)
INFO:tensorflow:global_step/sec: 2.00822
INFO    2021-07-14 11:52:20,397| tensorflow: global_step/sec: 2.00822
WARNING 2021-07-14 11:52:29,375| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 11:52:29,383| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.03194
INFO    2021-07-14 11:52:30,240| tensorflow: global_step/sec: 2.03194
INFO:tensorflow:global_step/sec: 2.01979
INFO    2021-07-14 11:52:40,142| tensorflow: global_step/sec: 2.01979
INFO:tensorflow:epoch = 49.73396963123644, loss = 360.45462, step = 573184 (30.569 sec)
INFO    2021-07-14 11:52:44,494| tensorflow: epoch = 49.73396963123644, loss = 360.45462, step = 573184 (30.569 sec)
INFO:tensorflow:global_step/sec: 2.04706
INFO    2021-07-14 11:52:49,912| tensorflow: global_step/sec: 2.04706
WARNING 2021-07-14 11:52:50,032| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 11:52:50,041| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 1.99727
INFO    2021-07-14 11:52:59,925| tensorflow: global_step/sec: 1.99727
INFO:tensorflow:global_step/sec: 2.03147
INFO    2021-07-14 11:53:09,770| tensorflow: global_step/sec: 2.03147
INFO:tensorflow:epoch = 49.73934924078091, loss = 382.61694, step = 573246 (30.752 sec)
INFO    2021-07-14 11:53:15,246| tensorflow: epoch = 49.73934924078091, loss = 382.61694, step = 573246 (30.752 sec)
INFO:tensorflow:global_step/sec: 2.0049
INFO    2021-07-14 11:53:19,746| tensorflow: global_step/sec: 2.0049
WARNING 2021-07-14 11:53:22,363| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 11:53:22,364| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.03181
INFO    2021-07-14 11:53:29,590| tensorflow: global_step/sec: 2.03181
INFO:tensorflow:global_step/sec: 2.04043
INFO    2021-07-14 11:53:39,391| tensorflow: global_step/sec: 2.04043
INFO:tensorflow:epoch = 49.74472885032538, loss = 341.30624, step = 573308 (30.520 sec)
INFO    2021-07-14 11:53:45,766| tensorflow: epoch = 49.74472885032538, loss = 341.30624, step = 573308 (30.520 sec)
INFO:tensorflow:global_step/sec: 2.01942
INFO    2021-07-14 11:53:49,295| tensorflow: global_step/sec: 2.01942
INFO:tensorflow:global_step/sec: 2.03928
INFO    2021-07-14 11:53:59,103| tensorflow: global_step/sec: 2.03928
INFO:tensorflow:global_step/sec: 2.01799
INFO    2021-07-14 11:54:09,013| tensorflow: global_step/sec: 2.01799
INFO:tensorflow:epoch = 49.75010845986985, loss = 265.67908, step = 573370 (30.801 sec)
INFO    2021-07-14 11:54:16,567| tensorflow: epoch = 49.75010845986985, loss = 265.67908, step = 573370 (30.801 sec)
INFO:tensorflow:global_step/sec: 1.98827
INFO    2021-07-14 11:54:19,072| tensorflow: global_step/sec: 1.98827
INFO:tensorflow:global_step/sec: 2.01175
INFO    2021-07-14 11:54:29,014| tensorflow: global_step/sec: 2.01175
INFO:tensorflow:global_step/sec: 2.02494
INFO    2021-07-14 11:54:38,891| tensorflow: global_step/sec: 2.02494
INFO:tensorflow:epoch = 49.75548806941432, loss = 309.47165, step = 573432 (30.711 sec)
INFO    2021-07-14 11:54:47,278| tensorflow: epoch = 49.75548806941432, loss = 309.47165, step = 573432 (30.711 sec)
INFO:tensorflow:global_step/sec: 2.01935
INFO    2021-07-14 11:54:48,795| tensorflow: global_step/sec: 2.01935
INFO:tensorflow:global_step/sec: 2.01986
INFO    2021-07-14 11:54:58,697| tensorflow: global_step/sec: 2.01986
WARNING 2021-07-14 11:55:05,238| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.02174
INFO    2021-07-14 11:55:08,589| tensorflow: global_step/sec: 2.02174
INFO:tensorflow:epoch = 49.76086767895879, loss = 311.08768, step = 573494 (30.753 sec)
INFO    2021-07-14 11:55:18,031| tensorflow: epoch = 49.76086767895879, loss = 311.08768, step = 573494 (30.753 sec)
INFO:tensorflow:global_step/sec: 2.01115
INFO    2021-07-14 11:55:18,534| tensorflow: global_step/sec: 2.01115
INFO:tensorflow:global_step/sec: 2.04908
INFO    2021-07-14 11:55:28,294| tensorflow: global_step/sec: 2.04908
INFO:tensorflow:global_step/sec: 2.04866
INFO    2021-07-14 11:55:38,057| tensorflow: global_step/sec: 2.04866
INFO:tensorflow:global_step/sec: 2.03537
INFO    2021-07-14 11:55:47,883| tensorflow: global_step/sec: 2.03537
INFO:tensorflow:epoch = 49.76633405639913, loss = 325.26196, step = 573557 (30.833 sec)
INFO    2021-07-14 11:55:48,863| tensorflow: epoch = 49.76633405639913, loss = 325.26196, step = 573557 (30.833 sec)
INFO:tensorflow:global_step/sec: 1.99014
INFO    2021-07-14 11:55:57,932| tensorflow: global_step/sec: 1.99014
INFO:tensorflow:global_step/sec: 1.99123
INFO    2021-07-14 11:56:07,976| tensorflow: global_step/sec: 1.99123
INFO:tensorflow:global_step/sec: 2.00374
INFO    2021-07-14 11:56:17,958| tensorflow: global_step/sec: 2.00374
INFO:tensorflow:epoch = 49.77162689804772, loss = 228.00667, step = 573618 (30.596 sec)
INFO    2021-07-14 11:56:19,460| tensorflow: epoch = 49.77162689804772, loss = 228.00667, step = 573618 (30.596 sec)
INFO:tensorflow:global_step/sec: 2.00009
INFO    2021-07-14 11:56:27,957| tensorflow: global_step/sec: 2.00009
INFO:tensorflow:global_step/sec: 2.02237
INFO    2021-07-14 11:56:37,847| tensorflow: global_step/sec: 2.02237
INFO:tensorflow:global_step/sec: 2.0183
INFO    2021-07-14 11:56:47,756| tensorflow: global_step/sec: 2.0183
INFO:tensorflow:epoch = 49.77700650759219, loss = 343.2644, step = 573680 (30.822 sec)
INFO    2021-07-14 11:56:50,282| tensorflow: epoch = 49.77700650759219, loss = 343.2644, step = 573680 (30.822 sec)
WARNING 2021-07-14 11:56:51,396| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.01678
INFO    2021-07-14 11:56:57,673| tensorflow: global_step/sec: 2.01678
INFO:tensorflow:global_step/sec: 2.03026
INFO    2021-07-14 11:57:07,524| tensorflow: global_step/sec: 2.03026
WARNING 2021-07-14 11:57:14,094| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 11:57:14,097| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.01901
INFO    2021-07-14 11:57:17,430| tensorflow: global_step/sec: 2.01901
INFO:tensorflow:epoch = 49.78238611713666, loss = 257.6273, step = 573742 (30.631 sec)
INFO    2021-07-14 11:57:20,913| tensorflow: epoch = 49.78238611713666, loss = 257.6273, step = 573742 (30.631 sec)
INFO:tensorflow:global_step/sec: 2.01209
INFO    2021-07-14 11:57:27,369| tensorflow: global_step/sec: 2.01209
WARNING 2021-07-14 11:57:31,969| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 11:57:31,971| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.01301
INFO    2021-07-14 11:57:37,305| tensorflow: global_step/sec: 2.01301
INFO:tensorflow:global_step/sec: 2.03356
INFO    2021-07-14 11:57:47,140| tensorflow: global_step/sec: 2.03356
INFO:tensorflow:epoch = 49.78776572668113, loss = 323.45294, step = 573804 (30.650 sec)
INFO    2021-07-14 11:57:51,563| tensorflow: epoch = 49.78776572668113, loss = 323.45294, step = 573804 (30.650 sec)
INFO:tensorflow:global_step/sec: 2.0329
INFO    2021-07-14 11:57:56,978| tensorflow: global_step/sec: 2.0329
INFO:tensorflow:global_step/sec: 2.00639
INFO    2021-07-14 11:58:06,946| tensorflow: global_step/sec: 2.00639
INFO:tensorflow:global_step/sec: 2.03694
INFO    2021-07-14 11:58:16,765| tensorflow: global_step/sec: 2.03694
INFO:tensorflow:epoch = 49.7931453362256, loss = 318.06064, step = 573866 (30.703 sec)
INFO    2021-07-14 11:58:22,266| tensorflow: epoch = 49.7931453362256, loss = 318.06064, step = 573866 (30.703 sec)
INFO:tensorflow:global_step/sec: 2.0015
INFO    2021-07-14 11:58:26,757| tensorflow: global_step/sec: 2.0015
INFO:tensorflow:global_step/sec: 2.00852
INFO    2021-07-14 11:58:36,715| tensorflow: global_step/sec: 2.00852
INFO:tensorflow:global_step/sec: 2.01338
INFO    2021-07-14 11:58:46,648| tensorflow: global_step/sec: 2.01338
INFO:tensorflow:epoch = 49.79852494577007, loss = 282.2003, step = 573928 (30.870 sec)
INFO    2021-07-14 11:58:53,136| tensorflow: epoch = 49.79852494577007, loss = 282.2003, step = 573928 (30.870 sec)
INFO:tensorflow:global_step/sec: 2.00738
INFO    2021-07-14 11:58:56,612| tensorflow: global_step/sec: 2.00738
INFO:tensorflow:global_step/sec: 1.99074
INFO    2021-07-14 11:59:06,658| tensorflow: global_step/sec: 1.99074
INFO:tensorflow:global_step/sec: 2.01434
INFO    2021-07-14 11:59:16,587| tensorflow: global_step/sec: 2.01434
INFO:tensorflow:epoch = 49.80390455531453, loss = 348.13745, step = 573990 (30.725 sec)
INFO    2021-07-14 11:59:23,861| tensorflow: epoch = 49.80390455531453, loss = 348.13745, step = 573990 (30.725 sec)
INFO:tensorflow:global_step/sec: 2.06119
INFO    2021-07-14 11:59:26,290| tensorflow: global_step/sec: 2.06119
INFO:tensorflow:global_step/sec: 2.04145
INFO    2021-07-14 11:59:36,087| tensorflow: global_step/sec: 2.04145
INFO:tensorflow:global_step/sec: 2.0299
INFO    2021-07-14 11:59:45,940| tensorflow: global_step/sec: 2.0299
WARNING 2021-07-14 11:59:50,548| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 11:59:50,552| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:epoch = 49.80937093275488, loss = 394.936, step = 574053 (30.984 sec)
INFO    2021-07-14 11:59:54,845| tensorflow: epoch = 49.80937093275488, loss = 394.936, step = 574053 (30.984 sec)
INFO:tensorflow:global_step/sec: 2.00926
INFO    2021-07-14 11:59:55,894| tensorflow: global_step/sec: 2.00926
INFO:tensorflow:global_step/sec: 1.99541
INFO    2021-07-14 12:00:05,917| tensorflow: global_step/sec: 1.99541
INFO:tensorflow:global_step/sec: 2.0351
INFO    2021-07-14 12:00:15,744| tensorflow: global_step/sec: 2.0351
WARNING 2021-07-14 12:00:18,442| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 12:00:18,447| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:epoch = 49.81475054229935, loss = 384.68024, step = 574115 (30.910 sec)
INFO    2021-07-14 12:00:25,755| tensorflow: epoch = 49.81475054229935, loss = 384.68024, step = 574115 (30.910 sec)
INFO:tensorflow:global_step/sec: 1.99765
INFO    2021-07-14 12:00:25,756| tensorflow: global_step/sec: 1.99765

INFO:tensorflow:global_step/sec: 2.04244
INFO    2021-07-14 12:00:35,548| tensorflow: global_step/sec: 2.04244
INFO:tensorflow:global_step/sec: 2.03548
INFO    2021-07-14 12:00:45,374| tensorflow: global_step/sec: 2.03548
INFO:tensorflow:global_step/sec: 1.98808
INFO    2021-07-14 12:00:55,434| tensorflow: global_step/sec: 1.98808
INFO:tensorflow:epoch = 49.82013015184382, loss = 308.3121, step = 574177 (30.670 sec)
INFO    2021-07-14 12:00:56,425| tensorflow: epoch = 49.82013015184382, loss = 308.3121, step = 574177 (30.670 sec)
INFO:tensorflow:global_step/sec: 2.01251
INFO    2021-07-14 12:01:05,372| tensorflow: global_step/sec: 2.01251
INFO:tensorflow:global_step/sec: 2.01979
INFO    2021-07-14 12:01:15,274| tensorflow: global_step/sec: 2.01979
INFO:tensorflow:global_step/sec: 2.02158
INFO    2021-07-14 12:01:25,167| tensorflow: global_step/sec: 2.02158
INFO:tensorflow:epoch = 49.82550976138829, loss = 287.13052, step = 574239 (30.733 sec)
INFO    2021-07-14 12:01:27,158| tensorflow: epoch = 49.82550976138829, loss = 287.13052, step = 574239 (30.733 sec)
INFO:tensorflow:global_step/sec: 2.0154
INFO    2021-07-14 12:01:35,091| tensorflow: global_step/sec: 2.0154
INFO:tensorflow:global_step/sec: 2.02849
INFO    2021-07-14 12:01:44,950| tensorflow: global_step/sec: 2.02849
INFO:tensorflow:global_step/sec: 2.01113
INFO    2021-07-14 12:01:54,895| tensorflow: global_step/sec: 2.01113
INFO:tensorflow:epoch = 49.83088937093276, loss = 391.95816, step = 574301 (30.717 sec)
INFO    2021-07-14 12:01:57,875| tensorflow: epoch = 49.83088937093276, loss = 391.95816, step = 574301 (30.717 sec)
INFO:tensorflow:global_step/sec: 2.00616
INFO    2021-07-14 12:02:04,864| tensorflow: global_step/sec: 2.00616
INFO:tensorflow:global_step/sec: 2.03736
INFO    2021-07-14 12:02:14,681| tensorflow: global_step/sec: 2.03736
INFO:tensorflow:global_step/sec: 2.00803
INFO    2021-07-14 12:02:24,641| tensorflow: global_step/sec: 2.00803
INFO:tensorflow:epoch = 49.83626898047722, loss = 296.10175, step = 574363 (30.768 sec)
INFO    2021-07-14 12:02:28,643| tensorflow: epoch = 49.83626898047722, loss = 296.10175, step = 574363 (30.768 sec)
INFO:tensorflow:global_step/sec: 2.02281
INFO    2021-07-14 12:02:34,528| tensorflow: global_step/sec: 2.02281
INFO:tensorflow:global_step/sec: 2.00417
INFO    2021-07-14 12:02:44,507| tensorflow: global_step/sec: 2.00417
INFO:tensorflow:global_step/sec: 2.0061
INFO    2021-07-14 12:02:54,477| tensorflow: global_step/sec: 2.0061
INFO:tensorflow:epoch = 49.84164859002169, loss = 424.7765, step = 574425 (30.889 sec)
INFO    2021-07-14 12:02:59,531| tensorflow: epoch = 49.84164859002169, loss = 424.7765, step = 574425 (30.889 sec)
INFO:tensorflow:global_step/sec: 1.98477
INFO    2021-07-14 12:03:04,553| tensorflow: global_step/sec: 1.98477
INFO:tensorflow:global_step/sec: 2.00499
INFO    2021-07-14 12:03:14,529| tensorflow: global_step/sec: 2.00499
INFO:tensorflow:global_step/sec: 2.00699
INFO    2021-07-14 12:03:24,494| tensorflow: global_step/sec: 2.00699
INFO:tensorflow:epoch = 49.84702819956616, loss = 287.65744, step = 574487 (30.930 sec)
INFO    2021-07-14 12:03:30,461| tensorflow: epoch = 49.84702819956616, loss = 287.65744, step = 574487 (30.930 sec)
INFO:tensorflow:global_step/sec: 1.99962
INFO    2021-07-14 12:03:34,496| tensorflow: global_step/sec: 1.99962
INFO:tensorflow:global_step/sec: 2.04086
INFO    2021-07-14 12:03:44,295| tensorflow: global_step/sec: 2.04086
INFO:tensorflow:global_step/sec: 2.05059
INFO    2021-07-14 12:03:54,049| tensorflow: global_step/sec: 2.05059
INFO:tensorflow:epoch = 49.85240780911063, loss = 251.82747, step = 574549 (30.597 sec)
INFO    2021-07-14 12:04:01,058| tensorflow: epoch = 49.85240780911063, loss = 251.82747, step = 574549 (30.597 sec)
INFO:tensorflow:global_step/sec: 2.01038
INFO    2021-07-14 12:04:03,997| tensorflow: global_step/sec: 2.01038
INFO:tensorflow:global_step/sec: 2.03017
INFO    2021-07-14 12:04:13,848| tensorflow: global_step/sec: 2.03017
INFO:tensorflow:global_step/sec: 2.01946
INFO    2021-07-14 12:04:23,752| tensorflow: global_step/sec: 2.01946
INFO:tensorflow:epoch = 49.8577874186551, loss = 274.3741, step = 574611 (30.611 sec)
INFO    2021-07-14 12:04:31,669| tensorflow: epoch = 49.8577874186551, loss = 274.3741, step = 574611 (30.611 sec)
INFO:tensorflow:global_step/sec: 2.00924
INFO    2021-07-14 12:04:33,706| tensorflow: global_step/sec: 2.00924
WARNING 2021-07-14 12:04:38,385| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 12:04:38,388| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.01874
INFO    2021-07-14 12:04:43,613| tensorflow: global_step/sec: 2.01874
INFO:tensorflow:global_step/sec: 2.01981
INFO    2021-07-14 12:04:53,515| tensorflow: global_step/sec: 2.01981
INFO:tensorflow:epoch = 49.86316702819957, loss = 356.75604, step = 574673 (30.878 sec)
INFO    2021-07-14 12:05:02,547| tensorflow: epoch = 49.86316702819957, loss = 356.75604, step = 574673 (30.878 sec)
INFO:tensorflow:global_step/sec: 1.99107
INFO    2021-07-14 12:05:03,560| tensorflow: global_step/sec: 1.99107
INFO:tensorflow:global_step/sec: 1.98041
INFO    2021-07-14 12:05:13,659| tensorflow: global_step/sec: 1.98041
INFO:tensorflow:global_step/sec: 2.00966
INFO    2021-07-14 12:05:23,611| tensorflow: global_step/sec: 2.00966
INFO:tensorflow:epoch = 49.86854663774403, loss = 375.99954, step = 574735 (30.935 sec)
INFO    2021-07-14 12:05:33,482| tensorflow: epoch = 49.86854663774403, loss = 375.99954, step = 574735 (30.935 sec)
INFO:tensorflow:global_step/sec: 2.0259
INFO    2021-07-14 12:05:33,483| tensorflow: global_step/sec: 2.0259
WARNING 2021-07-14 12:05:37,058| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 12:05:37,064| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.02581
INFO    2021-07-14 12:05:43,356| tensorflow: global_step/sec: 2.02581
INFO:tensorflow:global_step/sec: 2.01283
INFO    2021-07-14 12:05:53,292| tensorflow: global_step/sec: 2.01283
INFO:tensorflow:global_step/sec: 2.0121
INFO    2021-07-14 12:06:03,232| tensorflow: global_step/sec: 2.0121
INFO:tensorflow:epoch = 49.8739262472885, loss = 307.5229, step = 574797 (30.749 sec)
INFO    2021-07-14 12:06:04,231| tensorflow: epoch = 49.8739262472885, loss = 307.5229, step = 574797 (30.749 sec)
INFO:tensorflow:global_step/sec: 1.99479
INFO    2021-07-14 12:06:13,258| tensorflow: global_step/sec: 1.99479
INFO:tensorflow:global_step/sec: 2.01506
INFO    2021-07-14 12:06:23,183| tensorflow: global_step/sec: 2.01506
INFO:tensorflow:global_step/sec: 2.00843
INFO    2021-07-14 12:06:33,141| tensorflow: global_step/sec: 2.00843
INFO:tensorflow:epoch = 49.87930585683297, loss = 385.00064, step = 574859 (30.890 sec)
INFO    2021-07-14 12:06:35,121| tensorflow: epoch = 49.87930585683297, loss = 385.00064, step = 574859 (30.890 sec)
INFO:tensorflow:global_step/sec: 2.02971
INFO    2021-07-14 12:06:42,995| tensorflow: global_step/sec: 2.02971
INFO:tensorflow:global_step/sec: 1.99611
INFO    2021-07-14 12:06:53,014| tensorflow: global_step/sec: 1.99611
WARNING 2021-07-14 12:06:54,201| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 12:06:54,203| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.01737
INFO    2021-07-14 12:07:02,928| tensorflow: global_step/sec: 2.01737
INFO:tensorflow:epoch = 49.88468546637744, loss = 290.95587, step = 574921 (30.817 sec)
INFO    2021-07-14 12:07:05,938| tensorflow: epoch = 49.88468546637744, loss = 290.95587, step = 574921 (30.817 sec)
INFO:tensorflow:global_step/sec: 2.0206
INFO    2021-07-14 12:07:12,826| tensorflow: global_step/sec: 2.0206
INFO:tensorflow:global_step/sec: 1.98417
INFO    2021-07-14 12:07:22,906| tensorflow: global_step/sec: 1.98417
INFO:tensorflow:global_step/sec: 2.02491
INFO    2021-07-14 12:07:32,783| tensorflow: global_step/sec: 2.02491

INFO:tensorflow:epoch = 49.89006507592191, loss = 319.05347, step = 574983 (30.896 sec)
INFO    2021-07-14 12:07:36,834| tensorflow: epoch = 49.89006507592191, loss = 319.05347, step = 574983 (30.896 sec)
INFO:tensorflow:global_step/sec: 1.9909
INFO    2021-07-14 12:07:42,829| tensorflow: global_step/sec: 1.9909
INFO:tensorflow:global_step/sec: 1.99001
INFO    2021-07-14 12:07:52,879| tensorflow: global_step/sec: 1.99001
INFO:tensorflow:global_step/sec: 2.00179
INFO    2021-07-14 12:08:02,870| tensorflow: global_step/sec: 2.00179
INFO:tensorflow:epoch = 49.89544468546638, loss = 341.0909, step = 575045 (30.969 sec)
INFO    2021-07-14 12:08:07,803| tensorflow: epoch = 49.89544468546638, loss = 341.0909, step = 575045 (30.969 sec)
INFO:tensorflow:global_step/sec: 2.02471
INFO    2021-07-14 12:08:12,748| tensorflow: global_step/sec: 2.02471
INFO:tensorflow:global_step/sec: 2.01388
INFO    2021-07-14 12:08:22,679| tensorflow: global_step/sec: 2.01388
INFO:tensorflow:global_step/sec: 2.03184
INFO    2021-07-14 12:08:32,522| tensorflow: global_step/sec: 2.03184
INFO:tensorflow:epoch = 49.90082429501085, loss = 316.31396, step = 575107 (30.675 sec)
INFO    2021-07-14 12:08:38,478| tensorflow: epoch = 49.90082429501085, loss = 316.31396, step = 575107 (30.675 sec)
INFO:tensorflow:global_step/sec: 2.00815
INFO    2021-07-14 12:08:42,482| tensorflow: global_step/sec: 2.00815
WARNING 2021-07-14 12:08:48,096| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.02806
INFO    2021-07-14 12:08:52,343| tensorflow: global_step/sec: 2.02806
INFO:tensorflow:global_step/sec: 2.03016
INFO    2021-07-14 12:09:02,195| tensorflow: global_step/sec: 2.03016
INFO:tensorflow:epoch = 49.90620390455531, loss = 248.74411, step = 575169 (30.659 sec)
INFO    2021-07-14 12:09:09,138| tensorflow: epoch = 49.90620390455531, loss = 248.74411, step = 575169 (30.659 sec)
INFO:tensorflow:global_step/sec: 2.03018
INFO    2021-07-14 12:09:12,046| tensorflow: global_step/sec: 2.03018
INFO:tensorflow:global_step/sec: 2.02133
INFO    2021-07-14 12:09:21,941| tensorflow: global_step/sec: 2.02133
WARNING 2021-07-14 12:09:30,876| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 12:09:30,881| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.03927
INFO    2021-07-14 12:09:31,748| tensorflow: global_step/sec: 2.03927
INFO:tensorflow:epoch = 49.91167028199566, loss = 270.14227, step = 575232 (30.968 sec)
INFO    2021-07-14 12:09:40,106| tensorflow: epoch = 49.91167028199566, loss = 270.14227, step = 575232 (30.968 sec)
INFO:tensorflow:global_step/sec: 2.02962
INFO    2021-07-14 12:09:41,602| tensorflow: global_step/sec: 2.02962
INFO:tensorflow:global_step/sec: 2.04689
INFO    2021-07-14 12:09:51,373| tensorflow: global_step/sec: 2.04689
INFO:tensorflow:global_step/sec: 2.01413
INFO    2021-07-14 12:10:01,303| tensorflow: global_step/sec: 2.01413
WARNING 2021-07-14 12:10:08,371| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:epoch = 49.91704989154013, loss = 355.02753, step = 575294 (30.596 sec)
INFO    2021-07-14 12:10:10,702| tensorflow: epoch = 49.91704989154013, loss = 355.02753, step = 575294 (30.596 sec)
INFO:tensorflow:global_step/sec: 2.01972
INFO    2021-07-14 12:10:11,205| tensorflow: global_step/sec: 2.01972
INFO:tensorflow:global_step/sec: 2.03248
INFO    2021-07-14 12:10:21,045| tensorflow: global_step/sec: 2.03248
INFO:tensorflow:global_step/sec: 1.99548
INFO    2021-07-14 12:10:31,068| tensorflow: global_step/sec: 1.99548
INFO:tensorflow:global_step/sec: 2.00512
INFO    2021-07-14 12:10:41,043| tensorflow: global_step/sec: 2.00512
INFO:tensorflow:epoch = 49.9224295010846, loss = 329.8729, step = 575356 (30.849 sec)
INFO    2021-07-14 12:10:41,551| tensorflow: epoch = 49.9224295010846, loss = 329.8729, step = 575356 (30.849 sec)
INFO:tensorflow:global_step/sec: 2.00955
INFO    2021-07-14 12:10:50,995| tensorflow: global_step/sec: 2.00955
WARNING 2021-07-14 12:10:57,572| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 12:10:57,581| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.0013
INFO    2021-07-14 12:11:00,989| tensorflow: global_step/sec: 2.0013
INFO:tensorflow:global_step/sec: 2.02503
INFO    2021-07-14 12:11:10,865| tensorflow: global_step/sec: 2.02503
INFO:tensorflow:epoch = 49.92780911062907, loss = 356.36877, step = 575418 (30.815 sec)
INFO    2021-07-14 12:11:12,366| tensorflow: epoch = 49.92780911062907, loss = 356.36877, step = 575418 (30.815 sec)
INFO:tensorflow:global_step/sec: 1.99549
INFO    2021-07-14 12:11:20,888| tensorflow: global_step/sec: 1.99549
INFO:tensorflow:global_step/sec: 2.02319
INFO    2021-07-14 12:11:30,773| tensorflow: global_step/sec: 2.02319
INFO:tensorflow:global_step/sec: 2.02573
INFO    2021-07-14 12:11:40,646| tensorflow: global_step/sec: 2.02573
INFO:tensorflow:epoch = 49.93318872017354, loss = 249.58826, step = 575480 (30.768 sec)
INFO    2021-07-14 12:11:43,134| tensorflow: epoch = 49.93318872017354, loss = 249.58826, step = 575480 (30.768 sec)
INFO:tensorflow:global_step/sec: 2.00788
INFO    2021-07-14 12:11:50,607| tensorflow: global_step/sec: 2.00788
WARNING 2021-07-14 12:11:52,696| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 12:11:52,703| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.03068
INFO    2021-07-14 12:12:00,456| tensorflow: global_step/sec: 2.03068
INFO:tensorflow:global_step/sec: 2.01335
INFO    2021-07-14 12:12:10,389| tensorflow: global_step/sec: 2.01335
INFO:tensorflow:epoch = 49.938568329718, loss = 400.88022, step = 575542 (30.736 sec)
INFO    2021-07-14 12:12:13,870| tensorflow: epoch = 49.938568329718, loss = 400.88022, step = 575542 (30.736 sec)
INFO:tensorflow:global_step/sec: 1.99171
INFO    2021-07-14 12:12:20,431| tensorflow: global_step/sec: 1.99171
WARNING 2021-07-14 12:12:22,112| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 12:12:23,073| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 1.99719
INFO    2021-07-14 12:12:30,445| tensorflow: global_step/sec: 1.99719
INFO:tensorflow:global_step/sec: 2.04474
INFO    2021-07-14 12:12:40,226| tensorflow: global_step/sec: 2.04474
INFO:tensorflow:epoch = 49.94394793926247, loss = 380.66428, step = 575604 (30.937 sec)
INFO    2021-07-14 12:12:44,807| tensorflow: epoch = 49.94394793926247, loss = 380.66428, step = 575604 (30.937 sec)
INFO:tensorflow:global_step/sec: 1.99514
INFO    2021-07-14 12:12:50,251| tensorflow: global_step/sec: 1.99514
INFO:tensorflow:global_step/sec: 2.01204
INFO    2021-07-14 12:13:00,191| tensorflow: global_step/sec: 2.01204
INFO:tensorflow:global_step/sec: 2.01913
INFO    2021-07-14 12:13:10,096| tensorflow: global_step/sec: 2.01913
INFO:tensorflow:epoch = 49.94932754880694, loss = 310.21817, step = 575666 (30.798 sec)
INFO    2021-07-14 12:13:15,605| tensorflow: epoch = 49.94932754880694, loss = 310.21817, step = 575666 (30.798 sec)
INFO:tensorflow:global_step/sec: 2.00206
INFO    2021-07-14 12:13:20,086| tensorflow: global_step/sec: 2.00206
INFO:tensorflow:global_step/sec: 2.00701
INFO    2021-07-14 12:13:30,051| tensorflow: global_step/sec: 2.00701
INFO:tensorflow:global_step/sec: 2.00694
INFO    2021-07-14 12:13:40,016| tensorflow: global_step/sec: 2.00694
INFO:tensorflow:epoch = 49.95470715835141, loss = 470.0162, step = 575728 (30.855 sec)
INFO    2021-07-14 12:13:46,461| tensorflow: epoch = 49.95470715835141, loss = 470.0162, step = 575728 (30.855 sec)
INFO:tensorflow:global_step/sec: 2.00733
INFO    2021-07-14 12:13:49,980| tensorflow: global_step/sec: 2.00733

WARNING 2021-07-14 12:13:59,496| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.03241
INFO    2021-07-14 12:13:59,820| tensorflow: global_step/sec: 2.03241
INFO:tensorflow:global_step/sec: 2.02896
INFO    2021-07-14 12:14:09,678| tensorflow: global_step/sec: 2.02896
WARNING 2021-07-14 12:14:14,804| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
WARNING 2021-07-14 12:14:14,811| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:epoch = 49.96008676789588, loss = 313.15503, step = 575790 (30.733 sec)
INFO    2021-07-14 12:14:17,193| tensorflow: epoch = 49.96008676789588, loss = 313.15503, step = 575790 (30.733 sec)
INFO:tensorflow:global_step/sec: 2.00559
INFO    2021-07-14 12:14:19,650| tensorflow: global_step/sec: 2.00559
WARNING 2021-07-14 12:14:22,300| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 2.03447
INFO    2021-07-14 12:14:29,480| tensorflow: global_step/sec: 2.03447
INFO:tensorflow:global_step/sec: 2.01184
INFO    2021-07-14 12:14:39,421| tensorflow: global_step/sec: 2.01184
INFO:tensorflow:epoch = 49.96546637744035, loss = 257.5868, step = 575852 (30.572 sec)
INFO    2021-07-14 12:14:47,765| tensorflow: epoch = 49.96546637744035, loss = 257.5868, step = 575852 (30.572 sec)
INFO:tensorflow:global_step/sec: 2.02165
INFO    2021-07-14 12:14:49,314| tensorflow: global_step/sec: 2.02165
INFO:tensorflow:global_step/sec: 2.03311
INFO    2021-07-14 12:14:59,151| tensorflow: global_step/sec: 2.03311
INFO:tensorflow:global_step/sec: 2.00988
INFO    2021-07-14 12:15:09,102| tensorflow: global_step/sec: 2.00988
INFO:tensorflow:epoch = 49.97084598698482, loss = 331.0822, step = 575914 (30.799 sec)
INFO    2021-07-14 12:15:18,564| tensorflow: epoch = 49.97084598698482, loss = 331.0822, step = 575914 (30.799 sec)
INFO:tensorflow:global_step/sec: 2.00588
INFO    2021-07-14 12:15:19,073| tensorflow: global_step/sec: 2.00588
INFO:tensorflow:global_step/sec: 2.03391
INFO    2021-07-14 12:15:28,906| tensorflow: global_step/sec: 2.03391
WARNING 2021-07-14 12:15:30,592| driveix.bpnet.dataloaders.processors.label_processor: Limb length is zeo. Skipping part affinity label.
INFO:tensorflow:global_step/sec: 1.99537
INFO    2021-07-14 12:15:38,929| tensorflow: global_step/sec: 1.99537
INFO:tensorflow:global_step/sec: 1.97418
INFO    2021-07-14 12:15:49,060| tensorflow: global_step/sec: 1.97418
INFO:tensorflow:epoch = 49.97622559652928, loss = 313.74838, step = 575976 (31.007 sec)
INFO    2021-07-14 12:15:49,571| tensorflow: epoch = 49.97622559652928, loss = 313.74838, step = 575976 (31.007 sec)
INFO:tensorflow:global_step/sec: 1.94355
INFO    2021-07-14 12:15:59,351| tensorflow: global_step/sec: 1.94355
INFO:tensorflow:global_step/sec: 1.81373
INFO    2021-07-14 12:16:10,378| tensorflow: global_step/sec: 1.81373

Morganh · July 14, 2021, 1:27pm

You are training with one GPU, right?
Also, could you share the training YAML file?
If possible, could you double-check the steps against the bpnet Jupyter notebook from tlt_cv_samples_v1.1.0.zip (NVIDIA TAO Documentation)?

neuroSparK · July 14, 2021, 2:02pm

Right, I am training with one T4 GPU, although the model is intended for V100. Here is my training yaml file:

__class_name__: BpNetTrainer
checkpoint_dir: /workspace/tlt-experiments/bpnet/models/exp_m1_unpruned
log_every_n_secs: 30
checkpoint_n_epoch: 5
num_epoch: 100
summary_every_n_steps: 20
infrequent_summary_every_n_steps: 0
validation_every_n_epoch: 5
max_ckpt_to_keep: 100
random_seed: 42
pretrained_weights: /workspace/tlt-experiments/bpnet/pretrained_vgg19/tlt_bodyposenet_vvgg19/vgg_19.hdf5
load_graph: False
finetuning_config:
  is_finetune_exp: False
  checkpoint_path: null
  ckpt_epoch_num: 0
use_stagewise_lr_multipliers: True
dataloader:
  __class_name__: BpNetDataloader
  batch_size: 10
  pose_config:
    __class_name__: BpNetPoseConfig
    target_shape: [32, 32]
    pose_config_path: /workspace/examples/bpnet/model_pose_config/bpnet_18joints.json
  image_config:
    image_dims:
      height: 256
      width: 256
      channels: 3
    image_encoding: jpg
  dataset_config:
    root_data_path: /workspace/tlt-experiments/bpnet/data/
    train_records_folder_path: /workspace/tlt-experiments/bpnet/data
    train_records_path: [train-fold-000-of-001]
    val_records_folder_path: /workspace/tlt-experiments/bpnet/data
    val_records_path: [val-fold-000-of-001]
    dataset_specs:
      coco: /workspace/examples/bpnet/data_pose_config/coco_spec.json
  normalization_params: 
    image_scale: [256.0, 256.0, 256.0]
    image_offset: [0.5, 0.5, 0.5]
    mask_scale: [255.0]
    mask_offset: [0.0]
  augmentation_config:
    __class_name__: AugmentationConfig
    spatial_augmentation_mode: person_centric
    spatial_aug_params:
      flip_lr_prob: 0.5
      flip_tb_prob: 0.0
      rotate_deg_max: 40.0
      rotate_deg_min: -40.0
      zoom_prob: 0.0
      zoom_ratio_min: 1.0
      zoom_ratio_max: 1.0
      translate_max_x: 40.0
      translate_min_x: -40.0
      translate_max_y: 40.0
      translate_min_y: -40.0
      use_translate_ratio: False
      translate_ratio_max: 0.2
      translate_ratio_min: -0.2
      target_person_scale: 0.6
    identity_spatial_aug_params:
      null
  label_processor_config:
    paf_gaussian_sigma: 0.03
    heatmap_gaussian_sigma: 7.0
    paf_ortho_dist_thresh: 1.0
  shuffle_buffer_size: 20000
model:
  __class_name__: BpNetLiteModel
  backbone_attributes:
    architecture: vgg
    mtype: default
    use_bias: False
  stages: 3
  heat_channels: 19
  paf_channels: 38
  use_self_attention: False
  data_format: channels_last
  use_bias: True
  regularization_type: l1
  kernel_regularization_factor: 5.0e-4
  bias_regularization_factor: 0.0
  kernel_initializer: random_normal
optimizer:
  __class_name__: WeightedMomentumOptimizer
  learning_rate_schedule:
    __class_name__: SoftstartAnnealingLearningRateSchedule
    soft_start: 0.05
    annealing: 0.5
    base_learning_rate: 2.e-5
    min_learning_rate: 8.e-08
    last_step: null
  grad_weights_dict: null
  weight_default_value: 1.0
  momentum: 0.9
  use_nesterov: False
loss:
  __class_name__: BpNetLoss
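
For readers trying to relate the `learning_rate_schedule` block above to what the optimizer sees at a given epoch, here is a minimal sketch of a soft-start annealing schedule. It is not taken from the TLT source. It assumes the learning rate ramps up exponentially from `min_learning_rate` to `base_learning_rate` over the first `soft_start` fraction of training, holds there, and then decays exponentially back to `min_learning_rate` once progress passes `annealing`. The function name `softstart_annealing_lr` and the definition of progress as `epoch / num_epoch` are assumptions made for illustration.

```python
import math


def softstart_annealing_lr(progress, base_lr=2e-5, min_lr=8e-8,
                           soft_start=0.05, annealing=0.5):
    """Learning rate at a training progress in [0, 1].

    Assumed shape (not verified against the TLT implementation):
    exponential ramp min_lr -> base_lr, plateau, exponential decay base_lr -> min_lr.
    Defaults mirror the values in the spec file above.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    log_min, log_base = math.log(min_lr), math.log(base_lr)
    if progress < soft_start:
        # Warm-up: interpolate in log space from min_lr up to base_lr.
        frac = progress / soft_start
        return math.exp(log_min + frac * (log_base - log_min))
    if progress <= annealing:
        # Plateau at the base learning rate.
        return base_lr
    # Annealing: interpolate in log space from base_lr back down to min_lr.
    frac = (progress - annealing) / (1.0 - annealing)
    return math.exp(log_base - frac * (log_base - log_min))


# Progress is assumed to be epoch / num_epoch, with num_epoch: 100 as in the spec.
for epoch in (0, 5, 35, 50, 82, 100):
    print(epoch, softstart_annealing_lr(epoch / 100))
```

Under these assumptions, epoch 50 is where the decay starts and any later epoch falls inside the annealing phase, so the learning rate is still dropping in the second half of the run.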

neuroSparK · July 17, 2021, 4:42pm

After 82 epochs, the loss remains the same. Could this be caused by a bad training config file? @Morganh please help.

INFO:tensorflow:epoch = 82.21648590021692, loss = 327.33383, step = 947545 (30.651 sec)
INFO    2021-07-17 16:40:50,438| tensorflow: epoch = 82.21648590021692, loss = 327.33383, step = 947545 (30.651 sec)

Morganh · July 17, 2021, 4:48pm

If you followed the blog but cannot reproduce its results, I am not yet sure what went wrong in your training.
I will need to follow the blog and train on my side. In the meantime, please double-check your steps and, if possible, refer to the bpnet Jupyter notebook as well. Thanks.

neuroSparK · July 17, 2021, 5:07pm

I have checked again and followed everything step by step. My only doubt is that the pretrained model is for V100, whereas I am using a T4 GPU. Could that be the reason?

Morganh · July 17, 2021, 5:24pm

No, that would not be the reason. One option is to tune parameters such as annealing, base_learning_rate, and batch size.
Anyway, I will try to follow the blog to train.

cschumacher · July 17, 2021, 7:18pm

@neuroSparK -

I was also training bpnet this week and noticed the same thing as you: the loss roughly converged to around 300 by 40 epochs. I followed the tutorial exactly, and around 100 epochs the loss was jumping between 175 and 250 at the checkpoints, so there is still some room for improvement after 82 epochs. I expected the results to be poor because that is a high loss value, but when I use the model for inference I am (qualitatively) getting very good results that appear to beat several other open source options: fewer false positives, more keypoints detected, and very fast inference. I haven’t done a full evaluation benchmark yet to confirm the results that the tutorial says to expect, but I get identical results on the sample images so far, which is a good sign.

Unless I missed it somewhere, the tutorial doesn’t say what we should expect the loss value to be by the end of training or even specify what the exact loss function is (please correct me if I’m wrong). It’s possible that this is in the expected range for the COCO dataset and specific loss used when training bpnet?
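
The per-step loss values logged earlier in this thread swing widely (roughly 250 to 470 within a fraction of an epoch), so a single reading says little about the trend. As a rough way to check whether the loss is still drifting down, the sketch below parses the `INFO:tensorflow:epoch = ..., loss = ..., step = ...` lines and smooths them with a moving average. The helper names `parse_losses` and `moving_average` are made up for this example, and only the un-timestamped copy of each line is matched so the duplicated log entries are not counted twice.

```python
import re

# Matches only the "INFO:tensorflow:" copy of each line, not the timestamped duplicate.
LOSS_LINE = re.compile(
    r"^INFO:tensorflow:epoch = ([\d.]+), loss = ([\d.]+), step = (\d+)"
)


def parse_losses(lines):
    """Return a list of (epoch, loss, step) tuples found in the log lines."""
    records = []
    for line in lines:
        match = LOSS_LINE.match(line)
        if match:
            records.append(
                (float(match.group(1)), float(match.group(2)), int(match.group(3)))
            )
    return records


def moving_average(values, window):
    """Trailing moving average; early entries average over what is available."""
    smoothed, total = [], 0.0
    for i, value in enumerate(values):
        total += value
        if i >= window:
            total -= values[i - window]
        smoothed.append(total / min(i + 1, window))
    return smoothed


# A few lines copied from the training output posted above.
SAMPLE = [
    "INFO:tensorflow:epoch = 49.78238611713666, loss = 257.6273, step = 573742 (30.631 sec)",
    "INFO    2021-07-14 11:57:20,913| tensorflow: epoch = 49.78238611713666, loss = 257.6273, step = 573742 (30.631 sec)",
    "INFO:tensorflow:global_step/sec: 2.01209",
    "INFO:tensorflow:epoch = 49.78776572668113, loss = 323.45294, step = 573804 (30.650 sec)",
    "INFO:tensorflow:epoch = 49.7931453362256, loss = 318.06064, step = 573866 (30.703 sec)",
]

records = parse_losses(SAMPLE)
losses = [loss for _, loss, _ in records]
print(moving_average(losses, window=2))
# Steps per epoch implied by the log: step divided by fractional epoch.
print(round(records[0][2] / records[0][0]))
```

In practice you would feed the whole log file to `parse_losses` and use a window of a few hundred entries; comparing the smoothed value at, say, epoch 40 and epoch 80 gives a fairer picture than any single logged loss.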

neuroSparK · July 18, 2021, 9:09am

Thanks for your valuable info. I’ll check the performance after the training finishes.

system · September 27, 2021, 1:06pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
