Transfer learning on the Cityscapes dataset with UNet PeopleSemSegNet gives poor generalization performance

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) : V100
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) : Unet
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)

Configuration of the TAO Toolkit Instance

dockers: 
        nvidia/tao/tao-toolkit-tf: 
                docker_registry: nvcr.io
                docker_tag: v3.21.08-py3
                tasks: 
                        1. augment
                        2. bpnet
                        3. classification
                        4. detectnet_v2
                        5. dssd
                        6. emotionnet
                        7. faster_rcnn
                        8. fpenet
                        9. gazenet
                        10. gesturenet
                        11. heartratenet
                        12. lprnet
                        13. mask_rcnn
                        14. multitask_classification
                        15. retinanet
                        16. ssd
                        17. unet
                        18. yolo_v3
                        19. yolo_v4
                        20. converter
        nvidia/tao/tao-toolkit-pyt: 
                docker_registry: nvcr.io
                docker_tag: v3.21.08-py3
                tasks: 
                        1. speech_to_text
                        2. speech_to_text_citrinet
                        3. text_classification
                        4. question_answering
                        5. token_classification
                        6. intent_slot_classification
                        7. punctuation_and_capitalization
        nvidia/tao/tao-toolkit-lm: 
                docker_registry: nvcr.io
                docker_tag: v3.21.08-py3
                tasks: 
                        1. n_gram
format_version: 1.0
toolkit_version: 3.21.08
published_date: 08/17/2021

• Training spec file(If have, please share here)

random_seed: 42
model_config {
  num_layers: 18
  model_input_width: 960
  model_input_height: 544
  model_input_channels: 3
  all_projections: true
  arch: "vanilla_unet_dynamic"
  use_batch_norm: true
  training_precision {
    backend_floatx: FLOAT32
  }
}

training_config {
  batch_size: 3
  epochs: 50
  log_summary_steps: 10
  checkpoint_interval: 1
  loss: "cross_dice_sum"
  learning_rate: 0.0001
  regularizer {
    type: L2
    weight: 2e-5
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
}

dataset_config {
  dataset: "custom"
  augment: False
  augmentation_config {
    spatial_augmentation {
      hflip_probability: 0.5
      vflip_probability: 0.5
      crop_and_resize_prob: 0.5
    }
    brightness_augmentation {
      delta: 0.2
    }
  }
  input_image_type: "color"
  train_images_path: "/workspace/tao-experiments/data/images/train"
  train_masks_path: "/workspace/tao-experiments/data/masks/train"

  val_images_path: "/workspace/tao-experiments/data/images/val"
  val_masks_path: "/workspace/tao-experiments/data/masks/val"

  test_images_path: "/workspace/tao-experiments/data/images/test"

  data_class_config {
    target_classes {
      name: "background"
      mapping_class: "background"
      label_id: 0
    }
    target_classes {
      name: "person"
      mapping_class: "person"
      label_id: 255
    }
  }
}
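
Since the spec above declares only two label IDs (0 for background, 255 for person), one thing worth ruling out is a mask file that contains other pixel values. The following is a minimal sketch of such a check, assuming each mask has already been loaded into a NumPy array. The helper name unexpected_mask_values is hypothetical and is not part of TAO Toolkit.

```python
import numpy as np

# label_id values declared in data_class_config of the spec above.
ALLOWED_LABEL_IDS = (0, 255)


def unexpected_mask_values(mask, allowed=ALLOWED_LABEL_IDS):
    """Return the sorted pixel values in `mask` that are not declared label IDs.

    `mask` is a 2-D array of pixel values (hypothetical input; how the mask
    PNG is loaded into an array is left to the caller).
    """
    values = np.unique(np.asarray(mask))
    return sorted(int(v) for v in values if v not in allowed)


if __name__ == "__main__":
    clean = np.array([[0, 255], [255, 0]], dtype=np.uint8)
    dirty = np.array([[0, 255], [1, 128]], dtype=np.uint8)
    print(unexpected_mask_values(clean))
    print(unexpected_mask_values(dirty))
```

An empty list means the mask only uses the declared label IDs. Any other result lists the stray values to inspect.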

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

  1. I want to segment people in CCTV footage.

  2. To do this, I created mask files from the Cityscapes dataset that keep only the person class.

  3. I then trained on this dataset, using tao-peoplesemsegnet as the pretrained model.

  4. The metrics improved during training (comparing epoch 0 with epoch 5). However, the model does not recognize people in the CCTV video. (Is it because the people in the video are too small?)

  5. When I evaluate the trained model on the Cityscapes test set, the results are good, so I suspect overfitting.

  6. I want a model that can segment even small people (on the Cityscapes test data, the model does segment small people as well).

  7. I do not have a separate dataset for the target task (CCTV data).

  8. Any suggestions on how to solve this problem?

  9. Additionally, I get a NanLossDuringTrainingError at epoch 5. The training log is included below.
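
For context on step 2, the mask creation can be sketched roughly as follows. This is a minimal sketch and not the exact script used. It assumes the masks are derived from the Cityscapes labelIds annotation arrays, where person has label ID 24 and rider has label ID 25. It writes 255 for person and 0 for background, matching the label_id values in the spec above. The function name person_mask and the include_rider option are illustrative only, and whether rider pixels should count as person is an open choice.

```python
import numpy as np

# Cityscapes labelIds for the human classes (per the dataset's label definitions).
PERSON_ID = 24
RIDER_ID = 25


def person_mask(label_ids, include_rider=False):
    """Convert a Cityscapes labelIds array into a binary person mask.

    Person pixels become 255 and everything else becomes 0, which matches
    the label_id values declared in data_class_config.
    """
    label_ids = np.asarray(label_ids)
    keep = [PERSON_ID, RIDER_ID] if include_rider else [PERSON_ID]
    return np.where(np.isin(label_ids, keep), 255, 0).astype(np.uint8)


if __name__ == "__main__":
    # Tiny stand-in for a labelIds image: road (7), person (24), rider (25), unlabeled (0).
    demo = np.array([[7, 24], [25, 0]], dtype=np.uint8)
    print(person_mask(demo))
    print(person_mask(demo, include_rider=True))
```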

For multi-GPU, change --gpus based on your machine.
2021-10-26 03:41:48,941 [INFO] root: Registry: ['nvcr.io']
2021-10-26 03:41:49,170 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/checkpoint_saver_hook.py:21: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/pretrained_restore_hook.py:23: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/pretrained_restore_hook.py:23: The name tf.logging.WARN is deprecated. Please use tf.compat.v1.logging.WARN instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py:405: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

Loading experiment spec at /home/ubuntu/tao/unet/specs/unet_train_resnet_unet_isbi.txt.
2021-10-26 03:42:06,411 [INFO] __main__: Loading experiment spec at /home/ubuntu/tao/unet/specs/unet_train_resnet_unet_isbi.txt.
2021-10-26 03:42:06,414 [INFO] iva.unet.spec_handler.spec_loader: Merging specification from /home/ubuntu/tao/unet/specs/unet_train_resnet_unet_isbi.txt
2021-10-26 03:42:06,417 [INFO] root: Initializing the pre-trained weights from /workspace/tao-experiments/unet/test_unpruned/weights/peoplesemsegnet.tlt
2021-10-26 03:42:06,417 [INFO] iva.unet.model.utilities: Loading weights from /workspace/tao-experiments/unet/test_unpruned/weights/peoplesemsegnet.tlt
2021-10-26 03:42:17,753 [INFO] iva.unet.model.utilities: Label Id 0: Train Id 0
2021-10-26 03:42:17,753 [INFO] iva.unet.model.utilities: Label Id 255: Train Id 1
2021-10-26 03:42:17,755 [INFO] iva.unet.hooks.latest_checkpoint: Getting the latest checkpoint for restoring /workspace/tao-experiments/unet/test_unpruned/model.step-4960.tlt
INFO:tensorflow:Using config: {'_model_dir': '/workspace/tao-experiments/unet/test_unpruned', '_tf_random_seed': None, '_save_summary_steps': 5, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 38
gpu_options {
  allow_growth: true
  visible_device_list: "0"
  force_gpu_compatible: true
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f36f6db77f0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
2021-10-26 03:42:22,103 [INFO] tensorflow: Using config: {'_model_dir': '/workspace/tao-experiments/unet/test_unpruned', '_tf_random_seed': None, '_save_summary_steps': 5, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 38
gpu_options {
  allow_growth: true
  visible_device_list: "0"
  force_gpu_compatible: true
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f36f6db77f0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

Phase train: Total 2975 files.
2021-10-26 03:42:22,195 [INFO] iva.unet.model.utilities: The total number of training samples 2975 and the batch size per                 GPU 3
2021-10-26 03:42:22,195 [INFO] iva.unet.model.utilities: Cannot iterate over exactly 2975 samples with a batch size of 3; each epoch will therefore take one extra step.
2021-10-26 03:42:22,195 [INFO] iva.unet.model.utilities: Steps per epoch taken: 992
Running for 50 Epochs
2021-10-26 03:42:22,195 [INFO] __main__: Running for 50 Epochs
INFO:tensorflow:Create CheckpointSaverHook.
2021-10-26 03:42:22,195 [INFO] tensorflow: Create CheckpointSaverHook.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

2021-10-26 03:42:23,113 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:Entity <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,157 [WARNING] tensorflow: Entity <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f368e19b048> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f368e19b048>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,177 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f368e19b048> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f368e19b048>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

2021-10-26 03:42:23,180 [WARNING] tensorflow: 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

/opt/nvidia/third_party/keras/tensorflow_backend.py:356: UserWarning: Creating resources inside a function passed to Dataset.map() is not supported. Create each resource outside the function, and capture it inside the function to use it.
  self, _map_func_set_random_wrapper, num_parallel_calls=num_parallel_calls
WARNING:tensorflow:Entity <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,190 [WARNING] tensorflow: Entity <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,200 [WARNING] tensorflow: Entity <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,209 [WARNING] tensorflow: Entity <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/utils/data_loader.py:414: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

2021-10-26 03:42:23,210 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/utils/data_loader.py:414: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f35554e82f0> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f35554e82f0>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,224 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f35554e82f0> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f35554e82f0>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f35554e8c80> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f35554e8c80>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,310 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f35554e8c80> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f35554e8c80>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,317 [WARNING] tensorflow: Entity <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f355550c048> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f355550c048>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,326 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f355550c048> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f355550c048>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO:tensorflow:Calling model_fn.
2021-10-26 03:42:23,347 [INFO] tensorflow: Calling model_fn.
2021-10-26 03:42:23,347 [INFO] iva.unet.utils.model_fn: {'exec_mode': 'train', 'model_dir': '/workspace/tao-experiments/unet/test_unpruned', 'resize_padding': False, 'resize_method': 'BILINEAR', 'log_dir': None, 'batch_size': 3, 'learning_rate': 9.999999747378752e-05, 'crossvalidation_idx': None, 'max_steps': None, 'regularizer_type': 2, 'weight_decay': 1.9999999494757503e-05, 'log_summary_steps': 10, 'warmup_steps': 0, 'augment': False, 'use_amp': False, 'use_trt': False, 'use_xla': False, 'loss': 'cross_dice_sum', 'epochs': 50, 'pretrained_weights_file': None, 'unet_model': <iva.unet.model.vanilla_unet_dynamic.VanillaUnetDynamic object at 0x7f368e193b38>, 'key': 'tlt_encode', 'experiment_spec': random_seed: 42
dataset_config {
  dataset: "custom"
  input_image_type: "color"
  train_images_path: "/workspace/tao-experiments/data/images/train"
  train_masks_path: "/workspace/tao-experiments/data/masks/train"
  val_images_path: "/workspace/tao-experiments/data/images/val"
  val_masks_path: "/workspace/tao-experiments/data/masks/val"
  test_images_path: "/workspace/tao-experiments/data/images/test"
  data_class_config {
    target_classes {
      name: "background"
      mapping_class: "background"
    }
    target_classes {
      name: "person"
      label_id: 255
      mapping_class: "person"
    }
  }
  augmentation_config {
    spatial_augmentation {
      hflip_probability: 0.5
      vflip_probability: 0.5
      crop_and_resize_prob: 0.5
    }
    brightness_augmentation {
      delta: 0.20000000298023224
    }
  }
}
model_config {
  num_layers: 18
  use_batch_norm: true
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "vanilla_unet_dynamic"
  all_projections: true
  model_input_height: 544
  model_input_width: 960
  model_input_channels: 3
}
training_config {
  batch_size: 3
  regularizer {
    type: L2
    weight: 1.9999999494757503e-05
  }
  optimizer {
    adam {
      epsilon: 9.99999993922529e-09
      beta1: 0.8999999761581421
      beta2: 0.9990000128746033
    }
  }
  checkpoint_interval: 1
  log_summary_steps: 10
  learning_rate: 9.999999747378752e-05
  loss: "cross_dice_sum"
  epochs: 50
}
, 'seed': 42, 'benchmark': False, 'temp_dir': '/tmp/tmpedlz_8q1', 'num_classes': 2, 'start_step': 4960, 'checkpoint_interval': 1, 'model_json': None, 'load_graph': False, 'phase': None}
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

2021-10-26 03:42:23,348 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

2021-10-26 03:42:23,349 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

2021-10-26 03:42:23,375 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

2021-10-26 03:42:23,381 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

2021-10-26 03:42:23,438 [WARNING] tensorflow: From /opt/nvidia/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 3, 544, 960)  0                                            
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 32, 544, 960) 896         input_1[0][0]                    
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 32, 544, 960) 128         conv2d_1[0][0]                   
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 32, 544, 960) 0           batch_normalization_1[0][0]      
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, 32, 544, 960) 9248        activation_1[0][0]               
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 32, 544, 960) 128         conv2d_2[0][0]                   
__________________________________________________________________________________________________
activation_2 (Activation)       (None, 32, 544, 960) 0           batch_normalization_2[0][0]      
__________________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)  (None, 32, 272, 480) 0           activation_2[0][0]               
__________________________________________________________________________________________________
conv2d_3 (Conv2D)               (None, 64, 272, 480) 18496       max_pooling2d_1[0][0]            
__________________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, 64, 272, 480) 256         conv2d_3[0][0]                   
__________________________________________________________________________________________________
activation_3 (Activation)       (None, 64, 272, 480) 0           batch_normalization_3[0][0]      
__________________________________________________________________________________________________
conv2d_4 (Conv2D)               (None, 64, 272, 480) 36928       activation_3[0][0]               
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 64, 272, 480) 256         conv2d_4[0][0]                   
__________________________________________________________________________________________________
activation_4 (Activation)       (None, 64, 272, 480) 0           batch_normalization_4[0][0]      
__________________________________________________________________________________________________
max_pooling2d_2 (MaxPooling2D)  (None, 64, 136, 240) 0           activation_4[0][0]               
__________________________________________________________________________________________________
conv2d_5 (Conv2D)               (None, 128, 136, 240 73856       max_pooling2d_2[0][0]            
__________________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, 128, 136, 240 512         conv2d_5[0][0]                   
__________________________________________________________________________________________________
activation_5 (Activation)       (None, 128, 136, 240 0           batch_normalization_5[0][0]      
__________________________________________________________________________________________________
conv2d_6 (Conv2D)               (None, 128, 136, 240 147584      activation_5[0][0]               
__________________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, 128, 136, 240 512         conv2d_6[0][0]                   
__________________________________________________________________________________________________
activation_6 (Activation)       (None, 128, 136, 240 0           batch_normalization_6[0][0]      
__________________________________________________________________________________________________
max_pooling2d_3 (MaxPooling2D)  (None, 128, 68, 120) 0           activation_6[0][0]               
__________________________________________________________________________________________________
conv2d_7 (Conv2D)               (None, 256, 68, 120) 295168      max_pooling2d_3[0][0]            
__________________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, 256, 68, 120) 1024        conv2d_7[0][0]                   
__________________________________________________________________________________________________
activation_7 (Activation)       (None, 256, 68, 120) 0           batch_normalization_7[0][0]      
__________________________________________________________________________________________________
conv2d_8 (Conv2D)               (None, 256, 68, 120) 590080      activation_7[0][0]               
__________________________________________________________________________________________________
batch_normalization_8 (BatchNor (None, 256, 68, 120) 1024        conv2d_8[0][0]                   
__________________________________________________________________________________________________
activation_8 (Activation)       (None, 256, 68, 120) 0           batch_normalization_8[0][0]      
__________________________________________________________________________________________________
max_pooling2d_4 (MaxPooling2D)  (None, 256, 34, 60)  0           activation_8[0][0]               
__________________________________________________________________________________________________
conv2d_9 (Conv2D)               (None, 512, 34, 60)  1180160     max_pooling2d_4[0][0]            
__________________________________________________________________________________________________
batch_normalization_9 (BatchNor (None, 512, 34, 60)  2048        conv2d_9[0][0]                   
__________________________________________________________________________________________________
activation_9 (Activation)       (None, 512, 34, 60)  0           batch_normalization_9[0][0]      
__________________________________________________________________________________________________
conv2d_10 (Conv2D)              (None, 512, 34, 60)  2359808     activation_9[0][0]               
__________________________________________________________________________________________________
batch_normalization_10 (BatchNo (None, 512, 34, 60)  2048        conv2d_10[0][0]                  
__________________________________________________________________________________________________
activation_10 (Activation)      (None, 512, 34, 60)  0           batch_normalization_10[0][0]     
__________________________________________________________________________________________________
max_pooling2d_5 (MaxPooling2D)  (None, 512, 17, 30)  0           activation_10[0][0]              
__________________________________________________________________________________________________
conv2d_11 (Conv2D)              (None, 1024, 17, 30) 4719616     max_pooling2d_5[0][0]            
__________________________________________________________________________________________________
batch_normalization_11 (BatchNo (None, 1024, 17, 30) 4096        conv2d_11[0][0]                  
__________________________________________________________________________________________________
activation_11 (Activation)      (None, 1024, 17, 30) 0           batch_normalization_11[0][0]     
__________________________________________________________________________________________________
conv2d_12 (Conv2D)              (None, 1024, 17, 30) 9438208     activation_11[0][0]              
__________________________________________________________________________________________________
batch_normalization_12 (BatchNo (None, 1024, 17, 30) 4096        conv2d_12[0][0]                  
__________________________________________________________________________________________________
activation_12 (Activation)      (None, 1024, 17, 30) 0           batch_normalization_12[0][0]     
__________________________________________________________________________________________________
conv2d_transpose_1 (Conv2DTrans (None, 512, 34, 60)  2097664     activation_12[0][0]              
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 1024, 34, 60) 0           activation_10[0][0]              
                                                                 conv2d_transpose_1[0][0]         
__________________________________________________________________________________________________
batch_normalization_13 (BatchNo (None, 1024, 34, 60) 4096        concatenate_1[0][0]              
__________________________________________________________________________________________________
activation_13 (Activation)      (None, 1024, 34, 60) 0           batch_normalization_13[0][0]     
__________________________________________________________________________________________________
conv2d_13 (Conv2D)              (None, 512, 34, 60)  4719104     activation_13[0][0]              
__________________________________________________________________________________________________
batch_normalization_14 (BatchNo (None, 512, 34, 60)  2048        conv2d_13[0][0]                  
__________________________________________________________________________________________________
activation_14 (Activation)      (None, 512, 34, 60)  0           batch_normalization_14[0][0]     
__________________________________________________________________________________________________
conv2d_14 (Conv2D)              (None, 512, 34, 60)  2359808     activation_14[0][0]              
__________________________________________________________________________________________________
batch_normalization_15 (BatchNo (None, 512, 34, 60)  2048        conv2d_14[0][0]                  
__________________________________________________________________________________________________
activation_15 (Activation)      (None, 512, 34, 60)  0           batch_normalization_15[0][0]     
__________________________________________________________________________________________________
conv2d_transpose_2 (Conv2DTrans (None, 256, 68, 120) 524544      activation_15[0][0]              
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (None, 512, 68, 120) 0           activation_8[0][0]               
                                                                 conv2d_transpose_2[0][0]         
__________________________________________________________________________________________________
batch_normalization_16 (BatchNo (None, 512, 68, 120) 2048        concatenate_2[0][0]              
__________________________________________________________________________________________________
activation_16 (Activation)      (None, 512, 68, 120) 0           batch_normalization_16[0][0]     
__________________________________________________________________________________________________
conv2d_15 (Conv2D)              (None, 256, 68, 120) 1179904     activation_16[0][0]              
__________________________________________________________________________________________________
batch_normalization_17 (BatchNo (None, 256, 68, 120) 1024        conv2d_15[0][0]                  
__________________________________________________________________________________________________
activation_17 (Activation)      (None, 256, 68, 120) 0           batch_normalization_17[0][0]     
__________________________________________________________________________________________________
conv2d_16 (Conv2D)              (None, 256, 68, 120) 590080      activation_17[0][0]              
__________________________________________________________________________________________________
batch_normalization_18 (BatchNo (None, 256, 68, 120) 1024        conv2d_16[0][0]                  
__________________________________________________________________________________________________
activation_18 (Activation)      (None, 256, 68, 120) 0           batch_normalization_18[0][0]     
__________________________________________________________________________________________________
conv2d_transpose_3 (Conv2DTrans (None, 128, 136, 240 131200      activation_18[0][0]              
__________________________________________________________________________________________________
concatenate_3 (Concatenate)     (None, 256, 136, 240 0           activation_6[0][0]               
                                                                 conv2d_transpose_3[0][0]         
__________________________________________________________________________________________________
batch_normalization_19 (BatchNo (None, 256, 136, 240 1024        concatenate_3[0][0]              
__________________________________________________________________________________________________
activation_19 (Activation)      (None, 256, 136, 240 0           batch_normalization_19[0][0]     
__________________________________________________________________________________________________
conv2d_17 (Conv2D)              (None, 128, 136, 240 295040      activation_19[0][0]              
__________________________________________________________________________________________________
batch_normalization_20 (BatchNo (None, 128, 136, 240 512         conv2d_17[0][0]                  
__________________________________________________________________________________________________
activation_20 (Activation)      (None, 128, 136, 240 0           batch_normalization_20[0][0]     
__________________________________________________________________________________________________
conv2d_18 (Conv2D)              (None, 128, 136, 240 147584      activation_20[0][0]              
__________________________________________________________________________________________________
batch_normalization_21 (BatchNo (None, 128, 136, 240 512         conv2d_18[0][0]                  
__________________________________________________________________________________________________
activation_21 (Activation)      (None, 128, 136, 240 0           batch_normalization_21[0][0]     
__________________________________________________________________________________________________
conv2d_transpose_4 (Conv2DTrans (None, 64, 272, 480) 32832       activation_21[0][0]              
__________________________________________________________________________________________________
concatenate_4 (Concatenate)     (None, 128, 272, 480 0           activation_4[0][0]               
                                                                 conv2d_transpose_4[0][0]         
__________________________________________________________________________________________________
batch_normalization_22 (BatchNo (None, 128, 272, 480 512         concatenate_4[0][0]              
__________________________________________________________________________________________________
activation_22 (Activation)      (None, 128, 272, 480 0           batch_normalization_22[0][0]     
__________________________________________________________________________________________________
conv2d_19 (Conv2D)              (None, 64, 272, 480) 73792       activation_22[0][0]              
__________________________________________________________________________________________________
batch_normalization_23 (BatchNo (None, 64, 272, 480) 256         conv2d_19[0][0]                  
__________________________________________________________________________________________________
activation_23 (Activation)      (None, 64, 272, 480) 0           batch_normalization_23[0][0]     
__________________________________________________________________________________________________
conv2d_20 (Conv2D)              (None, 64, 272, 480) 36928       activation_23[0][0]              
__________________________________________________________________________________________________
batch_normalization_24 (BatchNo (None, 64, 272, 480) 256         conv2d_20[0][0]                  
__________________________________________________________________________________________________
activation_24 (Activation)      (None, 64, 272, 480) 0           batch_normalization_24[0][0]     
__________________________________________________________________________________________________
conv2d_transpose_5 (Conv2DTrans (None, 32, 544, 960) 8224        activation_24[0][0]              
__________________________________________________________________________________________________
concatenate_5 (Concatenate)     (None, 64, 544, 960) 0           activation_2[0][0]               
                                                                 conv2d_transpose_5[0][0]         
__________________________________________________________________________________________________
batch_normalization_25 (BatchNo (None, 64, 544, 960) 256         concatenate_5[0][0]              
__________________________________________________________________________________________________
activation_25 (Activation)      (None, 64, 544, 960) 0           batch_normalization_25[0][0]     
__________________________________________________________________________________________________
conv2d_21 (Conv2D)              (None, 32, 544, 960) 18464       activation_25[0][0]              
__________________________________________________________________________________________________
batch_normalization_26 (BatchNo (None, 32, 544, 960) 128         conv2d_21[0][0]                  
__________________________________________________________________________________________________
activation_26 (Activation)      (None, 32, 544, 960) 0           batch_normalization_26[0][0]     
__________________________________________________________________________________________________
conv2d_22 (Conv2D)              (None, 32, 544, 960) 9248        activation_26[0][0]              
__________________________________________________________________________________________________
batch_normalization_27 (BatchNo (None, 32, 544, 960) 128         conv2d_22[0][0]                  
__________________________________________________________________________________________________
activation_27 (Activation)      (None, 32, 544, 960) 0           batch_normalization_27[0][0]     
__________________________________________________________________________________________________
conv2d_23 (Conv2D)              (None, 2, 544, 960)  66          activation_27[0][0]              
==================================================================================================
Total params: 31,126,530
Trainable params: 31,110,530
Non-trainable params: 16,000
__________________________________________________________________________________________________
INFO:tensorflow:Done calling model_fn.
2021-10-26 03:42:29,115 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2021-10-26 03:42:32,152 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2021-10-26 03:42:34,197 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2021-10-26 03:42:34,367 [INFO] tensorflow: Done running local_init_op.
[GPU] Restoring pretrained weights from: /tmp/tmpw0l7bx6m/model.ckpt-4960
2021-10-26 03:42:35,517 [INFO] iva.unet.hooks.pretrained_restore_hook: Pretrained weights loaded with success...

INFO:tensorflow:Saving checkpoints for step-4960.
2021-10-26 03:42:41,262 [INFO] tensorflow: Saving checkpoints for step-4960.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:95: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

2021-10-26 03:42:48,403 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:95: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

Epoch: 5/50:, Cur-Step: 4960, loss(cross_dice_sum): 0.01901, Running average loss:0.01901, Time taken: 0:00:00 ETA: 0:00:00
2021-10-26 03:43:01,564 [INFO] __main__: Epoch: 5/50:, Cur-Step: 4960, loss(cross_dice_sum): 0.01901, Running average loss:0.01901, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 5/50:, Cur-Step: 4970, loss(cross_dice_sum): 0.00755, Running average loss:0.01389, Time taken: 0:00:00 ETA: 0:00:00
2021-10-26 03:43:10,670 [INFO] __main__: Epoch: 5/50:, Cur-Step: 4970, loss(cross_dice_sum): 0.00755, Running average loss:0.01389, Time taken: 0:00:00 ETA: 0:00:00
ERROR:tensorflow:Model diverged with loss = NaN.
2021-10-26 03:43:12,717 [ERROR] tensorflow: Model diverged with loss = NaN.
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 419, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 413, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 314, in run_experiment
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 229, in train_unet
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 105, in run_training_loop
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1426, in run
    run_metadata=run_metadata))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/basic_session_run_hooks.py", line 761, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
2021-10-26 03:43:16,227 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
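
To see how far training got before it diverged, the progress lines in a log like the one above can be scanned for the last finite loss. The sketch below is a minimal example that assumes the line format shown in this log ("Epoch: …, Cur-Step: …, loss(…): …"); the function name `summarize_log` and the returned dictionary keys are hypothetical and not part of TAO.

```python
import math
import re

# Matches progress lines as they appear in the log above, e.g.
# "Epoch: 5/50:, Cur-Step: 4970, loss(cross_dice_sum): 0.00755, Running average loss:0.01389, ..."
PROGRESS = re.compile(
    r"Epoch: (?P<epoch>\d+)/(?P<total>\d+):, Cur-Step: (?P<step>\d+), "
    r"loss\((?P<name>\w+)\): (?P<loss>\S+?),"
)


def summarize_log(lines):
    """Return (last_finite_record, diverged) for an iterable of log lines.

    last_finite_record is a dict describing the last progress line whose
    loss was a finite number, or None if no such line was found.
    diverged is True if the log reports a NaN loss.
    """
    last = None
    diverged = False
    for line in lines:
        if "Model diverged with loss = NaN" in line or "NanLossDuringTrainingError" in line:
            diverged = True
        match = PROGRESS.search(line)
        if not match:
            continue
        try:
            loss = float(match.group("loss"))
        except ValueError:
            continue  # unexpected token in the loss field; skip the line
        if math.isfinite(loss):
            last = {
                "epoch": int(match.group("epoch")),
                "step": int(match.group("step")),
                "loss_name": match.group("name"),
                "loss": loss,
            }
    return last, diverged


# Three lines copied from the log above.
sample = [
    "Epoch: 5/50:, Cur-Step: 4960, loss(cross_dice_sum): 0.01901, Running average loss:0.01901, Time taken: 0:00:00 ETA: 0:00:00",
    "2021-10-26 03:43:10,670 [INFO] __main__: Epoch: 5/50:, Cur-Step: 4970, loss(cross_dice_sum): 0.00755, Running average loss:0.01389, Time taken: 0:00:00 ETA: 0:00:00",
    "ERROR:tensorflow:Model diverged with loss = NaN.",
]
last, diverged = summarize_log(sample)
print(last, diverged)
```

For this log, the last finite loss is at step 4970 in epoch 5, so the divergence happened within roughly 10 steps of resuming from the step-4960 checkpoint.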

For the CCTV data, are the images JPG or PNG?

png data.

In your training spec file:
The train_images use the Cityscapes dataset.
The val_images also use the Cityscapes dataset.
The test images use the CCTV dataset.

Am I correct?

Yes, that’s right. I thought the test images were only for verification, so I used both the CCTV data (only 3 images) and the Cityscapes test set.

The 3 CCTV images seem to have a different data distribution from the Cityscapes dataset.
I suggest collecting more images from the CCTV source and generating masks for them as well.
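
One rough way to check whether two image sets differ in distribution is to compare their intensity histograms. The sketch below is only an illustration under stated assumptions: images are given as flat lists of grayscale values in 0..255 (loading real PNG files would need an image library, which is not shown), and the helper names `intensity_histogram` and `total_variation` are hypothetical. A large distance hints at a domain gap but does not prove one, and a small distance does not rule one out.

```python
def intensity_histogram(pixels, bins=16, max_value=255):
    """Normalized histogram of grayscale intensities in [0, max_value]."""
    counts = [0] * bins
    total = 0
    for p in pixels:
        # Map the intensity to a bin index and clamp it into range.
        idx = int(p) * bins // (max_value + 1)
        idx = max(0, min(idx, bins - 1))
        counts[idx] += 1
        total += 1
    if total == 0:
        raise ValueError("no pixels given")
    return [c / total for c in counts]


def total_variation(h1, h2):
    """Total variation distance between two normalized histograms.

    0.0 means identical histograms, 1.0 means no overlap at all.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


# Toy data standing in for a dark and a bright image (assumed values).
dark = [10, 20, 30, 40]
bright = [200, 210, 220, 230]

h_dark = intensity_histogram(dark)
h_bright = intensity_histogram(bright)
print(total_variation(h_dark, h_dark))    # identical histograms
print(total_variation(h_dark, h_bright))  # disjoint histograms
```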


All right.
The pretrained model performs well, so I expected that fine-tuning on the Cityscapes dataset, which looks similar to CCTV footage, would improve performance a little more.
Is it correct that the training itself proceeds without any problems? How can I fix the NanLossDuringTrainingError during training?

Do you mean the pretrained model runs inference well on the 3 CCTV images?

Yes, the peoplesemsegnet tlt model showed the following inference results. What I’m hoping for here is to get better performance through transfer learning.


[image: inference results]

OK, got it. The peoplesemsegnet tlt model, as described in its model card (NVIDIA NGC), was trained on a proprietary dataset with more than 5 million objects for the person class. The training dataset consists of a mix of camera heights, crowd densities, and fields of view (FOV). Approximately half of the training data consisted of images captured in an indoor office environment.

For your case, it is recommended to collect more CCTV data for training.

For the NaN issue, please try a lower batch size.

To improve accuracy, refer to “Problems encountered in training unet and inference unet - #27 by Morganh”. You can also use:

  • loss: "cross_entropy"
  • weight: 2e-06
  • crop_and_resize_prob: 0.01

I don’t know whether collecting CCTV data will be possible, but if there is any progress, I’ll let you know in a reply. I will also change the loss as you mentioned. Thanks for the reply, Morganh!

As you suggested, I changed the loss, weight, and crop_and_resize_prob, and training then progressed up to 32 epochs. However, the model still performs poorly on CCTV footage. I guess I’ll have to get an additional dataset.
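
Once masks exist for some CCTV images, "performs poorly" can be put into numbers with intersection over union (IoU) for the person class. The sketch below is a minimal illustration, assuming masks are 2-D lists of integer class IDs and that the person class has ID 1 (an assumption; the thread does not state the label mapping). The function name `class_iou` is hypothetical.

```python
def class_iou(pred, truth, class_id=1):
    """IoU of one class between two equally sized 2-D label masks.

    Returns None when the class appears in neither mask, since the
    ratio is undefined in that case.
    """
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred = p == class_id
            in_truth = t == class_id
            if in_pred and in_truth:
                intersection += 1
            if in_pred or in_truth:
                union += 1
    if union == 0:
        return None
    return intersection / union


# Toy 2x3 masks: 0 = background, 1 = person (assumed IDs).
truth = [[0, 1, 1],
         [0, 1, 1]]
pred = [[0, 0, 1],
        [0, 1, 1]]
print(class_iou(pred, truth))  # 3 shared person pixels out of 4 in the union
```

Comparing this value for the pretrained model and the fine-tuned model on the same annotated CCTV images would show whether transfer learning helped or hurt.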
