Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc) : V100
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) : Unet
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
Configuration of the TAO Toolkit Instance
dockers:
nvidia/tao/tao-toolkit-tf:
docker_registry: nvcr.io
docker_tag: v3.21.08-py3
tasks:
1. augment
2. bpnet
3. classification
4. detectnet_v2
5. dssd
6. emotionnet
7. faster_rcnn
8. fpenet
9. gazenet
10. gesturenet
11. heartratenet
12. lprnet
13. mask_rcnn
14. multitask_classification
15. retinanet
16. ssd
17. unet
18. yolo_v3
19. yolo_v4
20. converter
nvidia/tao/tao-toolkit-pyt:
docker_registry: nvcr.io
docker_tag: v3.21.08-py3
tasks:
1. speech_to_text
2. speech_to_text_citrinet
3. text_classification
4. question_answering
5. token_classification
6. intent_slot_classification
7. punctuation_and_capitalization
nvidia/tao/tao-toolkit-lm:
docker_registry: nvcr.io
docker_tag: v3.21.08-py3
tasks:
1. n_gram
format_version: 1.0
toolkit_version: 3.21.08
published_date: 08/17/2021
• Training spec file(If have, please share here)
random_seed: 42
model_config {
  num_layers: 18
  model_input_width: 960
  model_input_height: 544
  model_input_channels: 3
  all_projections: true
  arch: "vanilla_unet_dynamic"
  use_batch_norm: true
  training_precision {
    backend_floatx: FLOAT32
  }
}
training_config {
  batch_size: 3
  epochs: 50
  log_summary_steps: 10
  checkpoint_interval: 1
  loss: "cross_dice_sum"
  learning_rate: 0.0001
  regularizer {
    type: L2
    weight: 2e-5
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
}
dataset_config {
  dataset: "custom"
  augment: False
  augmentation_config {
    spatial_augmentation {
      hflip_probability: 0.5
      vflip_probability: 0.5
      crop_and_resize_prob: 0.5
    }
    brightness_augmentation {
      delta: 0.2
    }
  }
  input_image_type: "color"
  train_images_path: "/workspace/tao-experiments/data/images/train"
  train_masks_path: "/workspace/tao-experiments/data/masks/train"
  val_images_path: "/workspace/tao-experiments/data/images/val"
  val_masks_path: "/workspace/tao-experiments/data/masks/val"
  test_images_path: "/workspace/tao-experiments/data/images/test"
  data_class_config {
    target_classes {
      name: "background"
      mapping_class: "background"
      label_id: 0
    }
    target_classes {
      name: "person"
      mapping_class: "person"
      label_id: 255
    }
  }
}
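The `data_class_config` above declares two label IDs, 0 (background) and 255 (person). A quick sanity check that a mask contains only those values might look like the sketch below. It works on a mask that has already been loaded as rows of integer pixel values; the function name `unexpected_label_values` and the sample data are hypothetical, and how the TAO data loader treats any other pixel value is not verified here.

```python
# Label IDs declared in data_class_config of the spec above.
DECLARED_LABEL_IDS = {0, 255}


def unexpected_label_values(mask_rows, declared=DECLARED_LABEL_IDS):
    """Return the sorted pixel values in the mask that are not declared label IDs."""
    seen = {pixel for row in mask_rows for pixel in row}
    return sorted(seen - set(declared))


# Hypothetical 3x4 mask; a stray value such as 128 could come from
# anti-aliased edges or lossy compression when the mask was saved.
sample_mask = [
    [0, 0, 255, 255],
    [0, 128, 255, 255],
    [0, 0, 0, 255],
]
print(unexpected_label_values(sample_mask))  # [128]
```

An empty list means every pixel in the mask maps to one of the declared classes.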
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
- I want to segment people in CCTV footage.
- So I created mask files by isolating only the people in the Cityscapes dataset.
- I then trained on this dataset, using tao-peoplesemsegnet as the pretrained model.
- The metrics improved (compare epoch 0 with epoch 5). However, the model does not recognize people in the CCTV video. (Is it because the people in the video are too small?)
- When I evaluate the trained model on the Cityscapes test set, the results are good, so I suspect overfitting.
- I want a model that can segment even small people (on the Cityscapes test data, you can see that it does segment small people).
- I do not have a separate CCTV dataset for this task.
- Any suggestions on how to solve this problem?
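To get a rough sense of whether small people are the problem, it may help to estimate how tall a person is in pixels once a frame is resized to the 960x544 model input. The sketch below assumes the frame is stretched to the input size without padding (the training log reports `'resize_padding': False`), that Cityscapes frames are 2048x1024, and that the CCTV frames are 1920x1080, which is a hypothetical resolution. The person heights are made-up examples, not measurements.

```python
MODEL_INPUT_HEIGHT = 544  # model_input_height in the spec


def height_after_resize(person_height_px, frame_height_px,
                        input_height_px=MODEL_INPUT_HEIGHT):
    """Person height in pixels after the frame is resized to the model input height."""
    return person_height_px * input_height_px / frame_height_px


# Cityscapes frame (2048x1024), hypothetical person 200 px tall.
cityscapes_px = height_after_resize(200, 1024)
# Hypothetical 1920x1080 CCTV frame, hypothetical person 40 px tall.
cctv_px = height_after_resize(40, 1080)

print(f"Cityscapes: {cityscapes_px:.2f} px, CCTV: {cctv_px:.2f} px")
```

If the people in the CCTV frames end up much smaller than the people in the training images after resizing, that gap in scale is one plausible explanation to check before concluding that the model is overfitting.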
- Additionally, I get a NanLossDuringTrainingError at epoch 5. The training log is:
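The numbers in the log below can be used to confirm which epoch the run resumed from. This is a small arithmetic sketch using values the log reports (2975 training files, batch size 3, and the checkpoint `model.step-4960.tlt`); the variable names are illustrative.

```python
import math

num_samples = 2975   # "Phase train: Total 2975 files."
batch_size = 3       # batch_size in training_config
resume_step = 4960   # from the restored checkpoint model.step-4960.tlt

# 2975 is not divisible by 3, so the last partial batch adds one extra step.
steps_per_epoch = math.ceil(num_samples / batch_size)
completed_epochs = resume_step // steps_per_epoch

print(steps_per_epoch, completed_epochs)  # 992 5
```

This matches the "Steps per epoch taken: 992" line in the log and shows that the restored checkpoint corresponds to the end of epoch 5.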
For multi-GPU, change --gpus based on your machine.
2021-10-26 03:41:48,941 [INFO] root: Registry: ['nvcr.io']
2021-10-26 03:41:49,170 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/checkpoint_saver_hook.py:21: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/pretrained_restore_hook.py:23: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/pretrained_restore_hook.py:23: The name tf.logging.WARN is deprecated. Please use tf.compat.v1.logging.WARN instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py:405: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
Loading experiment spec at /home/ubuntu/tao/unet/specs/unet_train_resnet_unet_isbi.txt.
2021-10-26 03:42:06,411 [INFO] __main__: Loading experiment spec at /home/ubuntu/tao/unet/specs/unet_train_resnet_unet_isbi.txt.
2021-10-26 03:42:06,414 [INFO] iva.unet.spec_handler.spec_loader: Merging specification from /home/ubuntu/tao/unet/specs/unet_train_resnet_unet_isbi.txt
2021-10-26 03:42:06,417 [INFO] root: Initializing the pre-trained weights from /workspace/tao-experiments/unet/test_unpruned/weights/peoplesemsegnet.tlt
2021-10-26 03:42:06,417 [INFO] iva.unet.model.utilities: Loading weights from /workspace/tao-experiments/unet/test_unpruned/weights/peoplesemsegnet.tlt
2021-10-26 03:42:17,753 [INFO] iva.unet.model.utilities: Label Id 0: Train Id 0
2021-10-26 03:42:17,753 [INFO] iva.unet.model.utilities: Label Id 255: Train Id 1
2021-10-26 03:42:17,755 [INFO] iva.unet.hooks.latest_checkpoint: Getting the latest checkpoint for restoring /workspace/tao-experiments/unet/test_unpruned/model.step-4960.tlt
INFO:tensorflow:Using config: {'_model_dir': '/workspace/tao-experiments/unet/test_unpruned', '_tf_random_seed': None, '_save_summary_steps': 5, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 38
gpu_options {
allow_growth: true
visible_device_list: "0"
force_gpu_compatible: true
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f36f6db77f0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
2021-10-26 03:42:22,103 [INFO] tensorflow: Using config: {'_model_dir': '/workspace/tao-experiments/unet/test_unpruned', '_tf_random_seed': None, '_save_summary_steps': 5, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 38
gpu_options {
allow_growth: true
visible_device_list: "0"
force_gpu_compatible: true
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f36f6db77f0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Phase train: Total 2975 files.
2021-10-26 03:42:22,195 [INFO] iva.unet.model.utilities: The total number of training samples 2975 and the batch size per GPU 3
2021-10-26 03:42:22,195 [INFO] iva.unet.model.utilities: Cannot iterate over exactly 2975 samples with a batch size of 3; each epoch will therefore take one extra step.
2021-10-26 03:42:22,195 [INFO] iva.unet.model.utilities: Steps per epoch taken: 992
Running for 50 Epochs
2021-10-26 03:42:22,195 [INFO] __main__: Running for 50 Epochs
INFO:tensorflow:Create CheckpointSaverHook.
2021-10-26 03:42:22,195 [INFO] tensorflow: Create CheckpointSaverHook.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
2021-10-26 03:42:23,113 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
WARNING:tensorflow:Entity <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,157 [WARNING] tensorflow: Entity <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f368e19b048> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f368e19b048>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,177 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f368e19b048> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f368e19b048>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
2021-10-26 03:42:23,180 [WARNING] tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
/opt/nvidia/third_party/keras/tensorflow_backend.py:356: UserWarning: Creating resources inside a function passed to Dataset.map() is not supported. Create each resource outside the function, and capture it inside the function to use it.
self, _map_func_set_random_wrapper, num_parallel_calls=num_parallel_calls
WARNING:tensorflow:Entity <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,190 [WARNING] tensorflow: Entity <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,200 [WARNING] tensorflow: Entity <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,209 [WARNING] tensorflow: Entity <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/utils/data_loader.py:414: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.
2021-10-26 03:42:23,210 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/utils/data_loader.py:414: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f35554e82f0> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f35554e82f0>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,224 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f35554e82f0> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f35554e82f0>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f35554e8c80> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f35554e8c80>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,310 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f35554e8c80> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f35554e8c80>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,317 [WARNING] tensorflow: Entity <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7f36f6db7978>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f355550c048> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f355550c048>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-10-26 03:42:23,326 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f355550c048> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f355550c048>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO:tensorflow:Calling model_fn.
2021-10-26 03:42:23,347 [INFO] tensorflow: Calling model_fn.
2021-10-26 03:42:23,347 [INFO] iva.unet.utils.model_fn: {'exec_mode': 'train', 'model_dir': '/workspace/tao-experiments/unet/test_unpruned', 'resize_padding': False, 'resize_method': 'BILINEAR', 'log_dir': None, 'batch_size': 3, 'learning_rate': 9.999999747378752e-05, 'crossvalidation_idx': None, 'max_steps': None, 'regularizer_type': 2, 'weight_decay': 1.9999999494757503e-05, 'log_summary_steps': 10, 'warmup_steps': 0, 'augment': False, 'use_amp': False, 'use_trt': False, 'use_xla': False, 'loss': 'cross_dice_sum', 'epochs': 50, 'pretrained_weights_file': None, 'unet_model': <iva.unet.model.vanilla_unet_dynamic.VanillaUnetDynamic object at 0x7f368e193b38>, 'key': 'tlt_encode', 'experiment_spec': random_seed: 42
dataset_config {
dataset: "custom"
input_image_type: "color"
train_images_path: "/workspace/tao-experiments/data/images/train"
train_masks_path: "/workspace/tao-experiments/data/masks/train"
val_images_path: "/workspace/tao-experiments/data/images/val"
val_masks_path: "/workspace/tao-experiments/data/masks/val"
test_images_path: "/workspace/tao-experiments/data/images/test"
data_class_config {
target_classes {
name: "background"
mapping_class: "background"
}
target_classes {
name: "person"
label_id: 255
mapping_class: "person"
}
}
augmentation_config {
spatial_augmentation {
hflip_probability: 0.5
vflip_probability: 0.5
crop_and_resize_prob: 0.5
}
brightness_augmentation {
delta: 0.20000000298023224
}
}
}
model_config {
num_layers: 18
use_batch_norm: true
training_precision {
backend_floatx: FLOAT32
}
arch: "vanilla_unet_dynamic"
all_projections: true
model_input_height: 544
model_input_width: 960
model_input_channels: 3
}
training_config {
batch_size: 3
regularizer {
type: L2
weight: 1.9999999494757503e-05
}
optimizer {
adam {
epsilon: 9.99999993922529e-09
beta1: 0.8999999761581421
beta2: 0.9990000128746033
}
}
checkpoint_interval: 1
log_summary_steps: 10
learning_rate: 9.999999747378752e-05
loss: "cross_dice_sum"
epochs: 50
}
, 'seed': 42, 'benchmark': False, 'temp_dir': '/tmp/tmpedlz_8q1', 'num_classes': 2, 'start_step': 4960, 'checkpoint_interval': 1, 'model_json': None, 'load_graph': False, 'phase': None}
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
2021-10-26 03:42:23,348 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
2021-10-26 03:42:23,349 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.
2021-10-26 03:42:23,375 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.
2021-10-26 03:42:23,381 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.
WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.
2021-10-26 03:42:23,438 [WARNING] tensorflow: From /opt/nvidia/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) (None, 3, 544, 960) 0
__________________________________________________________________________________________________
conv2d_1 (Conv2D) (None, 32, 544, 960) 896 input_1[0][0]
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 32, 544, 960) 128 conv2d_1[0][0]
__________________________________________________________________________________________________
activation_1 (Activation) (None, 32, 544, 960) 0 batch_normalization_1[0][0]
__________________________________________________________________________________________________
conv2d_2 (Conv2D) (None, 32, 544, 960) 9248 activation_1[0][0]
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 32, 544, 960) 128 conv2d_2[0][0]
__________________________________________________________________________________________________
activation_2 (Activation) (None, 32, 544, 960) 0 batch_normalization_2[0][0]
__________________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D) (None, 32, 272, 480) 0 activation_2[0][0]
__________________________________________________________________________________________________
conv2d_3 (Conv2D) (None, 64, 272, 480) 18496 max_pooling2d_1[0][0]
__________________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, 64, 272, 480) 256 conv2d_3[0][0]
__________________________________________________________________________________________________
activation_3 (Activation) (None, 64, 272, 480) 0 batch_normalization_3[0][0]
__________________________________________________________________________________________________
conv2d_4 (Conv2D) (None, 64, 272, 480) 36928 activation_3[0][0]
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 64, 272, 480) 256 conv2d_4[0][0]
__________________________________________________________________________________________________
activation_4 (Activation) (None, 64, 272, 480) 0 batch_normalization_4[0][0]
__________________________________________________________________________________________________
max_pooling2d_2 (MaxPooling2D) (None, 64, 136, 240) 0 activation_4[0][0]
__________________________________________________________________________________________________
conv2d_5 (Conv2D) (None, 128, 136, 240 73856 max_pooling2d_2[0][0]
__________________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, 128, 136, 240 512 conv2d_5[0][0]
__________________________________________________________________________________________________
activation_5 (Activation) (None, 128, 136, 240 0 batch_normalization_5[0][0]
__________________________________________________________________________________________________
conv2d_6 (Conv2D) (None, 128, 136, 240 147584 activation_5[0][0]
__________________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, 128, 136, 240 512 conv2d_6[0][0]
__________________________________________________________________________________________________
activation_6 (Activation) (None, 128, 136, 240 0 batch_normalization_6[0][0]
__________________________________________________________________________________________________
max_pooling2d_3 (MaxPooling2D) (None, 128, 68, 120) 0 activation_6[0][0]
__________________________________________________________________________________________________
conv2d_7 (Conv2D) (None, 256, 68, 120) 295168 max_pooling2d_3[0][0]
__________________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, 256, 68, 120) 1024 conv2d_7[0][0]
__________________________________________________________________________________________________
activation_7 (Activation) (None, 256, 68, 120) 0 batch_normalization_7[0][0]
__________________________________________________________________________________________________
conv2d_8 (Conv2D) (None, 256, 68, 120) 590080 activation_7[0][0]
__________________________________________________________________________________________________
batch_normalization_8 (BatchNor (None, 256, 68, 120) 1024 conv2d_8[0][0]
__________________________________________________________________________________________________
activation_8 (Activation) (None, 256, 68, 120) 0 batch_normalization_8[0][0]
__________________________________________________________________________________________________
max_pooling2d_4 (MaxPooling2D) (None, 256, 34, 60) 0 activation_8[0][0]
__________________________________________________________________________________________________
conv2d_9 (Conv2D) (None, 512, 34, 60) 1180160 max_pooling2d_4[0][0]
__________________________________________________________________________________________________
batch_normalization_9 (BatchNor (None, 512, 34, 60) 2048 conv2d_9[0][0]
__________________________________________________________________________________________________
activation_9 (Activation) (None, 512, 34, 60) 0 batch_normalization_9[0][0]
__________________________________________________________________________________________________
conv2d_10 (Conv2D) (None, 512, 34, 60) 2359808 activation_9[0][0]
__________________________________________________________________________________________________
batch_normalization_10 (BatchNo (None, 512, 34, 60) 2048 conv2d_10[0][0]
__________________________________________________________________________________________________
activation_10 (Activation) (None, 512, 34, 60) 0 batch_normalization_10[0][0]
__________________________________________________________________________________________________
max_pooling2d_5 (MaxPooling2D) (None, 512, 17, 30) 0 activation_10[0][0]
__________________________________________________________________________________________________
conv2d_11 (Conv2D) (None, 1024, 17, 30) 4719616 max_pooling2d_5[0][0]
__________________________________________________________________________________________________
batch_normalization_11 (BatchNo (None, 1024, 17, 30) 4096 conv2d_11[0][0]
__________________________________________________________________________________________________
activation_11 (Activation) (None, 1024, 17, 30) 0 batch_normalization_11[0][0]
__________________________________________________________________________________________________
conv2d_12 (Conv2D) (None, 1024, 17, 30) 9438208 activation_11[0][0]
__________________________________________________________________________________________________
batch_normalization_12 (BatchNo (None, 1024, 17, 30) 4096 conv2d_12[0][0]
__________________________________________________________________________________________________
activation_12 (Activation) (None, 1024, 17, 30) 0 batch_normalization_12[0][0]
__________________________________________________________________________________________________
conv2d_transpose_1 (Conv2DTrans (None, 512, 34, 60) 2097664 activation_12[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate) (None, 1024, 34, 60) 0 activation_10[0][0]
conv2d_transpose_1[0][0]
__________________________________________________________________________________________________
batch_normalization_13 (BatchNo (None, 1024, 34, 60) 4096 concatenate_1[0][0]
__________________________________________________________________________________________________
activation_13 (Activation) (None, 1024, 34, 60) 0 batch_normalization_13[0][0]
__________________________________________________________________________________________________
conv2d_13 (Conv2D) (None, 512, 34, 60) 4719104 activation_13[0][0]
__________________________________________________________________________________________________
batch_normalization_14 (BatchNo (None, 512, 34, 60) 2048 conv2d_13[0][0]
__________________________________________________________________________________________________
activation_14 (Activation) (None, 512, 34, 60) 0 batch_normalization_14[0][0]
__________________________________________________________________________________________________
conv2d_14 (Conv2D) (None, 512, 34, 60) 2359808 activation_14[0][0]
__________________________________________________________________________________________________
batch_normalization_15 (BatchNo (None, 512, 34, 60) 2048 conv2d_14[0][0]
__________________________________________________________________________________________________
activation_15 (Activation) (None, 512, 34, 60) 0 batch_normalization_15[0][0]
__________________________________________________________________________________________________
conv2d_transpose_2 (Conv2DTrans (None, 256, 68, 120) 524544 activation_15[0][0]
__________________________________________________________________________________________________
concatenate_2 (Concatenate) (None, 512, 68, 120) 0 activation_8[0][0]
conv2d_transpose_2[0][0]
__________________________________________________________________________________________________
batch_normalization_16 (BatchNo (None, 512, 68, 120) 2048 concatenate_2[0][0]
__________________________________________________________________________________________________
activation_16 (Activation) (None, 512, 68, 120) 0 batch_normalization_16[0][0]
__________________________________________________________________________________________________
conv2d_15 (Conv2D) (None, 256, 68, 120) 1179904 activation_16[0][0]
__________________________________________________________________________________________________
batch_normalization_17 (BatchNo (None, 256, 68, 120) 1024 conv2d_15[0][0]
__________________________________________________________________________________________________
activation_17 (Activation) (None, 256, 68, 120) 0 batch_normalization_17[0][0]
__________________________________________________________________________________________________
conv2d_16 (Conv2D) (None, 256, 68, 120) 590080 activation_17[0][0]
__________________________________________________________________________________________________
batch_normalization_18 (BatchNo (None, 256, 68, 120) 1024 conv2d_16[0][0]
__________________________________________________________________________________________________
activation_18 (Activation) (None, 256, 68, 120) 0 batch_normalization_18[0][0]
__________________________________________________________________________________________________
conv2d_transpose_3 (Conv2DTrans (None, 128, 136, 240 131200 activation_18[0][0]
__________________________________________________________________________________________________
concatenate_3 (Concatenate) (None, 256, 136, 240 0 activation_6[0][0]
conv2d_transpose_3[0][0]
__________________________________________________________________________________________________
batch_normalization_19 (BatchNo (None, 256, 136, 240 1024 concatenate_3[0][0]
__________________________________________________________________________________________________
activation_19 (Activation) (None, 256, 136, 240 0 batch_normalization_19[0][0]
__________________________________________________________________________________________________
conv2d_17 (Conv2D) (None, 128, 136, 240 295040 activation_19[0][0]
__________________________________________________________________________________________________
batch_normalization_20 (BatchNo (None, 128, 136, 240 512 conv2d_17[0][0]
__________________________________________________________________________________________________
activation_20 (Activation) (None, 128, 136, 240 0 batch_normalization_20[0][0]
__________________________________________________________________________________________________
conv2d_18 (Conv2D) (None, 128, 136, 240 147584 activation_20[0][0]
__________________________________________________________________________________________________
batch_normalization_21 (BatchNo (None, 128, 136, 240 512 conv2d_18[0][0]
__________________________________________________________________________________________________
activation_21 (Activation) (None, 128, 136, 240 0 batch_normalization_21[0][0]
__________________________________________________________________________________________________
conv2d_transpose_4 (Conv2DTrans (None, 64, 272, 480) 32832 activation_21[0][0]
__________________________________________________________________________________________________
concatenate_4 (Concatenate) (None, 128, 272, 480 0 activation_4[0][0]
conv2d_transpose_4[0][0]
__________________________________________________________________________________________________
batch_normalization_22 (BatchNo (None, 128, 272, 480 512 concatenate_4[0][0]
__________________________________________________________________________________________________
activation_22 (Activation) (None, 128, 272, 480 0 batch_normalization_22[0][0]
__________________________________________________________________________________________________
conv2d_19 (Conv2D) (None, 64, 272, 480) 73792 activation_22[0][0]
__________________________________________________________________________________________________
batch_normalization_23 (BatchNo (None, 64, 272, 480) 256 conv2d_19[0][0]
__________________________________________________________________________________________________
activation_23 (Activation) (None, 64, 272, 480) 0 batch_normalization_23[0][0]
__________________________________________________________________________________________________
conv2d_20 (Conv2D) (None, 64, 272, 480) 36928 activation_23[0][0]
__________________________________________________________________________________________________
batch_normalization_24 (BatchNo (None, 64, 272, 480) 256 conv2d_20[0][0]
__________________________________________________________________________________________________
activation_24 (Activation) (None, 64, 272, 480) 0 batch_normalization_24[0][0]
__________________________________________________________________________________________________
conv2d_transpose_5 (Conv2DTrans (None, 32, 544, 960) 8224 activation_24[0][0]
__________________________________________________________________________________________________
concatenate_5 (Concatenate) (None, 64, 544, 960) 0 activation_2[0][0]
conv2d_transpose_5[0][0]
__________________________________________________________________________________________________
batch_normalization_25 (BatchNo (None, 64, 544, 960) 256 concatenate_5[0][0]
__________________________________________________________________________________________________
activation_25 (Activation) (None, 64, 544, 960) 0 batch_normalization_25[0][0]
__________________________________________________________________________________________________
conv2d_21 (Conv2D) (None, 32, 544, 960) 18464 activation_25[0][0]
__________________________________________________________________________________________________
batch_normalization_26 (BatchNo (None, 32, 544, 960) 128 conv2d_21[0][0]
__________________________________________________________________________________________________
activation_26 (Activation) (None, 32, 544, 960) 0 batch_normalization_26[0][0]
__________________________________________________________________________________________________
conv2d_22 (Conv2D) (None, 32, 544, 960) 9248 activation_26[0][0]
__________________________________________________________________________________________________
batch_normalization_27 (BatchNo (None, 32, 544, 960) 128 conv2d_22[0][0]
__________________________________________________________________________________________________
activation_27 (Activation) (None, 32, 544, 960) 0 batch_normalization_27[0][0]
__________________________________________________________________________________________________
conv2d_23 (Conv2D) (None, 2, 544, 960) 66 activation_27[0][0]
==================================================================================================
Total params: 31,126,530
Trainable params: 31,110,530
Non-trainable params: 16,000
__________________________________________________________________________________________________
INFO:tensorflow:Done calling model_fn.
2021-10-26 03:42:29,115 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2021-10-26 03:42:32,152 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2021-10-26 03:42:34,197 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2021-10-26 03:42:34,367 [INFO] tensorflow: Done running local_init_op.
[GPU] Restoring pretrained weights from: /tmp/tmpw0l7bx6m/model.ckpt-4960
2021-10-26 03:42:35,517 [INFO] iva.unet.hooks.pretrained_restore_hook: Pretrained weights loaded with success...
INFO:tensorflow:Saving checkpoints for step-4960.
2021-10-26 03:42:41,262 [INFO] tensorflow: Saving checkpoints for step-4960.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:95: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
2021-10-26 03:42:48,403 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:95: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
Epoch: 5/50:, Cur-Step: 4960, loss(cross_dice_sum): 0.01901, Running average loss:0.01901, Time taken: 0:00:00 ETA: 0:00:00
2021-10-26 03:43:01,564 [INFO] __main__: Epoch: 5/50:, Cur-Step: 4960, loss(cross_dice_sum): 0.01901, Running average loss:0.01901, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 5/50:, Cur-Step: 4970, loss(cross_dice_sum): 0.00755, Running average loss:0.01389, Time taken: 0:00:00 ETA: 0:00:00
2021-10-26 03:43:10,670 [INFO] __main__: Epoch: 5/50:, Cur-Step: 4970, loss(cross_dice_sum): 0.00755, Running average loss:0.01389, Time taken: 0:00:00 ETA: 0:00:00
ERROR:tensorflow:Model diverged with loss = NaN.
2021-10-26 03:43:12,717 [ERROR] tensorflow: Model diverged with loss = NaN.
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 419, in <module>
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 413, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 314, in run_experiment
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 229, in train_unet
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 105, in run_training_loop
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1426, in run
run_metadata=run_metadata))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/basic_session_run_hooks.py", line 761, in after_run
raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
2021-10-26 03:43:16,227 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
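
For anyone reading a long log like the one above, here is a minimal sketch of pulling out the per-step loss values and the first NaN report, so it is easier to see how many steps ran after the checkpoint was restored before the loss diverged. It assumes only the `Epoch: ..., Cur-Step: ..., loss(...)` line format and the `Model diverged with loss = NaN` message shown in this log; the function name `summarize` and the `sample` string are made up for illustration and are not part of TAO or TensorFlow.

```python
import re

# Matches the progress lines printed by the training loop in the log above.
# The loss value is everything up to the next comma or whitespace.
PROGRESS = re.compile(
    r"Epoch: (?P<epoch>\d+)/(?P<total>\d+):, Cur-Step: (?P<step>\d+), "
    r"loss\((?P<name>\w+)\): (?P<loss>[^,\s]+)"
)


def summarize(log_text):
    """Return (list of (epoch, step, loss), first NaN report line or None)."""
    records = []
    seen_steps = set()
    nan_line = None
    for line in log_text.splitlines():
        match = PROGRESS.search(line)
        if match:
            step = int(match.group("step"))
            # Each progress line appears twice (plain and timestamped),
            # so keep only the first occurrence of every step.
            if step not in seen_steps:
                seen_steps.add(step)
                records.append(
                    (int(match.group("epoch")), step, float(match.group("loss")))
                )
        elif nan_line is None and "loss = NaN" in line:
            nan_line = line.strip()
    return records, nan_line


# Lines copied from the log above, used here as sample input.
sample = """\
Epoch: 5/50:, Cur-Step: 4960, loss(cross_dice_sum): 0.01901, Running average loss:0.01901, Time taken: 0:00:00 ETA: 0:00:00
2021-10-26 03:43:01,564 [INFO] __main__: Epoch: 5/50:, Cur-Step: 4960, loss(cross_dice_sum): 0.01901, Running average loss:0.01901, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 5/50:, Cur-Step: 4970, loss(cross_dice_sum): 0.00755, Running average loss:0.01389, Time taken: 0:00:00 ETA: 0:00:00
2021-10-26 03:43:10,670 [INFO] __main__: Epoch: 5/50:, Cur-Step: 4970, loss(cross_dice_sum): 0.00755, Running average loss:0.01389, Time taken: 0:00:00 ETA: 0:00:00
ERROR:tensorflow:Model diverged with loss = NaN.
2021-10-26 03:43:12,717 [ERROR] tensorflow: Model diverged with loss = NaN.
"""

records, nan_line = summarize(sample)
for epoch, step, loss in records:
    print(f"epoch {epoch} step {step} loss {loss}")
print("first NaN report:", nan_line)
```

In this log the loss is still finite at steps 4960 and 4970, and the NaN is reported right after that, so the divergence happens within the first few steps after resuming from `model.ckpt-4960`.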