Error while using TLT pretrained model tlt_semantic_segmentation:resnet101

Hi,

I am trying to run TLT training with the TLT Pretrained Semantic Segmentation model. When I execute the tlt unet train command, it generates the following error message after "Done calling model_fn":
----- log start -----
2021-06-25 14:35:26,590 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2021-06-25 14:35:35,694 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2021-06-25 14:35:43,365 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2021-06-25 14:35:44,216 [INFO] tensorflow: Done running local_init_op.
[GPU] Restoring pretrained weights from: /tmp/tmp9jz18vv5/model.ckpt-1
2021-06-25 14:35:48,403 [INFO] iva.unet.hooks.pretrained_restore_hook: Pretrained weights loaded with success…

INFO:tensorflow:Saving checkpoints for step-0.
2021-06-25 14:36:10,302 [INFO] tensorflow: Saving checkpoints for step-0.
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:92: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

2021-06-25 14:36:30,596 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:92: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: ValueError: generator yielded an element of shape (547, 626, 3) where an element of shape (1010, 1220, 3) was expected.
Traceback (most recent call last):

File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py”, line 235, in call
ret = func(*args)

File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py”, line 674, in generator_py_func
“of shape %s was expected.” % (ret_array.shape, expected_shape))

ValueError: generator yielded an element of shape (547, 626, 3) where an element of shape (1010, 1220, 3) was expected.

 [[{{node PyFunc}}]]
 [[IteratorGetNext]]
 [[IteratorGetNext/_24519]]

(1) Invalid argument: ValueError: generator yielded an element of shape (547, 626, 3) where an element of shape (1010, 1220, 3) was expected.
Traceback (most recent call last):

File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py”, line 235, in call
ret = func(*args)

File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py”, line 674, in generator_py_func
“of shape %s was expected.” % (ret_array.shape, expected_shape))

ValueError: generator yielded an element of shape (547, 626, 3) where an element of shape (1010, 1220, 3) was expected.

 [[{{node PyFunc}}]]
 [[IteratorGetNext]]

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 403, in
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 397, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 298, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 217, in train_unet
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 104, in run_training_loop
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1195, in _train_model_default
saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 754, in run
run_metadata=run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1259, in run
run_metadata=run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1360, in run
raise six.reraise(*original_exc_info)
File “/usr/local/lib/python3.6/dist-packages/six.py”, line 696, in reraise
raise value
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1345, in run
return self._sess.run(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1418, in run
run_metadata=run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1176, in run
return self._sess.run(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: ValueError: generator yielded an element of shape (547, 626, 3) where an element of shape (1010, 1220, 3) was expected.
Traceback (most recent call last):

File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py”, line 235, in call
ret = func(*args)

File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py”, line 674, in generator_py_func
“of shape %s was expected.” % (ret_array.shape, expected_shape))

ValueError: generator yielded an element of shape (547, 626, 3) where an element of shape (1010, 1220, 3) was expected.

 [[{{node PyFunc}}]]
 [[IteratorGetNext]]
 [[IteratorGetNext/_24519]]

(1) Invalid argument: ValueError: generator yielded an element of shape (547, 626, 3) where an element of shape (1010, 1220, 3) was expected.
Traceback (most recent call last):

File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py”, line 235, in call
ret = func(*args)

File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py”, line 674, in generator_py_func
“of shape %s was expected.” % (ret_array.shape, expected_shape))

ValueError: generator yielded an element of shape (547, 626, 3) where an element of shape (1010, 1220, 3) was expected.

 [[{{node PyFunc}}]]
 [[IteratorGetNext]]

0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
File “/usr/local/bin/unet”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/entrypoint/unet.py”, line 12, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py”, line 296, in launch_job
AssertionError: Process run failed.
2021-06-25 14:36:44,014 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

---- end ----
Could you please help me figure out how to fix this error?

Please provide the following information when requesting support.

• Hardware (V100)
• dockers: [‘nvcr.io/nvidia/tlt-streamanalytics’, ‘nvcr.io/nvidia/tlt-pytorch’]
format_version: 1.0
tlt_version: 3.0
published_date: 02/02/2021
• Training spec file (if you have one, please share it here)
random_seed: 42
model_config {
  model_input_width: 320
  model_input_height: 320
  model_input_channels: 3
  num_layers: 101
  all_projections: true
  arch: "resnet"
  use_batch_norm: true
  training_precision {
    backend_floatx: FLOAT32
  }
}

training_config {
  batch_size: 4
  epochs: 10
  log_summary_steps: 10
  checkpoint_interval: 1
  loss: "cross_dice_sum"
  learning_rate: 0.0001
  regularizer {
    type: L2
    weight: 3.00000002618e-09
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
}

dataset_config {
  dataset: "custom"
  augment: True
  input_image_type: "color"
  train_images_path: "/workspace/tlt-experiments/Kvasir-SEG_TLT/images/train"
  train_masks_path: "/workspace/tlt-experiments/Kvasir-SEG_TLT/masks/train"
  val_images_path: "/workspace/tlt-experiments/Kvasir-SEG_TLT/images/val"
  val_masks_path: "/workspace/tlt-experiments/Kvasir-SEG_TLT/masks/val"
  test_images_path: "/workspace/tlt-experiments/Kvasir-SEG_TLT/images/test"
  data_class_config {
    target_classes {
      name: "foreground"
      mapping_class: "foreground"
      label_id: 0
    }
    target_classes {
      name: "background"
      mapping_class: "background"
      label_id: 1
    }
  }
}

• How to reproduce the issue?
Run the following command:
!tlt unet train --gpus=1 --gpu_index=0 \
  -e /workspace/tlt-experiments/examples/unet/specs/unet_train_resnet_unet_Kvasir_SEG.txt \
  -r /workspace/tlt-experiments/unetKvasir_SEG_experiment_unpruned \
  -m /workspace/tlt-experiments/unet/pretrained_resnet101/tlt_semantic_segmentation_vresnet101/resnet_101.hdf5 \
  -n model_Kvasir_SEG \
  -k nvidia_tlt

According to the "published_date" above, I am afraid you are using the 3.0-dp version. Could you double-check whether it is the 3.0-dp-py3 version or the 3.0-py3 version? For the 3.0-dp-py3 version, you need to resize the images offline; for the 3.0-py3 version, resizing is not needed.
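If you stay on 3.0-dp-py3, a minimal offline-resize sketch is shown below. It assumes Pillow is installed; the folder names and the target size are placeholders you would adapt to your own dataset layout and to the model_input_width/height in your spec. Note that masks are resized with nearest-neighbor interpolation so label values are not blended.

import os
from PIL import Image

# Example target size (width, height); match it to your model input settings.
TARGET_SIZE = (320, 320)

def resize_folder(src_dir, dst_dir, is_mask=False):
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        img = Image.open(os.path.join(src_dir, name))
        # Nearest-neighbor for masks keeps label ids intact; bilinear for images.
        resample = Image.NEAREST if is_mask else Image.BILINEAR
        img.resize(TARGET_SIZE, resample).save(os.path.join(dst_dir, name))

# Placeholder folder names; repeat for val/test splits as needed.
resize_folder("images/train", "images_resized/train", is_mask=False)
resize_folder("masks/train", "masks_resized/train", is_mask=True)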

Hi Morganh,

Thanks for the reply. Yes, I am using 3.0-dp-py3.
Could you share how to switch to 3.0-py3?

Thanks,

See NVIDIA TAO Documentation

The nvidia-tlt package is hosted on nvidia-pyindex, which has to be installed as a prerequisite to installing nvidia-tlt.

If you have installed an older version of the nvidia-tlt launcher, you can upgrade to the latest version by running the following command.

pip3 install --upgrade nvidia-tlt
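After upgrading, you can confirm which version the launcher now points to; the launcher's info subcommand prints the "Configuration of the TLT Instance" block (the same output quoted later in this thread):

tlt info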

Hi Morgan,

Thanks for the help. After upgrading TLT, the training still fails, but the error message is different:

---- log start ----
INFO:tensorflow:Done calling model_fn.
2021-06-28 03:25:07,284 [INFO] tensorflow: Done calling model_fn.
Traceback (most recent call last):
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 419, in
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 413, in main
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 314, in run_experiment
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 229, in train_unet
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 105, in run_training_loop
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1195, in _train_model_default
saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1490, in _train_with_estimator_spec
log_step_count_steps=log_step_count_steps) as mon_sess:
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 584, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1014, in init
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 713, in init
h.begin()
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/pretrained_restore_hook.py”, line 205, in begin
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/pretrained_restore_hook.py”, line 113, in assign_from_checkpoint
ValueError: Total size of new array must be unchanged for conv2d_4/kernel lh_shape: [(3, 3, 67, 64)], rh_shape: [(3, 3, 64, 64)]
2021-06-28 03:25:14,310 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

---- end ----

The TLT version is now:
Configuration of the TLT Instance
dockers: [‘nvidia/tlt-streamanalytics’, ‘nvidia/tlt-pytorch’]
format_version: 1.0
tlt_version: 3.0
published_date: 04/16/2021

Do you have any suggestions?

Thanks,
Ted

Can you try with a new result folder?
For example,
-r /workspace/tlt-experiments/unetKvasir_SEG_experiment_unpruned_new
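i.e. your earlier train command rerun with only the -r argument changed (all other arguments as in your original command):

!tlt unet train --gpus=1 --gpu_index=0 \
  -e /workspace/tlt-experiments/examples/unet/specs/unet_train_resnet_unet_Kvasir_SEG.txt \
  -r /workspace/tlt-experiments/unetKvasir_SEG_experiment_unpruned_new \
  -m /workspace/tlt-experiments/unet/pretrained_resnet101/tlt_semantic_segmentation_vresnet101/resnet_101.hdf5 \
  -n model_Kvasir_SEG \
  -k nvidia_tlt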


Hi Morgan,

Thanks, it works now.

Thanks,
Ted
