Error while using TLT pretrained model tlt_semantic_segmentation:resnet101

Hi,

I am trying to run TLT training with the TLT Pretrained Semantic Segmentation model. When I execute the tlt unet train command, it generates the following error message after "Done calling model_fn":
----- log start -----
2021-06-25 14:35:26,590 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2021-06-25 14:35:35,694 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2021-06-25 14:35:43,365 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2021-06-25 14:35:44,216 [INFO] tensorflow: Done running local_init_op.
[GPU] Restoring pretrained weights from: /tmp/tmp9jz18vv5/model.ckpt-1
2021-06-25 14:35:48,403 [INFO] iva.unet.hooks.pretrained_restore_hook: Pretrained weights loaded with success…

INFO:tensorflow:Saving checkpoints for step-0.
2021-06-25 14:36:10,302 [INFO] tensorflow: Saving checkpoints for step-0.
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:92: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

2021-06-25 14:36:30,596 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:92: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: ValueError: generator yielded an element of shape (547, 626, 3) where an element of shape (1010, 1220, 3) was expected.
Traceback (most recent call last):

File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py”, line 235, in call
ret = func(*args)

File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py”, line 674, in generator_py_func
“of shape %s was expected.” % (ret_array.shape, expected_shape))

ValueError: generator yielded an element of shape (547, 626, 3) where an element of shape (1010, 1220, 3) was expected.

 [[{{node PyFunc}}]]
 [[IteratorGetNext]]
 [[IteratorGetNext/_24519]]

(1) Invalid argument: ValueError: generator yielded an element of shape (547, 626, 3) where an element of shape (1010, 1220, 3) was expected.
Traceback (most recent call last):

File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py”, line 235, in call
ret = func(*args)

File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py”, line 674, in generator_py_func
“of shape %s was expected.” % (ret_array.shape, expected_shape))

ValueError: generator yielded an element of shape (547, 626, 3) where an element of shape (1010, 1220, 3) was expected.

 [[{{node PyFunc}}]]
 [[IteratorGetNext]]

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 403, in
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 397, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 298, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 217, in train_unet
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 104, in run_training_loop
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1195, in _train_model_default
saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 754, in run
run_metadata=run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1259, in run
run_metadata=run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1360, in run
raise six.reraise(*original_exc_info)
File “/usr/local/lib/python3.6/dist-packages/six.py”, line 696, in reraise
raise value
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1345, in run
return self._sess.run(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1418, in run
run_metadata=run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1176, in run
return self._sess.run(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: ValueError: generator yielded an element of shape (547, 626, 3) where an element of shape (1010, 1220, 3) was expected.
Traceback (most recent call last):

File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py”, line 235, in call
ret = func(*args)

File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py”, line 674, in generator_py_func
“of shape %s was expected.” % (ret_array.shape, expected_shape))

ValueError: generator yielded an element of shape (547, 626, 3) where an element of shape (1010, 1220, 3) was expected.

 [[{{node PyFunc}}]]
 [[IteratorGetNext]]
 [[IteratorGetNext/_24519]]

(1) Invalid argument: ValueError: generator yielded an element of shape (547, 626, 3) where an element of shape (1010, 1220, 3) was expected.
Traceback (most recent call last):

File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py”, line 235, in call
ret = func(*args)

File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py”, line 674, in generator_py_func
“of shape %s was expected.” % (ret_array.shape, expected_shape))

ValueError: generator yielded an element of shape (547, 626, 3) where an element of shape (1010, 1220, 3) was expected.

 [[{{node PyFunc}}]]
 [[IteratorGetNext]]

0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
File “/usr/local/bin/unet”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/entrypoint/unet.py”, line 12, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py”, line 296, in launch_job
AssertionError: Process run failed.
2021-06-25 14:36:44,014 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

---- end ----
Could you please help me figure out how to fix this error?

Please provide the following information when requesting support.

• Hardware (V100)
• dockers: [‘nvcr.io/nvidia/tlt-streamanalytics’, ‘nvcr.io/nvidia/tlt-pytorch’]
format_version: 1.0
tlt_version: 3.0
published_date: 02/02/2021
• Training spec file (if you have one, please share it here)
random_seed: 42
model_config {
  model_input_width: 320
  model_input_height: 320
  model_input_channels: 3
  num_layers: 101
  all_projections: true
  arch: "resnet"
  use_batch_norm: true
  training_precision {
    backend_floatx: FLOAT32
  }
}

training_config {
  batch_size: 4
  epochs: 10
  log_summary_steps: 10
  checkpoint_interval: 1
  loss: "cross_dice_sum"
  learning_rate: 0.0001
  regularizer {
    type: L2
    weight: 3.00000002618e-09
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
}

dataset_config {
  dataset: "custom"
  augment: True
  input_image_type: "color"
  train_images_path: "/workspace/tlt-experiments/Kvasir-SEG_TLT/images/train"
  train_masks_path: "/workspace/tlt-experiments/Kvasir-SEG_TLT/masks/train"
  val_images_path: "/workspace/tlt-experiments/Kvasir-SEG_TLT/images/val"
  val_masks_path: "/workspace/tlt-experiments/Kvasir-SEG_TLT/masks/val"
  test_images_path: "/workspace/tlt-experiments/Kvasir-SEG_TLT/images/test"
  data_class_config {
    target_classes {
      name: "foreground"
      mapping_class: "foreground"
      label_id: 0
    }
    target_classes {
      name: "background"
      mapping_class: "background"
      label_id: 1
    }
  }
}

• How to reproduce the issue?
Run the following command:
!tlt unet train --gpus=1 --gpu_index=0 \
  -e /workspace/tlt-experiments/examples/unet/specs/unet_train_resnet_unet_Kvasir_SEG.txt \
  -r /workspace/tlt-experiments/unetKvasir_SEG_experiment_unpruned \
  -m /workspace/tlt-experiments/unet/pretrained_resnet101/tlt_semantic_segmentation_vresnet101/resnet_101.hdf5 \
  -n model_Kvasir_SEG \
  -k nvidia_tlt

According to the "published_date" above, I am afraid you are using the 3.0-dp version. Could you double-check whether it is the 3.0-dp-py3 version or the 3.0-py3 version? For the 3.0-dp-py3 version, you need to resize the images offline; for the 3.0-py3 version, resizing is not needed.
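If you stay on 3.0-dp-py3, a minimal offline-resize sketch is shown below. It assumes Pillow is installed; the folder names and the target size are placeholders you would adapt to your own dataset layout and to the model_input_width/height in your spec. Note that masks are resized with nearest-neighbor interpolation so label values are not blended.

import os
from PIL import Image

# Example target size (width, height); match it to your model input settings.
TARGET_SIZE = (320, 320)

def resize_folder(src_dir, dst_dir, is_mask=False):
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        img = Image.open(os.path.join(src_dir, name))
        # Nearest-neighbor for masks keeps label ids intact; bilinear for images.
        resample = Image.NEAREST if is_mask else Image.BILINEAR
        img.resize(TARGET_SIZE, resample).save(os.path.join(dst_dir, name))

# Placeholder folder names; repeat for val/test splits as needed.
resize_folder("images/train", "images_resized/train", is_mask=False)
resize_folder("masks/train", "masks_resized/train", is_mask=True)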

Hi Morganh,

Thanks for the reply. Yes, I am using 3.0-dp-py3.
Could you share how to switch to 3.0-py3?

Thanks,

See NVIDIA TAO Documentation

The nvidia-tlt package is hosted on nvidia-pyindex, which has to be installed as a prerequisite to installing nvidia-tlt.

If you have installed an older version of the nvidia-tlt launcher, you can upgrade to the latest version by running the following command.

pip3 install --upgrade nvidia-tlt
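After upgrading, you can confirm which version the launcher now points to; the launcher's info subcommand prints the "Configuration of the TLT Instance" block (the same output quoted later in this thread):

tlt info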

Hi Morgan,

Thanks for the help. After upgrading TLT, the training still fails, but the error message is different:

---- log start ----
INFO:tensorflow:Done calling model_fn.
2021-06-28 03:25:07,284 [INFO] tensorflow: Done calling model_fn.
Traceback (most recent call last):
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 419, in
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 413, in main
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 314, in run_experiment
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 229, in train_unet
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py”, line 105, in run_training_loop
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1195, in _train_model_default
saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1490, in _train_with_estimator_spec
log_step_count_steps=log_step_count_steps) as mon_sess:
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 584, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1014, in init
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 713, in init
h.begin()
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/pretrained_restore_hook.py”, line 205, in begin
File “/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/pretrained_restore_hook.py”, line 113, in assign_from_checkpoint
ValueError: Total size of new array must be unchanged for conv2d_4/kernel lh_shape: [(3, 3, 67, 64)], rh_shape: [(3, 3, 64, 64)]
2021-06-28 03:25:14,310 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

---- end ----

The TLT version is now:
Configuration of the TLT Instance
dockers: [‘nvidia/tlt-streamanalytics’, ‘nvidia/tlt-pytorch’]
format_version: 1.0
tlt_version: 3.0
published_date: 04/16/2021

Do you have any suggestions?

Thanks,
Ted

Can you try with a new result folder?
For example,
-r /workspace/tlt-experiments/unetKvasir_SEG_experiment_unpruned_new
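i.e. your earlier train command rerun with only the -r argument changed (all other arguments as in your original command):

!tlt unet train --gpus=1 --gpu_index=0 \
  -e /workspace/tlt-experiments/examples/unet/specs/unet_train_resnet_unet_Kvasir_SEG.txt \
  -r /workspace/tlt-experiments/unetKvasir_SEG_experiment_unpruned_new \
  -m /workspace/tlt-experiments/unet/pretrained_resnet101/tlt_semantic_segmentation_vresnet101/resnet_101.hdf5 \
  -n model_Kvasir_SEG \
  -k nvidia_tlt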


Hi Morgan,

Thanks, it works now.

Thanks,
Ted
