TAO Toolkit UNet training stops when saving checkpoints

Hi NVIDIA Devs,

I am trying to train a UNet with the TAO Toolkit in WSL2. Training mostly works, but every couple of epochs it crashes while saving checkpoints.
Error message:
INFO:tensorflow:Saving checkpoints for step-57304.
2024-07-30 13:15:02,828 [TAO Toolkit] [INFO] tensorflow 76: Saving checkpoints for step-57304.
2024-07-30 13:17:55,288 [TAO Toolkit] [INFO] root 2102: Dst tensor is not initialized.
[[node block_3a_conv_shortcut/kernel/Adam (defined at /tensorflow_core/python/framework/ops.py:1748) ]]

I can continue training from the last checkpoint, but it is a pain to constantly watch the training and restart it every couple of epochs.
I searched around and only found threads saying that there is not enough GPU or CPU RAM, but I have plenty (see hardware below). I also reduced batch_size to 1 and the error still occurs.
I am tracking GPU memory with nvidia-smi --query-gpu=memory.used,memory.total --format=csv -i 0 -l 1
memory.used: 5070 MiB
memory.total: 16384 MiB

And CPU RAM with free -m -s 1
total: 54223
used: 3017
free: 48336
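
To also catch a possible spike right when a checkpoint is written, both readings can be logged to files in the background while training runs (rough sketch, log file names are arbitrary):

# log GPU memory once per second with timestamps
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -i 0 -l 1 > gpu_mem.log &
# log CPU memory once per second
free -m -s 1 > cpu_mem.log &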

So I don't see how a lack of memory could be causing the error, if memory is even the problem.

Thank you in advance

More Information:
• Hardware: RTX A4500 Mobile, 16 GB GPU RAM; 64 GB CPU RAM
• Network Type: UNET
• OS: Windows 10 Enterprise → WSL2 Ubuntu 22.04, Docker v24.0.7

• TAO Info:
Configuration of the TAO Toolkit Instance
task_group: ['model', 'dataset', 'deploy']
format_version: 3.0
toolkit_version: 5.3.0
published_date: 03/14/2024

• Training Data: 6000 images, ~ 1GB of data in total

• .tao_mounts.json
{
    "Mounts": [
        {
            "source": "/mnt/c/TAO-Toolkit",
            "destination": "/workspace"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        },
        "user": "1000:1000",
        "ports": {
            "8888": 8888
        }
    }
}

• Training spec file:

random_seed: 42
model_config {
  model_input_width: 400
  model_input_height: 224
  model_input_channels: 3
  num_layers: 18
  all_projections: True
  arch: "resnet"
  use_batch_norm: False
  training_precision {
    backend_floatx: FLOAT32
  }
}

training_config {
  batch_size: 1
  epochs: 20
  log_summary_steps: 10
  checkpoint_interval: 1
  loss: "cross_entropy"
  learning_rate: 0.0001
  regularizer {
    type: L2
    weight: 2e-5
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
  visualizer {
    enabled: true
  }
}

dataset_config {
  dataset: "custom"
  augment: False
  augmentation_config {
    spatial_augmentation {
      hflip_probability: 0.5
      vflip_probability: 0.0
      crop_and_resize_prob: 0.5
    }
    brightness_augmentation {
      delta: 0.2
    }
  }

  input_image_type: "color"
  train_images_path: "/workspace/unet/dataset/train/img"
  train_masks_path: "/workspace/unet/dataset/train/labels"

  val_images_path: "/workspace/unet/dataset/validate/img"
  val_masks_path: "/workspace/unet/dataset/validate/labels"

  test_images_path: "/workspace/unet/dataset/test"

  data_class_config {
    target_classes {
      name: "foreground"
      mapping_class: "foreground"
      label_id: 0
    }
    target_classes {
      name: "background"
      mapping_class: "background"
      label_id: 255
    }
  }
}

Does this mean the intermediate checkpoints fail to save, but the last checkpoint can be saved? Can you share the full log?

Hi Morganh,

some of the intermediate checkpoints cannot be saved. Whenever this happens, the container stops. With 20 epochs, epoch 12 always crashes, so I have never reached the last one. With 10 epochs I can reach the last checkpoint by restarting training a couple of times from previous checkpoints.

Here is the complete log of a crash after the first epoch:


Epoch: 0.996353/20:, Cur-Step: 6010, loss(cross_entropy): 0.65571, Running average loss:0.17416, Time taken: 0 ETA: 0.0
2024-07-31 06:53:10,233 [TAO Toolkit] [INFO] main 161: Epoch: 0.996353/20:, Cur-Step: 6010, loss(cross_entropy): 0.65571, Running average loss:0.17416, Time taken: 0 ETA: 0.0
Epoch: 0.998011/20:, Cur-Step: 6020, loss(cross_entropy): 0.05284, Running average loss:0.17403, Time taken: 0 ETA: 0.0
2024-07-31 06:53:11,344 [TAO Toolkit] [INFO] main 161: Epoch: 0.998011/20:, Cur-Step: 6020, loss(cross_entropy): 0.05284, Running average loss:0.17403, Time taken: 0 ETA: 0.0
Epoch: 0.999668/20:, Cur-Step: 6030, loss(cross_entropy): 0.07221, Running average loss:0.17392, Time taken: 0 ETA: 0.0
2024-07-31 06:53:13,135 [TAO Toolkit] [INFO] main 161: Epoch: 0.999668/20:, Cur-Step: 6030, loss(cross_entropy): 0.07221, Running average loss:0.17392, Time taken: 0 ETA: 0.0
INFO:tensorflow:Saving checkpoints for step-6032.
2024-07-31 06:53:13,423 [TAO Toolkit] [INFO] tensorflow 76: Saving checkpoints for step-6032.
2024-07-31 06:54:18,771 [TAO Toolkit] [INFO] root 2102: Dst tensor is not initialized.
[[node conv2d_3/kernel/Adam (defined at /tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for ‘conv2d_3/kernel/Adam’:
File “/nvidia_tao_tf1/cv/unet/scripts/train.py”, line 578, in
main()
File “/nvidia_tao_tf1/cv/unet/scripts/train.py”, line 551, in main
run_experiment(config_path=args.experiment_spec_file,
File “/nvidia_tao_tf1/cv/unet/scripts/train.py”, line 423, in run_experiment
train_unet(results_dir, experiment_spec, ptm, model_file,
File “/nvidia_tao_tf1/cv/unet/scripts/train.py”, line 298, in train_unet
run_training_loop(estimator, dataset, params, unet_model,
File “/nvidia_tao_tf1/cv/unet/scripts/train.py”, line 131, in run_training_loop
estimator.train(
File “/tensorflow_estimator/python/estimator/estimator.py”, line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/tensorflow_estimator/python/estimator/estimator.py”, line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/tensorflow_estimator/python/estimator/estimator.py”, line 1190, in _train_model_default
estimator_spec = self._call_model_fn(
File “/tensorflow_estimator/python/estimator/estimator.py”, line 1149, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File “/nvidia_tao_tf1/cv/unet/utils/model_fn.py”, line 356, in unet_fn
train_op = opt.minimize(total_loss, gate_gradients=gate_gradients,
File “/tensorflow_core/python/training/optimizer.py”, line 428, in minimize
return self.apply_gradients(grads_and_vars, global_step=global_step,
File “/tensorflow_core/python/training/optimizer.py”, line 687, in apply_gradients
maybe_apply_op = smart_cond.smart_cond(should_apply_grads, apply_fn,
File “/tensorflow_core/python/framework/smart_cond.py”, line 58, in smart_cond
return control_flow_ops.cond(pred, true_fn=true_fn, false_fn=false_fn,
File “/tensorflow_core/python/util/deprecation.py”, line 513, in new_func
return func(*args, **kwargs)
File “/tensorflow_core/python/ops/control_flow_ops.py”, line 1224, in cond
orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
File “/tensorflow_core/python/ops/control_flow_ops.py”, line 1061, in BuildCondBranch
original_result = fn()
File “/tensorflow_core/python/training/optimizer.py”, line 640, in apply_fn
self._create_slots(var_list)
File “/tensorflow_core/python/training/adam.py”, line 131, in _create_slots
self._zeros_slot(v, “m”, self._name)
File “/tensorflow_core/python/training/optimizer.py”, line 1224, in _zeros_slot
new_slot_variable = slot_creator.create_zeros_slot(var, op_name)
File “/tensorflow_core/python/training/slot_creator.py”, line 188, in create_zeros_slot
return create_slot_with_initializer(
File “/tensorflow_core/python/training/slot_creator.py”, line 163, in create_slot_with_initializer
return _create_slot_var(primary, initializer, “”, validate_shape, shape,
File “/tensorflow_core/python/training/slot_creator.py”, line 67, in _create_slot_var
slot = variable_scope.get_variable(
File “/tensorflow_core/python/ops/variable_scope.py”, line 1484, in get_variable
return get_variable_scope().get_variable(
File “/tensorflow_core/python/ops/variable_scope.py”, line 1227, in get_variable
return var_store.get_variable(
File “/tensorflow_core/python/ops/variable_scope.py”, line 552, in get_variable
return _true_getter(
File “/tensorflow_core/python/ops/variable_scope.py”, line 505, in _true_getter
return self._get_single_variable(
File “/tensorflow_core/python/ops/variable_scope.py”, line 922, in _get_single_variable
v = variables.VariableV1(
File “/tensorflow_core/python/ops/variables.py”, line 258, in call
return cls._variable_v1_call(*args, **kwargs)
File “/tensorflow_core/python/ops/variables.py”, line 204, in _variable_v1_call
return previous_getter(
File “/tensorflow_core/python/ops/variables.py”, line 197, in
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
File “/tensorflow_core/python/ops/variable_scope.py”, line 2505, in default_variable_creator
return variables.RefVariable(
File “/tensorflow_core/python/ops/variables.py”, line 262, in call
return super(VariableMetaclass, cls).call(*args, **kwargs)
File “/tensorflow_core/python/ops/variables.py”, line 1676, in init
self._init_from_args(
File “/tensorflow_core/python/ops/variables.py”, line 1823, in _init_from_args
self._variable = state_ops.variable_op_v2(
File “/tensorflow_core/python/ops/state_ops.py”, line 74, in variable_op_v2
return gen_state_ops.variable_v2(
File “/tensorflow_core/python/ops/gen_state_ops.py”, line 1619, in variable_v2
_, _, _op = _op_def_lib._apply_op_helper(
File “/tensorflow_core/python/framework/op_def_library.py”, line 792, in _apply_op_helper
op = g.create_op(op_type_name, inputs, dtypes=None, name=scope,
File “/tensorflow_core/python/util/deprecation.py”, line 513, in new_func
return func(*args, **kwargs)
File “/tensorflow_core/python/framework/ops.py”, line 3356, in create_op
return self._create_op_internal(op_type, inputs, dtypes, input_types, name,
File “/tensorflow_core/python/framework/ops.py”, line 3418, in _create_op_internal
ret = Operation(
File “/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()

Traceback (most recent call last):
File “/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py”, line 1349, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File “/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py”, line 1441, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
[[{{node conv2d_3/kernel/Adam}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/unet/scripts/train.py”, line 578, in
main()
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/unet/scripts/train.py”, line 570, in main
raise e
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/unet/scripts/train.py”, line 551, in main
run_experiment(config_path=args.experiment_spec_file,
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/unet/scripts/train.py”, line 423, in run_experiment
train_unet(results_dir, experiment_spec, ptm, model_file,
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/unet/scripts/train.py”, line 298, in train_unet
run_training_loop(estimator, dataset, params, unet_model,
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/unet/scripts/train.py”, line 131, in run_training_loop
estimator.train(
File “/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1193, in _train_model_default
return self._train_with_estimator_spec(estimator_spec, worker_hooks,
File “/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File “/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 750, in run
return self._sess.run(
File “/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1255, in run
return self._sess.run(
File “/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1360, in run
raise six.reraise(*original_exc_info)
File “/usr/local/lib/python3.8/dist-packages/six.py”, line 719, in reraise
raise value
File “/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1345, in run
return self._sess.run(*args, **kwargs)
File “/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1421, in run
hook.after_run(
File “/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/basic_session_run_hooks.py”, line 594, in after_run
if self._save(run_context.session, global_step):
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/unet/hooks/checkpoint_saver_hook.py”, line 85, in _save
self._save_checkpoint(session, step)
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/unet/hooks/checkpoint_saver_hook.py”, line 104, in _save_checkpoint
saver.save(session, os.path.join(ckzip_folder, “model.ckpt”), global_step=epoch)
File “/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saver.py”, line 1174, in save
model_checkpoint_path = sess.run(
File “/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py”, line 955, in run
result = self._run(None, fetches, feed_dict, options_ptr,
File “/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py”, line 1179, in _run
results = self._do_run(handle, final_targets, final_fetches,
File “/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py”, line 1358, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
File “/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
[[node conv2d_3/kernel/Adam (defined at /tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for ‘conv2d_3/kernel/Adam’:
File “/nvidia_tao_tf1/cv/unet/scripts/train.py”, line 578, in
main()
File “/nvidia_tao_tf1/cv/unet/scripts/train.py”, line 551, in main
run_experiment(config_path=args.experiment_spec_file,
File “/nvidia_tao_tf1/cv/unet/scripts/train.py”, line 423, in run_experiment
train_unet(results_dir, experiment_spec, ptm, model_file,
File “/nvidia_tao_tf1/cv/unet/scripts/train.py”, line 298, in train_unet
run_training_loop(estimator, dataset, params, unet_model,
File “/nvidia_tao_tf1/cv/unet/scripts/train.py”, line 131, in run_training_loop
estimator.train(
File “/tensorflow_estimator/python/estimator/estimator.py”, line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/tensorflow_estimator/python/estimator/estimator.py”, line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/tensorflow_estimator/python/estimator/estimator.py”, line 1190, in _train_model_default
estimator_spec = self._call_model_fn(
File “/tensorflow_estimator/python/estimator/estimator.py”, line 1149, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File “/nvidia_tao_tf1/cv/unet/utils/model_fn.py”, line 356, in unet_fn
train_op = opt.minimize(total_loss, gate_gradients=gate_gradients,
File “/tensorflow_core/python/training/optimizer.py”, line 428, in minimize
return self.apply_gradients(grads_and_vars, global_step=global_step,
File “/tensorflow_core/python/training/optimizer.py”, line 687, in apply_gradients
maybe_apply_op = smart_cond.smart_cond(should_apply_grads, apply_fn,
File “/tensorflow_core/python/framework/smart_cond.py”, line 58, in smart_cond
return control_flow_ops.cond(pred, true_fn=true_fn, false_fn=false_fn,
File “/tensorflow_core/python/util/deprecation.py”, line 513, in new_func
return func(*args, **kwargs)
File “/tensorflow_core/python/ops/control_flow_ops.py”, line 1224, in cond
orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
File “/tensorflow_core/python/ops/control_flow_ops.py”, line 1061, in BuildCondBranch
original_result = fn()
File “/tensorflow_core/python/training/optimizer.py”, line 640, in apply_fn
self._create_slots(var_list)
File “/tensorflow_core/python/training/adam.py”, line 131, in _create_slots
self._zeros_slot(v, “m”, self._name)
File “/tensorflow_core/python/training/optimizer.py”, line 1224, in _zeros_slot
new_slot_variable = slot_creator.create_zeros_slot(var, op_name)
File “/tensorflow_core/python/training/slot_creator.py”, line 188, in create_zeros_slot
return create_slot_with_initializer(
File “/tensorflow_core/python/training/slot_creator.py”, line 163, in create_slot_with_initializer
return _create_slot_var(primary, initializer, “”, validate_shape, shape,
File “/tensorflow_core/python/training/slot_creator.py”, line 67, in _create_slot_var
slot = variable_scope.get_variable(
File “/tensorflow_core/python/ops/variable_scope.py”, line 1484, in get_variable
return get_variable_scope().get_variable(
File “/tensorflow_core/python/ops/variable_scope.py”, line 1227, in get_variable
return var_store.get_variable(
File “/tensorflow_core/python/ops/variable_scope.py”, line 552, in get_variable
return _true_getter(
File “/tensorflow_core/python/ops/variable_scope.py”, line 505, in _true_getter
return self._get_single_variable(
File “/tensorflow_core/python/ops/variable_scope.py”, line 922, in _get_single_variable
v = variables.VariableV1(
File “/tensorflow_core/python/ops/variables.py”, line 258, in call
return cls._variable_v1_call(*args, **kwargs)
File “/tensorflow_core/python/ops/variables.py”, line 204, in _variable_v1_call
return previous_getter(
File “/tensorflow_core/python/ops/variables.py”, line 197, in
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
File “/tensorflow_core/python/ops/variable_scope.py”, line 2505, in default_variable_creator
return variables.RefVariable(
File “/tensorflow_core/python/ops/variables.py”, line 262, in call
return super(VariableMetaclass, cls).call(*args, **kwargs)
File “/tensorflow_core/python/ops/variables.py”, line 1676, in init
self._init_from_args(
File “/tensorflow_core/python/ops/variables.py”, line 1823, in _init_from_args
self._variable = state_ops.variable_op_v2(
File “/tensorflow_core/python/ops/state_ops.py”, line 74, in variable_op_v2
return gen_state_ops.variable_v2(
File “/tensorflow_core/python/ops/gen_state_ops.py”, line 1619, in variable_v2
_, _, _op = _op_def_lib._apply_op_helper(
File “/tensorflow_core/python/framework/op_def_library.py”, line 792, in _apply_op_helper
op = g.create_op(op_type_name, inputs, dtypes=None, name=scope,
File “/tensorflow_core/python/util/deprecation.py”, line 513, in new_func
return func(*args, **kwargs)
File “/tensorflow_core/python/framework/ops.py”, line 3356, in create_op
return self._create_op_internal(op_type, inputs, dtypes, input_types, name,
File “/tensorflow_core/python/framework/ops.py”, line 3418, in _create_op_internal
ret = Operation(
File “/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()

Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]:
Execution status: FAIL
2024-07-31 08:54:21,075 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Thanks

In the training spec file, a checkpoint will be saved every epoch according to checkpoint_interval: 1.
From your comments, if you set 10 epochs in total, all the checkpoints can be saved (you should get 10 checkpoint files in total).
If you set 20 epochs, epoch 12 fails.

So, could it be related to the disk space in your WSL?

When I delete all checkpoints and start from scratch, it usually crashes on the second checkpoint. I can then restart from checkpoint one, so it does not seem to be related to WSL disk space.

Also, when I check the available disk space for my Ubuntu-22.04 in WSL, I get 953G:

wsl --system -d Ubuntu-22.04 df -h /mnt/wslg/distro
Filesystem Size Used Avail Use% Mounted on
/dev/sdc 1007G 2.8G 953G 1% /mnt/wslg/distro
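
If the results directory also lives under /workspace (which is mounted from /mnt/c/TAO-Toolkit), the checkpoints actually land on the Windows C: drive; assuming the default drvfs mount, its free space can be checked the same way:

df -h /mnt/c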

Could you please share the nvidia-smi output under the WSL environment?

This is while running tao train with a batch size of 4:

(screenshot of nvidia-smi output)

As always, it died after the first two batches.

Please try to increase the swap memory for WSL. I am afraid it is due to a lack of CPU memory in the WSL system.
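
For reference, the WSL2 memory and swap limits can be raised with a .wslconfig file in the Windows user profile; the values below are only an example to adapt, and a wsl --shutdown from Windows is needed afterwards for the new limits to take effect:

# %UserProfile%\.wslconfig  (example values, adjust to your machine)
[wsl2]
memory=54GB
swap=32GB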

I increased the swap to 32GB but still have the same problem.

Just to clarify, this is the memory I have right now:
GPU RAM 16GB
WSL2 RAM: 54GB
WSL2 swap: 32GB (swap is not even used because I have enough memory)
(screenshot of memory and swap usage)
WSL2 disk space: 1TB

To narrow it down, can you set the input size to 128x128 and retry?

128x128 runs through all epochs without problems

So, it seems to be related to memory. You can try more combinations, for example 224x128.

I tried
224x128 → passed
352x208 → passed
384x224 → failed
400x400 → failed

How about 368x208?

368x208 → passed
368x224 → failed, but only at epoch 12, so it got pretty far

You can use this setting to run training. It is close to your original 400x224 and keeps a similar aspect ratio.
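
For reference, that amounts to changing only the input size in model_config of the training spec (the other fields stay as they are):

model_config {
  model_input_width: 368   # was 400
  model_input_height: 208  # was 224
  # remaining model_config fields unchanged
}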

Yes, I will use that, but the solution is not really satisfying. Maybe I need to set up a Linux machine without WSL and see if that makes training larger networks possible.

Yes, I suggest going that way.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks
