To avoid NaN loss, please set the output_width >= 100 and then restart the training.
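A plausible reason for the 100 threshold (not stated in the log, so treat it as an assumption): the baseline LPRnet backbone appears to downsample the input width by 4x before CTC decoding, and the toolkit evidently wants more than 2 * max_label_length time steps. With max_label_length: 12 in the spec below:

    output_width 96  -> 96 / 4  = 24 time steps (24 <= 2 * 12, rejected)
    output_width 100 -> 100 / 4 = 25 time steps (25 >  2 * 12, accepted)

which would explain why the message asks for output_width >= 100.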

Please provide the following information when requesting support.

• Hardware (GTX 2070)
• Network Type (LPRnet)
• TLT Version (3.22.05)

I am trying to train an LPRnet from scratch on license plate images in a language other than English. My training spec file is below:

random_seed: 42
lpr_config {
  hidden_units: 1024
  max_label_length: 12
  arch: "baseline"
  nlayers: 18 # set nlayers to 18 to use the baseline18 model
}
training_config {
  batch_size_per_gpu: 32
  num_epochs: 200
  learning_rate {
  soft_start_annealing_schedule {
    min_learning_rate: 1e-6
    max_learning_rate: 1e-4
    soft_start: 0.001
    annealing: 0.5
  }
  }
  regularizer {
    type: L2
    weight: 5e-4
  }
}
eval_config {
  validation_period_during_training: 5
  batch_size: 32
}
augmentation_config {
    output_width: 96
    output_height: 48
    output_channel: 3
    max_rotate_degree: 5
    rotate_prob: 0.5
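    # gaussian_kernel_size is a repeated field, so listing several values is valid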
    gaussian_kernel_size: 5
    gaussian_kernel_size: 7
    gaussian_kernel_size: 15
    blur_prob: 0.5
    reverse_color_prob: 0.5
    keep_original_prob: 0.3
}
dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tao-experiments/data/openalpr/train/labels"
    image_directory_path: "/workspace/tao-experiments/data/openalpr/train/images"
  }
  characters_list_file: "/workspace/tao-experiments/lprnet/specs/us_lp_characters.txt"
  validation_data_sources: {
    label_directory_path: "/workspace/tao-experiments/data/openalpr/val/labels"
    image_directory_path: "/workspace/tao-experiments/data/openalpr/val/images"
  }
}

When I try to run the command:

!tao lprnet train --gpus=1 --gpu_index=$GPU_INDEX \
                  -e $SPECS_DIR/tutorial_spec.txt \
                  -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                  -k $KEY 

But I get the following error:

For multi-GPU, change --gpus based on your machine.
2022-08-26 19:07:56,844 [INFO] root: Registry: ['nvcr.io']
2022-08-26 19:07:56,918 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-590wgwtr because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:60: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:60: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:63: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:63: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:64: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:64: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.

INFO: Log file already exists at /workspace/tao-experiments/lprnet/experiment_dir_unpruned/status.json
INFO: Merging specification from /workspace/tao-experiments/lprnet/specs/tutorial_spec.txt
Initialize optimizer
INFO: To avoid NaN loss, please set the output_width >= 100. And then restart the training.
INFO: Training was interrupted
INFO: Training was interrupted.
2022-08-26 19:08:10,428 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I don't see any output_width field in the spec file that I can change. Please help.

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

There is an output_width field in the spec file, under augmentation_config.
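For illustration, a minimal sketch of the change in the posted augmentation_config; only the width is dictated by the error message, so keeping output_height: 48 and the other augmentation values unchanged is an assumption:

augmentation_config {
    output_width: 100   # raised from 96 to satisfy the >= 100 check
    output_height: 48
    output_channel: 3
    max_rotate_degree: 5
    rotate_prob: 0.5
    gaussian_kernel_size: 5
    gaussian_kernel_size: 7
    gaussian_kernel_size: 15
    blur_prob: 0.5
    reverse_color_prob: 0.5
    keep_original_prob: 0.3
}

After editing the spec, the same tao lprnet train command can be rerun as before.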

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.