Key used to load the model is incorrect

artemchepurnoy · August 8, 2025, 10:51am

• Hardware: T4 (g4dn at AWS)
• Network Type: Yolo_v4_tiny (LPDNet)
• TLT Version toolkit_version: 6.0.0 published_date: 07/11/2025
• Training spec file attached

I am trying to retrain the LPDNet v2 model, yolov4_tiny_usa_trainable.tlt (unpruned_v2.1) (LPDNet | NVIDIA NGC), using the yolo_v4_tiny Jupyter notebook from tao_launcher_starter_kit.

When launching
!tao model yolo_v4_tiny train -e $SPECS_DIR/yolo_v4_tiny_train_kitti.txt
-r $USER_EXPERIMENT_DIR/experiment_dir_unpruned
--gpus 1
--key nvidia_tlt

I receive the following error:

INFO: Invalid model: /tmp/tmp654wve7q.hdf5, please check the key used to load the model
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 578, in load_keras_model
    return keras.models.load_model(filepath, custom_objects, compile=compile)
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/saving.py", line 417, in load_model
    f = h5dict(filepath, 'r')
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/io_utils.py", line 186, in __init__
    self.data = h5py.File(path, mode=mode)
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/files.py", line 312, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/files.py", line 142, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/train.py", line 165, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 717, in return_func
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 705, in return_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/train.py", line 161, in main
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/train.py", line 143, in main
    run_experiment(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/train.py", line 84, in run_experiment
    model = build_training_pipeline(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/models/utils.py", line 74, in build_training_pipeline
    yolov4.build_training_model(hvd)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/models/yolov4_model.py", line 480, in build_training_model
    self.load_pretrained_model(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/models/yolov4_model.py", line 308, in load_pretrained_model
    pretrained_model = model_io.load_model(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/utils/model_io.py", line 82, in load_model
    model = load_model(temp_file_name, experiment_spec, input_shape, None)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/utils/model_io.py", line 66, in load_model
    model = load_keras_model(model_path,
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 580, in load_keras_model
    raise ValueError(
ValueError: Invalid model: /tmp/tmp654wve7q.hdf5, please check the key used to load the model

I am using the key "nvidia_tlt", as given on the official model page (LPDNet | NVIDIA NGC).
I also tried "nvidia_tao" and "tlt_encode"; both produce the same error.

yolo_v4_tiny_train_kitti.txt (2.1 KB)

Morganh · August 9, 2025, 8:00am

The .tlt file is an encrypted version of an hdf5 file.
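
The "file signature not found" error in the traceback above is what h5py reports when the file it opens does not carry the HDF5 signature, which is presumably what happens when the .tlt file is decoded with a non-matching key or is not the expected file. As a rough sketch (the helper name looks_like_hdf5 is hypothetical and not part of TAO, and it only checks offset 0, whereas HDF5 also allows the signature after a user block), any file that is expected to be HDF5 can be checked like this:

```python
# The 8-byte signature that begins an HDF5 file (per the HDF5 file format spec).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if the file at `path` starts with the HDF5 signature.

    Assumption: the signature sits at offset 0, which is the common case;
    files with a user block place it at offset 512, 1024, ... instead.
    """
    with open(path, "rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
```

If this returns False for a file that should be a decoded model, the problem lies in the input file or the key rather than in Keras itself.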

Please run with the older docker image nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5, as follows.

$ docker run --runtime=nvidia -it --rm -v /local/path:/docker/path nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

Then, inside the docker,

#yolo_v4_tiny train xxx

artemchepurnoy · August 11, 2025, 7:32am

I ran it with the old version of docker, but got the same error again:

root@d4a557191e34:/workspace/tao-experiments# yolo_v4_tiny train -e specs/yolo_v4_tiny_train_kitti.txt -r yolo_v4_tiny/experiment_dir_unpruned --gpus 1 --key nvidia_tlt
Using TensorFlow backend.
2025-08-11 07:29:22.243103: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO: Log file already exists at /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/status.json
INFO: Starting Yolo_V4 Training job
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

INFO: Serial augmentation enabled = False
INFO: Pseudo sharding enabled = False
INFO: Max Image Dimensions (all sources): (0, 0)
INFO: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: -1
INFO: total dataset size 10000, number of sources: 1, batch size per gpu: 40, steps: 250
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x70418c596748>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x70418c596748>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x70418c596748>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x70418c596748>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
INFO: shuffle: True - shard 0 of 1
INFO: sampling 1 datasets with weights:
INFO: source: 0 weight: 1.000000
WARNING:tensorflow:Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x70418c2e9438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x70418c2e9438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x70418c2e9438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x70418c2e9438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO: Invalid model: /tmp/tmpkc40lih2.hdf5, please check the key used to load the model
Traceback (most recent call last):
  File "", line 568, in load_keras_model
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/saving.py", line 417, in load_model
    f = h5dict(filepath, 'r')
  File "/usr/local/lib/python3.6/dist-packages/keras/utils/io_utils.py", line 186, in __init__
    self.data = h5py.File(path, mode=mode)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 312, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 142, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/iva/yolo_v4/scripts/train.py>", line 3, in
  File "", line 152, in
  File "", line 707, in return_func
  File "", line 695, in return_func
  File "", line 148, in main
  File "", line 133, in main
  File "", line 78, in run_experiment
  File "", line 71, in build_training_pipeline
  File "", line 481, in build_training_model
  File "", line 311, in load_pretrained_model
  File "", line 70, in load_model
  File "", line 55, in load_model
  File "", line 571, in load_keras_model
ValueError: Invalid model: /tmp/tmpkc40lih2.hdf5, please check the key used to load the model
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>
Execution status: FAIL

Morganh · August 11, 2025, 7:43am

Can you double check if the pretrained model exists inside the docker and also if the md5sum is correct?

#md5sum /workspace/tao-experiments/yolo_v4_tiny/yolov4_tiny_usa_trainable/yolov4_tiny_usa_trainable.tlt

artemchepurnoy · August 11, 2025, 8:04am

Yes, the model is inside the docker container. Here is the md5sum: 29b5033466906ac2fe8423269908c855.
However, I can't find a reference md5sum for the model on the NGC model page (LPDNet | NVIDIA NGC) to compare it with.

Morganh · August 11, 2025, 8:29am

I downloaded it, and its md5sum is as follows.

$ md5sum yolov4_tiny_usa_trainable.tlt
a7bb9224b44b042217d2e5c24f26ec5a yolov4_tiny_usa_trainable.tlt

Please download it again.
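
The same comparison can be scripted so that a stale or mismatched download is caught before training starts. The following is only a sketch: the helper names md5sum and matches_reference are made up for illustration, and the reference value is simply the one reported in this thread, not an officially published checksum.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large model files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_reference(path, expected_hex):
    """Compare a file's MD5 digest with an expected hex string (case-insensitive)."""
    return md5sum(path) == expected_hex.strip().lower()
```

For example, matches_reference("/workspace/tao-experiments/yolo_v4_tiny/yolov4_tiny_usa_trainable/yolov4_tiny_usa_trainable.tlt", "a7bb9224b44b042217d2e5c24f26ec5a") would return False for a file whose digest differs, such as the 29b5033466906ac2fe8423269908c855 one reported above.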

artemchepurnoy · August 11, 2025, 12:56pm

I had probably tried the unpruned_v2.0 model earlier and left that file there.
After downloading the v2.1 model again, the training launched successfully.

However, it fails after the 5th epoch. Here is the log:

root@e9dcc224184e:/workspace/tao-experiments# yolo_v4_tiny train -e specs/yolo_v4_tiny_train_kitti.txt -r yolo_v4_tiny/experiment_dir_unpruned --gpus 1 --key nvidia_tlt
Using TensorFlow backend.
2025-08-11 11:43:49.398169: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO: Log file already exists at /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/status.json
INFO: Starting Yolo_V4 Training job
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

INFO: Serial augmentation enabled = False
INFO: Pseudo sharding enabled = False
INFO: Max Image Dimensions (all sources): (0, 0)
INFO: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: -1
INFO: total dataset size 10000, number of sources: 1, batch size per gpu: 20, steps: 500
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae714042710>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae714042710>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae714042710>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae714042710>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
INFO: shuffle: True - shard 0 of 1
INFO: sampling 1 datasets with weights:
INFO: source: 0 weight: 1.000000
WARNING:tensorflow:Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae6e858c438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae6e858c438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae6e858c438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae6e858c438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

INFO: Serial augmentation enabled = False
INFO: Pseudo sharding enabled = False
INFO: Max Image Dimensions (all sources): (0, 0)
INFO: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: -1
INFO: total dataset size 1591, number of sources: 1, batch size per gpu: 8, steps: 199
WARNING:tensorflow:Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae615fb5470>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae615fb5470>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae615fb5470>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae615fb5470>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
INFO: shuffle: False - shard 0 of 1
INFO: sampling 1 datasets with weights:
INFO: source: 0 weight: 1.000000
WARNING:tensorflow:Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae615dd8b00>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae615dd8b00>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae615dd8b00>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae615dd8b00>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO: Log file already exists at /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/status.json
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
Input (InputLayer)              (None, 3, None, None 0
__________________________________________________________________________________________________
conv_0 (Conv2D)                 (None, 32, None, Non 864         Input[0][0]
__________________________________________________________________________________________________
conv_0_bn (BatchNormalization)  (None, 32, None, Non 128         conv_0[0][0]
__________________________________________________________________________________________________
conv_0_mish (LeakyReLU)         (None, 32, None, Non 0           conv_0_bn[0][0]
__________________________________________________________________________________________________
conv_1 (Conv2D)                 (None, 64, None, Non 18432       conv_0_mish[0][0]
__________________________________________________________________________________________________
conv_1_bn (BatchNormalization)  (None, 64, None, Non 256         conv_1[0][0]
__________________________________________________________________________________________________
conv_1_mish (LeakyReLU)         (None, 64, None, Non 0           conv_1_bn[0][0]
__________________________________________________________________________________________________
conv_2_conv_0 (Conv2D)          (None, 64, None, Non 36864       conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_2_conv_0_bn (BatchNormaliz (None, 64, None, Non 256         conv_2_conv_0[0][0]
__________________________________________________________________________________________________
conv_2_conv_0_mish (LeakyReLU)  (None, 64, None, Non 0           conv_2_conv_0_bn[0][0]
__________________________________________________________________________________________________
conv_2_split_0 (Split)          (None, 32, None, Non 0           conv_2_conv_0_mish[0][0]
__________________________________________________________________________________________________
conv_2_conv_1 (Conv2D)          (None, 32, None, Non 9216        conv_2_split_0[0][0]
__________________________________________________________________________________________________
conv_2_conv_1_bn (BatchNormaliz (None, 32, None, Non 128         conv_2_conv_1[0][0]
__________________________________________________________________________________________________
conv_2_conv_1_mish (LeakyReLU)  (None, 32, None, Non 0           conv_2_conv_1_bn[0][0]
__________________________________________________________________________________________________
conv_2_conv_2 (Conv2D)          (None, 32, None, Non 9216        conv_2_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_2_conv_2_bn (BatchNormaliz (None, 32, None, Non 128         conv_2_conv_2[0][0]
__________________________________________________________________________________________________
conv_2_conv_2_mish (LeakyReLU)  (None, 32, None, Non 0           conv_2_conv_2_bn[0][0]
__________________________________________________________________________________________________
conv_2_concat_0 (Concatenate)   (None, 64, None, Non 0           conv_2_conv_2_mish[0][0]
                                                                 conv_2_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_2_conv_3 (Conv2D)          (None, 64, None, Non 4096        conv_2_concat_0[0][0]
__________________________________________________________________________________________________
conv_2_conv_3_bn (BatchNormaliz (None, 64, None, Non 256         conv_2_conv_3[0][0]
__________________________________________________________________________________________________
conv_2_conv_3_mish (LeakyReLU)  (None, 64, None, Non 0           conv_2_conv_3_bn[0][0]
__________________________________________________________________________________________________
conv_2_concat_1 (Concatenate)   (None, 128, None, No 0           conv_2_conv_0_mish[0][0]
                                                                 conv_2_conv_3_mish[0][0]
__________________________________________________________________________________________________
conv_2_pool_0 (MaxPooling2D)    (None, 128, None, No 0           conv_2_concat_1[0][0]
__________________________________________________________________________________________________
conv_3_conv_0 (Conv2D)          (None, 128, None, No 147456      conv_2_pool_0[0][0]
__________________________________________________________________________________________________
conv_3_conv_0_bn (BatchNormaliz (None, 128, None, No 512         conv_3_conv_0[0][0]
__________________________________________________________________________________________________
conv_3_conv_0_mish (LeakyReLU)  (None, 128, None, No 0           conv_3_conv_0_bn[0][0]
__________________________________________________________________________________________________
conv_3_split_0 (Split)          (None, 64, None, Non 0           conv_3_conv_0_mish[0][0]
__________________________________________________________________________________________________
conv_3_conv_1 (Conv2D)          (None, 64, None, Non 36864       conv_3_split_0[0][0]
__________________________________________________________________________________________________
conv_3_conv_1_bn (BatchNormaliz (None, 64, None, Non 256         conv_3_conv_1[0][0]
__________________________________________________________________________________________________
conv_3_conv_1_mish (LeakyReLU)  (None, 64, None, Non 0           conv_3_conv_1_bn[0][0]
__________________________________________________________________________________________________
conv_3_conv_2 (Conv2D)          (None, 64, None, Non 36864       conv_3_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_3_conv_2_bn (BatchNormaliz (None, 64, None, Non 256         conv_3_conv_2[0][0]
__________________________________________________________________________________________________
conv_3_conv_2_mish (LeakyReLU)  (None, 64, None, Non 0           conv_3_conv_2_bn[0][0]
__________________________________________________________________________________________________
conv_3_concat_0 (Concatenate)   (None, 128, None, No 0           conv_3_conv_2_mish[0][0]
                                                                 conv_3_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_3_conv_3 (Conv2D)          (None, 128, None, No 16384       conv_3_concat_0[0][0]
__________________________________________________________________________________________________
conv_3_conv_3_bn (BatchNormaliz (None, 128, None, No 512         conv_3_conv_3[0][0]
__________________________________________________________________________________________________
conv_3_conv_3_mish (LeakyReLU)  (None, 128, None, No 0           conv_3_conv_3_bn[0][0]
__________________________________________________________________________________________________
conv_3_concat_1 (Concatenate)   (None, 256, None, No 0           conv_3_conv_0_mish[0][0]
                                                                 conv_3_conv_3_mish[0][0]
__________________________________________________________________________________________________
conv_3_pool_0 (MaxPooling2D)    (None, 256, None, No 0           conv_3_concat_1[0][0]
__________________________________________________________________________________________________
conv_4_conv_0 (Conv2D)          (None, 256, None, No 589824      conv_3_pool_0[0][0]
__________________________________________________________________________________________________
conv_4_conv_0_bn (BatchNormaliz (None, 256, None, No 1024        conv_4_conv_0[0][0]
__________________________________________________________________________________________________
conv_4_conv_0_mish (LeakyReLU)  (None, 256, None, No 0           conv_4_conv_0_bn[0][0]
__________________________________________________________________________________________________
conv_4_split_0 (Split)          (None, 128, None, No 0           conv_4_conv_0_mish[0][0]
__________________________________________________________________________________________________
conv_4_conv_1 (Conv2D)          (None, 128, None, No 147456      conv_4_split_0[0][0]
__________________________________________________________________________________________________
conv_4_conv_1_bn (BatchNormaliz (None, 128, None, No 512         conv_4_conv_1[0][0]
__________________________________________________________________________________________________
conv_4_conv_1_mish (LeakyReLU)  (None, 128, None, No 0           conv_4_conv_1_bn[0][0]
__________________________________________________________________________________________________
conv_4_conv_2 (Conv2D)          (None, 128, None, No 147456      conv_4_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_4_conv_2_bn (BatchNormaliz (None, 128, None, No 512         conv_4_conv_2[0][0]
__________________________________________________________________________________________________
conv_4_conv_2_mish (LeakyReLU)  (None, 128, None, No 0           conv_4_conv_2_bn[0][0]
__________________________________________________________________________________________________
conv_4_concat_0 (Concatenate)   (None, 256, None, No 0           conv_4_conv_2_mish[0][0]
                                                                 conv_4_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_4_conv_3 (Conv2D)          (None, 256, None, No 65536       conv_4_concat_0[0][0]
__________________________________________________________________________________________________
conv_4_conv_3_bn (BatchNormaliz (None, 256, None, No 1024        conv_4_conv_3[0][0]
__________________________________________________________________________________________________
conv_4_conv_3_mish (LeakyReLU)  (None, 256, None, No 0           conv_4_conv_3_bn[0][0]
__________________________________________________________________________________________________
conv_4_concat_1 (Concatenate)   (None, 512, None, No 0           conv_4_conv_0_mish[0][0]
                                                                 conv_4_conv_3_mish[0][0]
__________________________________________________________________________________________________
conv_4_pool_0 (MaxPooling2D)    (None, 512, None, No 0           conv_4_concat_1[0][0]
__________________________________________________________________________________________________
conv_5 (Conv2D)                 (None, 512, None, No 2359296     conv_4_pool_0[0][0]
__________________________________________________________________________________________________
conv_5_bn (BatchNormalization)  (None, 512, None, No 2048        conv_5[0][0]
__________________________________________________________________________________________________
conv_5_mish (LeakyReLU)         (None, 512, None, No 0           conv_5_bn[0][0]
__________________________________________________________________________________________________
yolo_conv1_1 (Conv2D)           (None, 256, None, No 131072      conv_5_mish[0][0]
__________________________________________________________________________________________________
yolo_conv1_1_bn (BatchNormaliza (None, 256, None, No 1024        yolo_conv1_1[0][0]
__________________________________________________________________________________________________
yolo_conv1_1_lrelu (LeakyReLU)  (None, 256, None, No 0           yolo_conv1_1_bn[0][0]
__________________________________________________________________________________________________
yolo_conv2 (Conv2D)             (None, 128, None, No 32768       yolo_conv1_1_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv2_bn (BatchNormalizati (None, 128, None, No 512         yolo_conv2[0][0]
__________________________________________________________________________________________________
yolo_conv2_lrelu (LeakyReLU)    (None, 128, None, No 0           yolo_conv2_bn[0][0]
__________________________________________________________________________________________________
upsample0 (UpSampling2D)        (None, 128, None, No 0           yolo_conv2_lrelu[0][0]
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (None, 384, None, No 0           upsample0[0][0]
                                                                 conv_4_conv_3_mish[0][0]
__________________________________________________________________________________________________
yolo_conv1_6 (Conv2D)           (None, 512, None, No 1179648     yolo_conv1_1_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv3_6 (Conv2D)           (None, 256, None, No 884736      concatenate_2[0][0]
__________________________________________________________________________________________________
yolo_conv1_6_bn (BatchNormaliza (None, 512, None, No 2048        yolo_conv1_6[0][0]
__________________________________________________________________________________________________
yolo_conv3_6_bn (BatchNormaliza (None, 256, None, No 1024        yolo_conv3_6[0][0]
__________________________________________________________________________________________________
yolo_conv1_6_lrelu (LeakyReLU)  (None, 512, None, No 0           yolo_conv1_6_bn[0][0]
__________________________________________________________________________________________________
yolo_conv3_6_lrelu (LeakyReLU)  (None, 256, None, No 0           yolo_conv3_6_bn[0][0]
__________________________________________________________________________________________________
conv_big_object (Conv2D)        (None, 18, None, Non 9234        yolo_conv1_6_lrelu[0][0]
__________________________________________________________________________________________________
conv_mid_object (Conv2D)        (None, 18, None, Non 4626        yolo_conv3_6_lrelu[0][0]
__________________________________________________________________________________________________
bg_permute (Permute)            (None, None, None, 1 0           conv_big_object[0][0]
__________________________________________________________________________________________________
md_permute (Permute)            (None, None, None, 1 0           conv_mid_object[0][0]
__________________________________________________________________________________________________
bg_reshape (Reshape)            (None, None, 6)      0           bg_permute[0][0]
__________________________________________________________________________________________________
md_reshape (Reshape)            (None, None, 6)      0           md_permute[0][0]
__________________________________________________________________________________________________
bg_anchor (YOLOAnchorBox)       (None, None, 6)      0           conv_big_object[0][0]
__________________________________________________________________________________________________
bg_bbox_processor (BBoxPostProc (None, None, 6)      0           bg_reshape[0][0]
__________________________________________________________________________________________________
md_anchor (YOLOAnchorBox)       (None, None, 6)      0           conv_mid_object[0][0]
__________________________________________________________________________________________________
md_bbox_processor (BBoxPostProc (None, None, 6)      0           md_reshape[0][0]
__________________________________________________________________________________________________
encoded_bg (Concatenate)        (None, None, 12)     0           bg_anchor[0][0]
                                                                 bg_bbox_processor[0][0]
__________________________________________________________________________________________________
encoded_md (Concatenate)        (None, None, 12)     0           md_anchor[0][0]
                                                                 md_bbox_processor[0][0]
__________________________________________________________________________________________________
encoded_detections (Concatenate (None, None, 12)     0           encoded_bg[0][0]
                                                                 encoded_md[0][0]
==================================================================================================
Total params: 5,880,324
Trainable params: 5,874,116
Non-trainable params: 6,208
__________________________________________________________________________________________________
INFO: Starting Training Loop.
Epoch 1/80
1250/1250 [==============================] - 972s 778ms/step - loss: 18.4968
e9dcc224184e:227:246 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
e9dcc224184e:227:246 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
e9dcc224184e:227:246 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
e9dcc224184e:227:246 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
e9dcc224184e:227:246 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
e9dcc224184e:227:246 [0] NCCL INFO cudaDriverVersion 12080
NCCL version 2.15.5+cuda11.8
e9dcc224184e:227:246 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
e9dcc224184e:227:246 [0] NCCL INFO P2P plugin IBext
e9dcc224184e:227:246 [0] NCCL INFO NET/IB : No device found.
e9dcc224184e:227:246 [0] NCCL INFO NET/IB : No device found.
e9dcc224184e:227:246 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
e9dcc224184e:227:246 [0] NCCL INFO Using network Socket
e9dcc224184e:227:246 [0] NCCL INFO Channel 00/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 01/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 02/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 03/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 04/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 05/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 06/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 07/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 08/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 09/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 10/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 11/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 12/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 13/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 14/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 15/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 16/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 17/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 18/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 19/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 20/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 21/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 22/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 23/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 24/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 25/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 26/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 27/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 28/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 29/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 30/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 31/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
e9dcc224184e:227:246 [0] NCCL INFO Connected all rings
e9dcc224184e:227:246 [0] NCCL INFO Connected all trees
e9dcc224184e:227:246 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
e9dcc224184e:227:246 [0] NCCL INFO comm 0x7ae2dc220dc0 rank 0 nranks 1 cudaDev 0 busId 1e0 - Init COMPLETE
INFO: Training loop in progress
Epoch 2/80
1250/1250 [==============================] - 751s 601ms/step - loss: 8.2688
INFO: Training loop in progress
Epoch 3/80
1250/1250 [==============================] - 679s 543ms/step - loss: 6.5264
INFO: Training loop in progress
Epoch 4/80
1250/1250 [==============================] - 620s 496ms/step - loss: 5.7025
INFO: Training loop in progress
Epoch 5/80
1249/1250 [============================>.] - ETA: 0s - loss: 4.5735Killed
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>
Execution status: FAIL

I tried it twice; it failed at the end of the 5th epoch both times.

Morganh · August 12, 2025, 8:31am

Glad to know it is working now.

artemchepurnoy:

Epoch 4/80
1250/1250 [==============================] - 620s 496ms/step - loss: 5.7025
INFO: Training loop in progress
Epoch 5/80
1249/1250 [============================>.] - ETA: 0s - loss: 4.5735Killed

It seems the training was killed due to an out-of-memory condition.
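
For context, a bare `Killed` in the log is what the shell prints when a process receives SIGKILL. On Linux this is often the kernel OOM killer reacting to exhausted host RAM, not GPU memory. A minimal sketch for checking the kernel log on the host follows. It assumes the kernel message contains `Killed process <pid> (<name>)`, and that wording can vary between kernel versions. The names `find_oom_kills` and `read_kernel_log` are hypothetical helpers, not part of TAO.

```python
import re
import subprocess

# Assumed message shape, e.g.
#   "Out of memory: Killed process 227 (python) total-vm:..."
# The exact wording can differ between kernel versions.
OOM_PATTERN = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log: str):
    """Return (pid, process_name) pairs for OOM-killer entries in a kernel log."""
    return [(int(pid), name) for pid, name in OOM_PATTERN.findall(kernel_log)]


def read_kernel_log() -> str:
    """Read the kernel ring buffer via dmesg; returns '' if dmesg is unavailable."""
    try:
        result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    except OSError:
        return ""
    return result.stdout


if __name__ == "__main__":
    for pid, name in find_oom_kills(read_kernel_log()):
        print(f"OOM killer terminated pid={pid} ({name})")
```

Run this on the host rather than inside the training container, since the container usually cannot read the host kernel log. Reading `dmesg` may also require root privileges.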

Which dGPU are you using? Please check the output of nvidia-smi.
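
If it helps to capture GPU memory usage while training runs, here is a small sketch that wraps `nvidia-smi` with its `--query-gpu` and `--format=csv,noheader,nounits` options. The helper names `parse_gpu_memory` and `query_gpu_memory` are made up for illustration, and the parsing assumes one comma-separated row per GPU.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str):
    """Parse nvidia-smi CSV rows into dicts with memory values in MiB."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, used, total = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "used_mib": int(used), "total_mib": int(total)})
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return parsed rows; empty list if it is missing or fails."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpu_memory():
        print(f"{gpu['name']}: {gpu['used_mib']} / {gpu['total_mib']} MiB used")
```

Polling this every few seconds during an epoch shows whether GPU memory is actually close to its limit when the process dies.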

Please try to use a lower batch-size.
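
A rough sketch of lowering the batch size in the training spec programmatically. It assumes the spec is a protobuf-text file with a `batch_size_per_gpu` field under `training_config`, so please check this against your own spec file. `set_batch_size` is a hypothetical helper and the fragment in the example is illustrative, not the poster's actual spec.

```python
import re


def set_batch_size(spec_text: str, batch_size: int) -> str:
    """Return spec_text with every `batch_size_per_gpu: N` set to batch_size."""
    pattern = re.compile(r"(batch_size_per_gpu\s*:\s*)\d+")
    # A function replacement avoids ambiguity between group refs and digits.
    new_text, count = pattern.subn(lambda m: f"{m.group(1)}{batch_size}", spec_text)
    if count == 0:
        raise ValueError("batch_size_per_gpu not found in spec")
    return new_text


if __name__ == "__main__":
    # Hypothetical fragment; a real spec has many more fields.
    example = "training_config {\n  batch_size_per_gpu: 8\n  num_epochs: 80\n}\n"
    print(set_batch_size(example, 4))
```

Editing the value by hand in the spec file works just as well. The helper is only convenient when trying several batch sizes from a notebook.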

artemchepurnoy · August 17, 2025, 5:48am

I finished training with batch-size = 4.
Thank you, @Morganh

system · August 31, 2025, 5:49am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
