Key used to load the model is incorrect

• Hardware: T4 (g4dn at AWS)
• Network Type: Yolo_v4_tiny (LPDNet)
• TLT Version toolkit_version: 6.0.0 published_date: 07/11/2025
• Training spec file attached

I am trying to retrain the LPDNet v2 model, yolov4_tiny_usa_trainable.tlt (unpruned_v2.1) (LPDNet | NVIDIA NGC), using the yolo_v4_tiny Jupyter notebook from tao_launcher_starter_kit.

When launching
!tao model yolo_v4_tiny train -e $SPECS_DIR/yolo_v4_tiny_train_kitti.txt \
    -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
    --gpus 1 \
    --key nvidia_tlt

I receive the following error:

INFO: Invalid model: /tmp/tmp654wve7q.hdf5, please check the key used to load the model
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 578, in load_keras_model
    return keras.models.load_model(filepath, custom_objects, compile=compile)
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/saving.py", line 417, in load_model
    f = h5dict(filepath, 'r')
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/io_utils.py", line 186, in __init__
    self.data = h5py.File(path, mode=mode)
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/files.py", line 312, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/files.py", line 142, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/train.py", line 165, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 717, in return_func
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 705, in return_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/train.py", line 161, in main
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/train.py", line 143, in main
    run_experiment(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/train.py", line 84, in run_experiment
    model = build_training_pipeline(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/models/utils.py", line 74, in build_training_pipeline
    yolov4.build_training_model(hvd)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/models/yolov4_model.py", line 480, in build_training_model
    self.load_pretrained_model(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/models/yolov4_model.py", line 308, in load_pretrained_model
    pretrained_model = model_io.load_model(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/utils/model_io.py", line 82, in load_model
    model = load_model(temp_file_name, experiment_spec, input_shape, None)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/utils/model_io.py", line 66, in load_model
    model = load_keras_model(model_path,
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 580, in load_keras_model
    raise ValueError(
ValueError: Invalid model: /tmp/tmp654wve7q.hdf5, please check the key used to load the model

I use the key "nvidia_tlt" according to the official model page (LPDNet | NVIDIA NGC).
I also tried "nvidia_tao" and "tlt_encode"; they produce the same error.

yolo_v4_tiny_train_kitti.txt (2.1 KB)

The .tlt file is an encrypted version of an hdf5 file.
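The "file signature not found" error in the traceback is what h5py reports when the file it opens does not begin with the HDF5 signature bytes, which is presumably what happens when the .tlt is decoded with a wrong key or the wrong file is supplied. Below is a minimal Python sketch of that check. It assumes the signature sits at offset 0 (HDF5 also permits later offsets when a user block is present), and `looks_like_hdf5` is a hypothetical helper for illustration, not part of TAO.

```python
# The 8-byte signature every HDF5 file starts with (when no user block is present).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if the file at `path` begins with the HDF5 signature.

    Hypothetical helper: it only inspects the first 8 bytes and does not
    validate the rest of the file.
    """
    with open(path, "rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
```

Running this against the temporary .hdf5 file named in the error message (if it still exists) would show whether the decoded output is a real HDF5 file at all.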

Please run with an older version of the docker image, nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5, as follows.

$ docker run --runtime=nvidia -it --rm -v /local/path:/docker/path nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

Then, inside the docker,

#yolo_v4_tiny train xxx

I ran with the old version of the docker image, but got the same error again:

root@d4a557191e34:/workspace/tao-experiments# yolo_v4_tiny train -e specs/yolo_v4_tiny_train_kitti.txt -r yolo_v4_tiny/experiment_dir_unpruned --gpus 1 --key nvidia_tlt
Using TensorFlow backend.
2025-08-11 07:29:22.243103: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO: Log file already exists at /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/status.json
INFO: Starting Yolo_V4 Training job
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

INFO: Serial augmentation enabled = False
INFO: Pseudo sharding enabled = False
INFO: Max Image Dimensions (all sources): (0, 0)
INFO: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: -1
INFO: total dataset size 10000, number of sources: 1, batch size per gpu: 40, steps: 250
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x70418c596748>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x70418c596748>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x70418c596748>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x70418c596748>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
INFO: shuffle: True - shard 0 of 1
INFO: sampling 1 datasets with weights:
INFO: source: 0 weight: 1.000000
WARNING:tensorflow:Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x70418c2e9438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x70418c2e9438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x70418c2e9438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x70418c2e9438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO: Invalid model: /tmp/tmpkc40lih2.hdf5, please check the key used to load the model
Traceback (most recent call last):
  File "", line 568, in load_keras_model
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/saving.py", line 417, in load_model
    f = h5dict(filepath, 'r')
  File "/usr/local/lib/python3.6/dist-packages/keras/utils/io_utils.py", line 186, in __init__
    self.data = h5py.File(path, mode=mode)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 312, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 142, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/iva/yolo_v4/scripts/train.py>", line 3, in <module>
  File "", line 152, in <module>
  File "", line 707, in return_func
  File "", line 695, in return_func
  File "", line 148, in main
  File "", line 133, in main
  File "", line 78, in run_experiment
  File "", line 71, in build_training_pipeline
  File "", line 481, in build_training_model
  File "", line 311, in load_pretrained_model
  File "", line 70, in load_model
  File "", line 55, in load_model
  File "", line 571, in load_keras_model
ValueError: Invalid model: /tmp/tmpkc40lih2.hdf5, please check the key used to load the model
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>
Execution status: FAIL

Can you double-check that the pretrained model exists inside the docker container and that its md5sum is correct?

#md5sum /workspace/tao-experiments/yolo_v4_tiny/yolov4_tiny_usa_trainable/yolov4_tiny_usa_trainable.tlt

Yes, the model is inside the docker container. Here is the md5sum: 29b5033466906ac2fe8423269908c855.
However, I can't find a reference md5sum for the model on the NGC model page (LPDNet | NVIDIA NGC) to compare it with.

I downloaded it, and its md5sum is as follows.

$ md5sum yolov4_tiny_usa_trainable.tlt
a7bb9224b44b042217d2e5c24f26ec5a yolov4_tiny_usa_trainable.tlt

Please download it again.
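The comparison above can also be scripted so that a stale or mismatched download is caught before training starts. This is a minimal Python sketch using the standard `hashlib` module; `md5sum` and `matches_reference` are hypothetical helper names, and the reference value to pass in is the checksum reported in this thread, not one published on the NGC page.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_reference(path, expected_md5):
    """Compare a file's MD5 digest with an expected hex string (case-insensitive)."""
    return md5sum(path) == expected_md5.strip().lower()
```

For example, `matches_reference("yolov4_tiny_usa_trainable.tlt", "a7bb9224b44b042217d2e5c24f26ec5a")` would return False for the file whose md5sum was reported as 29b5033466906ac2fe8423269908c855.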

I had probably tried the unpruned_v2.0 model earlier and left it there.
After downloading the v2.1 model again, training launched successfully.

However, it fails after the 5th epoch. Here is the log:

root@e9dcc224184e:/workspace/tao-experiments# yolo_v4_tiny train -e specs/yolo_v4_tiny_train_kitti.txt -r yolo_v4_tiny/experiment_dir_unpruned --gpus 1 --key nvidia_tlt
Using TensorFlow backend.
2025-08-11 11:43:49.398169: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO: Log file already exists at /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/status.json
INFO: Starting Yolo_V4 Training job
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

INFO: Serial augmentation enabled = False
INFO: Pseudo sharding enabled = False
INFO: Max Image Dimensions (all sources): (0, 0)
INFO: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: -1
INFO: total dataset size 10000, number of sources: 1, batch size per gpu: 20, steps: 500
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae714042710>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae714042710>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae714042710>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae714042710>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
INFO: shuffle: True - shard 0 of 1
INFO: sampling 1 datasets with weights:
INFO: source: 0 weight: 1.000000
WARNING:tensorflow:Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae6e858c438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae6e858c438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae6e858c438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae6e858c438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

INFO: Serial augmentation enabled = False
INFO: Pseudo sharding enabled = False
INFO: Max Image Dimensions (all sources): (0, 0)
INFO: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: -1
INFO: total dataset size 1591, number of sources: 1, batch size per gpu: 8, steps: 199
WARNING:tensorflow:Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae615fb5470>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae615fb5470>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae615fb5470>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae615fb5470>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
INFO: shuffle: False - shard 0 of 1
INFO: sampling 1 datasets with weights:
INFO: source: 0 weight: 1.000000
WARNING:tensorflow:Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae615dd8b00>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae615dd8b00>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae615dd8b00>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae615dd8b00>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO: Log file already exists at /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/status.json
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
Input (InputLayer)              (None, 3, None, None 0
__________________________________________________________________________________________________
conv_0 (Conv2D)                 (None, 32, None, Non 864         Input[0][0]
__________________________________________________________________________________________________
conv_0_bn (BatchNormalization)  (None, 32, None, Non 128         conv_0[0][0]
__________________________________________________________________________________________________
conv_0_mish (LeakyReLU)         (None, 32, None, Non 0           conv_0_bn[0][0]
__________________________________________________________________________________________________
conv_1 (Conv2D)                 (None, 64, None, Non 18432       conv_0_mish[0][0]
__________________________________________________________________________________________________
conv_1_bn (BatchNormalization)  (None, 64, None, Non 256         conv_1[0][0]
__________________________________________________________________________________________________
conv_1_mish (LeakyReLU)         (None, 64, None, Non 0           conv_1_bn[0][0]
__________________________________________________________________________________________________
conv_2_conv_0 (Conv2D)          (None, 64, None, Non 36864       conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_2_conv_0_bn (BatchNormaliz (None, 64, None, Non 256         conv_2_conv_0[0][0]
__________________________________________________________________________________________________
conv_2_conv_0_mish (LeakyReLU)  (None, 64, None, Non 0           conv_2_conv_0_bn[0][0]
__________________________________________________________________________________________________
conv_2_split_0 (Split)          (None, 32, None, Non 0           conv_2_conv_0_mish[0][0]
__________________________________________________________________________________________________
conv_2_conv_1 (Conv2D)          (None, 32, None, Non 9216        conv_2_split_0[0][0]
__________________________________________________________________________________________________
conv_2_conv_1_bn (BatchNormaliz (None, 32, None, Non 128         conv_2_conv_1[0][0]
__________________________________________________________________________________________________
conv_2_conv_1_mish (LeakyReLU)  (None, 32, None, Non 0           conv_2_conv_1_bn[0][0]
__________________________________________________________________________________________________
conv_2_conv_2 (Conv2D)          (None, 32, None, Non 9216        conv_2_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_2_conv_2_bn (BatchNormaliz (None, 32, None, Non 128         conv_2_conv_2[0][0]
__________________________________________________________________________________________________
conv_2_conv_2_mish (LeakyReLU)  (None, 32, None, Non 0           conv_2_conv_2_bn[0][0]
__________________________________________________________________________________________________
conv_2_concat_0 (Concatenate)   (None, 64, None, Non 0           conv_2_conv_2_mish[0][0]
                                                                 conv_2_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_2_conv_3 (Conv2D)          (None, 64, None, Non 4096        conv_2_concat_0[0][0]
__________________________________________________________________________________________________
conv_2_conv_3_bn (BatchNormaliz (None, 64, None, Non 256         conv_2_conv_3[0][0]
__________________________________________________________________________________________________
conv_2_conv_3_mish (LeakyReLU)  (None, 64, None, Non 0           conv_2_conv_3_bn[0][0]
__________________________________________________________________________________________________
conv_2_concat_1 (Concatenate)   (None, 128, None, No 0           conv_2_conv_0_mish[0][0]
                                                                 conv_2_conv_3_mish[0][0]
__________________________________________________________________________________________________
conv_2_pool_0 (MaxPooling2D)    (None, 128, None, No 0           conv_2_concat_1[0][0]
__________________________________________________________________________________________________
conv_3_conv_0 (Conv2D)          (None, 128, None, No 147456      conv_2_pool_0[0][0]
__________________________________________________________________________________________________
conv_3_conv_0_bn (BatchNormaliz (None, 128, None, No 512         conv_3_conv_0[0][0]
__________________________________________________________________________________________________
conv_3_conv_0_mish (LeakyReLU)  (None, 128, None, No 0           conv_3_conv_0_bn[0][0]
__________________________________________________________________________________________________
conv_3_split_0 (Split)          (None, 64, None, Non 0           conv_3_conv_0_mish[0][0]
__________________________________________________________________________________________________
conv_3_conv_1 (Conv2D)          (None, 64, None, Non 36864       conv_3_split_0[0][0]
__________________________________________________________________________________________________
conv_3_conv_1_bn (BatchNormaliz (None, 64, None, Non 256         conv_3_conv_1[0][0]
__________________________________________________________________________________________________
conv_3_conv_1_mish (LeakyReLU)  (None, 64, None, Non 0           conv_3_conv_1_bn[0][0]
__________________________________________________________________________________________________
conv_3_conv_2 (Conv2D)          (None, 64, None, Non 36864       conv_3_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_3_conv_2_bn (BatchNormaliz (None, 64, None, Non 256         conv_3_conv_2[0][0]
__________________________________________________________________________________________________
conv_3_conv_2_mish (LeakyReLU)  (None, 64, None, Non 0           conv_3_conv_2_bn[0][0]
__________________________________________________________________________________________________
conv_3_concat_0 (Concatenate)   (None, 128, None, No 0           conv_3_conv_2_mish[0][0]
                                                                 conv_3_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_3_conv_3 (Conv2D)          (None, 128, None, No 16384       conv_3_concat_0[0][0]
__________________________________________________________________________________________________
conv_3_conv_3_bn (BatchNormaliz (None, 128, None, No 512         conv_3_conv_3[0][0]
__________________________________________________________________________________________________
conv_3_conv_3_mish (LeakyReLU)  (None, 128, None, No 0           conv_3_conv_3_bn[0][0]
__________________________________________________________________________________________________
conv_3_concat_1 (Concatenate)   (None, 256, None, No 0           conv_3_conv_0_mish[0][0]
                                                                 conv_3_conv_3_mish[0][0]
__________________________________________________________________________________________________
conv_3_pool_0 (MaxPooling2D)    (None, 256, None, No 0           conv_3_concat_1[0][0]
__________________________________________________________________________________________________
conv_4_conv_0 (Conv2D)          (None, 256, None, No 589824      conv_3_pool_0[0][0]
__________________________________________________________________________________________________
conv_4_conv_0_bn (BatchNormaliz (None, 256, None, No 1024        conv_4_conv_0[0][0]
__________________________________________________________________________________________________
conv_4_conv_0_mish (LeakyReLU)  (None, 256, None, No 0           conv_4_conv_0_bn[0][0]
__________________________________________________________________________________________________
conv_4_split_0 (Split)          (None, 128, None, No 0           conv_4_conv_0_mish[0][0]
__________________________________________________________________________________________________
conv_4_conv_1 (Conv2D)          (None, 128, None, No 147456      conv_4_split_0[0][0]
__________________________________________________________________________________________________
conv_4_conv_1_bn (BatchNormaliz (None, 128, None, No 512         conv_4_conv_1[0][0]
__________________________________________________________________________________________________
conv_4_conv_1_mish (LeakyReLU)  (None, 128, None, No 0           conv_4_conv_1_bn[0][0]
__________________________________________________________________________________________________
conv_4_conv_2 (Conv2D)          (None, 128, None, No 147456      conv_4_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_4_conv_2_bn (BatchNormaliz (None, 128, None, No 512         conv_4_conv_2[0][0]
__________________________________________________________________________________________________
conv_4_conv_2_mish (LeakyReLU)  (None, 128, None, No 0           conv_4_conv_2_bn[0][0]
__________________________________________________________________________________________________
conv_4_concat_0 (Concatenate)   (None, 256, None, No 0           conv_4_conv_2_mish[0][0]
                                                                 conv_4_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_4_conv_3 (Conv2D)          (None, 256, None, No 65536       conv_4_concat_0[0][0]
__________________________________________________________________________________________________
conv_4_conv_3_bn (BatchNormaliz (None, 256, None, No 1024        conv_4_conv_3[0][0]
__________________________________________________________________________________________________
conv_4_conv_3_mish (LeakyReLU)  (None, 256, None, No 0           conv_4_conv_3_bn[0][0]
__________________________________________________________________________________________________
conv_4_concat_1 (Concatenate)   (None, 512, None, No 0           conv_4_conv_0_mish[0][0]
                                                                 conv_4_conv_3_mish[0][0]
__________________________________________________________________________________________________
conv_4_pool_0 (MaxPooling2D)    (None, 512, None, No 0           conv_4_concat_1[0][0]
__________________________________________________________________________________________________
conv_5 (Conv2D)                 (None, 512, None, No 2359296     conv_4_pool_0[0][0]
__________________________________________________________________________________________________
conv_5_bn (BatchNormalization)  (None, 512, None, No 2048        conv_5[0][0]
__________________________________________________________________________________________________
conv_5_mish (LeakyReLU)         (None, 512, None, No 0           conv_5_bn[0][0]
__________________________________________________________________________________________________
yolo_conv1_1 (Conv2D)           (None, 256, None, No 131072      conv_5_mish[0][0]
__________________________________________________________________________________________________
yolo_conv1_1_bn (BatchNormaliza (None, 256, None, No 1024        yolo_conv1_1[0][0]
__________________________________________________________________________________________________
yolo_conv1_1_lrelu (LeakyReLU)  (None, 256, None, No 0           yolo_conv1_1_bn[0][0]
__________________________________________________________________________________________________
yolo_conv2 (Conv2D)             (None, 128, None, No 32768       yolo_conv1_1_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv2_bn (BatchNormalizati (None, 128, None, No 512         yolo_conv2[0][0]
__________________________________________________________________________________________________
yolo_conv2_lrelu (LeakyReLU)    (None, 128, None, No 0           yolo_conv2_bn[0][0]
__________________________________________________________________________________________________
upsample0 (UpSampling2D)        (None, 128, None, No 0           yolo_conv2_lrelu[0][0]
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (None, 384, None, No 0           upsample0[0][0]
                                                                 conv_4_conv_3_mish[0][0]
__________________________________________________________________________________________________
yolo_conv1_6 (Conv2D)           (None, 512, None, No 1179648     yolo_conv1_1_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv3_6 (Conv2D)           (None, 256, None, No 884736      concatenate_2[0][0]
__________________________________________________________________________________________________
yolo_conv1_6_bn (BatchNormaliza (None, 512, None, No 2048        yolo_conv1_6[0][0]
__________________________________________________________________________________________________
yolo_conv3_6_bn (BatchNormaliza (None, 256, None, No 1024        yolo_conv3_6[0][0]
__________________________________________________________________________________________________
yolo_conv1_6_lrelu (LeakyReLU)  (None, 512, None, No 0           yolo_conv1_6_bn[0][0]
__________________________________________________________________________________________________
yolo_conv3_6_lrelu (LeakyReLU)  (None, 256, None, No 0           yolo_conv3_6_bn[0][0]
__________________________________________________________________________________________________
conv_big_object (Conv2D)        (None, 18, None, Non 9234        yolo_conv1_6_lrelu[0][0]
__________________________________________________________________________________________________
conv_mid_object (Conv2D)        (None, 18, None, Non 4626        yolo_conv3_6_lrelu[0][0]
__________________________________________________________________________________________________
bg_permute (Permute)            (None, None, None, 1 0           conv_big_object[0][0]
__________________________________________________________________________________________________
md_permute (Permute)            (None, None, None, 1 0           conv_mid_object[0][0]
__________________________________________________________________________________________________
bg_reshape (Reshape)            (None, None, 6)      0           bg_permute[0][0]
__________________________________________________________________________________________________
md_reshape (Reshape)            (None, None, 6)      0           md_permute[0][0]
__________________________________________________________________________________________________
bg_anchor (YOLOAnchorBox)       (None, None, 6)      0           conv_big_object[0][0]
__________________________________________________________________________________________________
bg_bbox_processor (BBoxPostProc (None, None, 6)      0           bg_reshape[0][0]
__________________________________________________________________________________________________
md_anchor (YOLOAnchorBox)       (None, None, 6)      0           conv_mid_object[0][0]
__________________________________________________________________________________________________
md_bbox_processor (BBoxPostProc (None, None, 6)      0           md_reshape[0][0]
__________________________________________________________________________________________________
encoded_bg (Concatenate)        (None, None, 12)     0           bg_anchor[0][0]
                                                                 bg_bbox_processor[0][0]
__________________________________________________________________________________________________
encoded_md (Concatenate)        (None, None, 12)     0           md_anchor[0][0]
                                                                 md_bbox_processor[0][0]
__________________________________________________________________________________________________
encoded_detections (Concatenate (None, None, 12)     0           encoded_bg[0][0]
                                                                 encoded_md[0][0]
==================================================================================================
Total params: 5,880,324
Trainable params: 5,874,116
Non-trainable params: 6,208
__________________________________________________________________________________________________
INFO: Starting Training Loop.
Epoch 1/80
1250/1250 [==============================] - 972s 778ms/step - loss: 18.4968
e9dcc224184e:227:246 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
e9dcc224184e:227:246 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
e9dcc224184e:227:246 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
e9dcc224184e:227:246 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
e9dcc224184e:227:246 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
e9dcc224184e:227:246 [0] NCCL INFO cudaDriverVersion 12080
NCCL version 2.15.5+cuda11.8
e9dcc224184e:227:246 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
e9dcc224184e:227:246 [0] NCCL INFO P2P plugin IBext
e9dcc224184e:227:246 [0] NCCL INFO NET/IB : No device found.
e9dcc224184e:227:246 [0] NCCL INFO NET/IB : No device found.
e9dcc224184e:227:246 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
e9dcc224184e:227:246 [0] NCCL INFO Using network Socket
e9dcc224184e:227:246 [0] NCCL INFO Channel 00/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 01/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 02/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 03/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 04/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 05/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 06/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 07/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 08/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 09/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 10/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 11/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 12/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 13/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 14/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 15/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 16/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 17/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 18/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 19/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 20/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 21/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 22/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 23/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 24/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 25/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 26/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 27/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 28/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 29/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 30/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Channel 31/32 :    0
e9dcc224184e:227:246 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
e9dcc224184e:227:246 [0] NCCL INFO Connected all rings
e9dcc224184e:227:246 [0] NCCL INFO Connected all trees
e9dcc224184e:227:246 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
e9dcc224184e:227:246 [0] NCCL INFO comm 0x7ae2dc220dc0 rank 0 nranks 1 cudaDev 0 busId 1e0 - Init COMPLETE
INFO: Training loop in progress
Epoch 2/80
1250/1250 [==============================] - 751s 601ms/step - loss: 8.2688
INFO: Training loop in progress
Epoch 3/80
1250/1250 [==============================] - 679s 543ms/step - loss: 6.5264
INFO: Training loop in progress
Epoch 4/80
1250/1250 [==============================] - 620s 496ms/step - loss: 5.7025
INFO: Training loop in progress
Epoch 5/80
1249/1250 [============================>.] - ETA: 0s - loss: 4.5735Killed
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>
Execution status: FAIL

I tried it twice, and both times the process was killed at the end of the 5th epoch.

Glad to know it is working now.

It seems the training was killed because it ran out of memory.
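
A bare "Killed" with no Python traceback usually points to the Linux out-of-memory killer ending the process because host RAM ran out, rather than a GPU memory error. To check how much host memory is free before or during training, something like the sketch below may help. It assumes a Linux host with /proc/meminfo; the function name mem_available_mib is made up for this example and is not part of TAO.

```python
import re


def mem_available_mib(meminfo_text):
    """Return the MemAvailable value from /proc/meminfo-style text, in MiB.

    Returns None if the field is missing (for example on very old kernels).
    """
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)) // 1024  # kB -> MiB


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print("Host MemAvailable (MiB):", mem_available_mib(f.read()))
    except FileNotFoundError:
        # Assumption: this only works on Linux hosts.
        print("/proc/meminfo not found (not a Linux host?)")
```

If the number keeps dropping from epoch to epoch, that would be consistent with the process being killed for host memory.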

Which dGPU are you using? Please check the output of $ nvidia-smi

Please try a lower batch size.
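
As a rough illustration of lowering the batch size, the sketch below rewrites the value in the text of a training spec. It assumes the batch size is set by a batch_size_per_gpu field inside training_config, as in typical TAO YOLOv4 spec files; the attached spec is not shown in this thread, so please verify the field name against your own file. The sample spec text and the helper set_batch_size are hypothetical.

```python
import re


def set_batch_size(spec_text, new_size):
    """Return spec_text with every batch_size_per_gpu value replaced by new_size.

    Assumption: the spec uses a 'batch_size_per_gpu: <int>' line.
    """
    pattern = r"(batch_size_per_gpu\s*:\s*)\d+"
    updated, count = re.subn(
        pattern, lambda m: m.group(1) + str(new_size), spec_text
    )
    if count == 0:
        raise ValueError("batch_size_per_gpu not found in spec")
    return updated


# Hypothetical spec fragment, only for demonstration.
sample = "training_config {\n  batch_size_per_gpu: 8\n  num_epochs: 80\n}\n"
print(set_batch_size(sample, 4))
```

Editing the spec file by hand works just as well; the point is only that the value to change is in the spec, not on the command line.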

I finished training with batch-size = 4.
Thank you, @Morganh
