I had probably used the unpruned_v2.0 model and left it there. After downloading the v2.1 model again, the training launched successfully.
However, it fails after the 5th epoch. Here is the log:
root@e9dcc224184e:/workspace/tao-experiments# yolo_v4_tiny train -e specs/yolo_v4_tiny_train_kitti.txt -r yolo_v4_tiny/experiment_dir_unpruned --gpus 1 --key nvidia_tlt
Using TensorFlow backend.
2025-08-11 11:43:49.398169: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
INFO: Log file already exists at /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/status.json
INFO: Starting Yolo_V4 Training job
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.
WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.
WARNING: From /usr/local/lib/python3.6/dist-packages/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.
WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.
INFO: Serial augmentation enabled = False
INFO: Pseudo sharding enabled = False
INFO: Max Image Dimensions (all sources): (0, 0)
INFO: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: -1
INFO: total dataset size 10000, number of sources: 1, batch size per gpu: 20, steps: 500
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
WARNING: From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
WARNING:tensorflow:Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae714042710>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae714042710>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae714042710>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae714042710>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
INFO: shuffle: True - shard 0 of 1
INFO: sampling 1 datasets with weights:
INFO: source: 0 weight: 1.000000
WARNING:tensorflow:Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae6e858c438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae6e858c438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae6e858c438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae6e858c438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.
WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.
WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.
WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
WARNING: From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.
WARNING: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.
INFO: Serial augmentation enabled = False
INFO: Pseudo sharding enabled = False
INFO: Max Image Dimensions (all sources): (0, 0)
INFO: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: -1
INFO: total dataset size 1591, number of sources: 1, batch size per gpu: 8, steps: 199
WARNING:tensorflow:Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae615fb5470>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae615fb5470>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae615fb5470>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method YOLOv3TFRecordsParser.__call__ of <iva.yolo_v3.data_loader.yolo_v3_data_loader.YOLOv3TFRecordsParser object at 0x7ae615fb5470>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
INFO: shuffle: False - shard 0 of 1
INFO: sampling 1 datasets with weights:
INFO: source: 0 weight: 1.000000
WARNING:tensorflow:Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae615dd8b00>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae615dd8b00>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING: Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae615dd8b00>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ae615dd8b00>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO: Log file already exists at /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/status.json
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
Input (InputLayer) (None, 3, None, None 0
__________________________________________________________________________________________________
conv_0 (Conv2D) (None, 32, None, Non 864 Input[0][0]
__________________________________________________________________________________________________
conv_0_bn (BatchNormalization) (None, 32, None, Non 128 conv_0[0][0]
__________________________________________________________________________________________________
conv_0_mish (LeakyReLU) (None, 32, None, Non 0 conv_0_bn[0][0]
__________________________________________________________________________________________________
conv_1 (Conv2D) (None, 64, None, Non 18432 conv_0_mish[0][0]
__________________________________________________________________________________________________
conv_1_bn (BatchNormalization) (None, 64, None, Non 256 conv_1[0][0]
__________________________________________________________________________________________________
conv_1_mish (LeakyReLU) (None, 64, None, Non 0 conv_1_bn[0][0]
__________________________________________________________________________________________________
conv_2_conv_0 (Conv2D) (None, 64, None, Non 36864 conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_2_conv_0_bn (BatchNormaliz (None, 64, None, Non 256 conv_2_conv_0[0][0]
__________________________________________________________________________________________________
conv_2_conv_0_mish (LeakyReLU) (None, 64, None, Non 0 conv_2_conv_0_bn[0][0]
__________________________________________________________________________________________________
conv_2_split_0 (Split) (None, 32, None, Non 0 conv_2_conv_0_mish[0][0]
__________________________________________________________________________________________________
conv_2_conv_1 (Conv2D) (None, 32, None, Non 9216 conv_2_split_0[0][0]
__________________________________________________________________________________________________
conv_2_conv_1_bn (BatchNormaliz (None, 32, None, Non 128 conv_2_conv_1[0][0]
__________________________________________________________________________________________________
conv_2_conv_1_mish (LeakyReLU) (None, 32, None, Non 0 conv_2_conv_1_bn[0][0]
__________________________________________________________________________________________________
conv_2_conv_2 (Conv2D) (None, 32, None, Non 9216 conv_2_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_2_conv_2_bn (BatchNormaliz (None, 32, None, Non 128 conv_2_conv_2[0][0]
__________________________________________________________________________________________________
conv_2_conv_2_mish (LeakyReLU) (None, 32, None, Non 0 conv_2_conv_2_bn[0][0]
__________________________________________________________________________________________________
conv_2_concat_0 (Concatenate) (None, 64, None, Non 0 conv_2_conv_2_mish[0][0]
conv_2_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_2_conv_3 (Conv2D) (None, 64, None, Non 4096 conv_2_concat_0[0][0]
__________________________________________________________________________________________________
conv_2_conv_3_bn (BatchNormaliz (None, 64, None, Non 256 conv_2_conv_3[0][0]
__________________________________________________________________________________________________
conv_2_conv_3_mish (LeakyReLU) (None, 64, None, Non 0 conv_2_conv_3_bn[0][0]
__________________________________________________________________________________________________
conv_2_concat_1 (Concatenate) (None, 128, None, No 0 conv_2_conv_0_mish[0][0]
conv_2_conv_3_mish[0][0]
__________________________________________________________________________________________________
conv_2_pool_0 (MaxPooling2D) (None, 128, None, No 0 conv_2_concat_1[0][0]
__________________________________________________________________________________________________
conv_3_conv_0 (Conv2D) (None, 128, None, No 147456 conv_2_pool_0[0][0]
__________________________________________________________________________________________________
conv_3_conv_0_bn (BatchNormaliz (None, 128, None, No 512 conv_3_conv_0[0][0]
__________________________________________________________________________________________________
conv_3_conv_0_mish (LeakyReLU) (None, 128, None, No 0 conv_3_conv_0_bn[0][0]
__________________________________________________________________________________________________
conv_3_split_0 (Split) (None, 64, None, Non 0 conv_3_conv_0_mish[0][0]
__________________________________________________________________________________________________
conv_3_conv_1 (Conv2D) (None, 64, None, Non 36864 conv_3_split_0[0][0]
__________________________________________________________________________________________________
conv_3_conv_1_bn (BatchNormaliz (None, 64, None, Non 256 conv_3_conv_1[0][0]
__________________________________________________________________________________________________
conv_3_conv_1_mish (LeakyReLU) (None, 64, None, Non 0 conv_3_conv_1_bn[0][0]
__________________________________________________________________________________________________
conv_3_conv_2 (Conv2D) (None, 64, None, Non 36864 conv_3_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_3_conv_2_bn (BatchNormaliz (None, 64, None, Non 256 conv_3_conv_2[0][0]
__________________________________________________________________________________________________
conv_3_conv_2_mish (LeakyReLU) (None, 64, None, Non 0 conv_3_conv_2_bn[0][0]
__________________________________________________________________________________________________
conv_3_concat_0 (Concatenate) (None, 128, None, No 0 conv_3_conv_2_mish[0][0]
conv_3_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_3_conv_3 (Conv2D) (None, 128, None, No 16384 conv_3_concat_0[0][0]
__________________________________________________________________________________________________
conv_3_conv_3_bn (BatchNormaliz (None, 128, None, No 512 conv_3_conv_3[0][0]
__________________________________________________________________________________________________
conv_3_conv_3_mish (LeakyReLU) (None, 128, None, No 0 conv_3_conv_3_bn[0][0]
__________________________________________________________________________________________________
conv_3_concat_1 (Concatenate) (None, 256, None, No 0 conv_3_conv_0_mish[0][0]
conv_3_conv_3_mish[0][0]
__________________________________________________________________________________________________
conv_3_pool_0 (MaxPooling2D) (None, 256, None, No 0 conv_3_concat_1[0][0]
__________________________________________________________________________________________________
conv_4_conv_0 (Conv2D) (None, 256, None, No 589824 conv_3_pool_0[0][0]
__________________________________________________________________________________________________
conv_4_conv_0_bn (BatchNormaliz (None, 256, None, No 1024 conv_4_conv_0[0][0]
__________________________________________________________________________________________________
conv_4_conv_0_mish (LeakyReLU) (None, 256, None, No 0 conv_4_conv_0_bn[0][0]
__________________________________________________________________________________________________
conv_4_split_0 (Split) (None, 128, None, No 0 conv_4_conv_0_mish[0][0]
__________________________________________________________________________________________________
conv_4_conv_1 (Conv2D) (None, 128, None, No 147456 conv_4_split_0[0][0]
__________________________________________________________________________________________________
conv_4_conv_1_bn (BatchNormaliz (None, 128, None, No 512 conv_4_conv_1[0][0]
__________________________________________________________________________________________________
conv_4_conv_1_mish (LeakyReLU) (None, 128, None, No 0 conv_4_conv_1_bn[0][0]
__________________________________________________________________________________________________
conv_4_conv_2 (Conv2D) (None, 128, None, No 147456 conv_4_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_4_conv_2_bn (BatchNormaliz (None, 128, None, No 512 conv_4_conv_2[0][0]
__________________________________________________________________________________________________
conv_4_conv_2_mish (LeakyReLU) (None, 128, None, No 0 conv_4_conv_2_bn[0][0]
__________________________________________________________________________________________________
conv_4_concat_0 (Concatenate) (None, 256, None, No 0 conv_4_conv_2_mish[0][0]
conv_4_conv_1_mish[0][0]
__________________________________________________________________________________________________
conv_4_conv_3 (Conv2D) (None, 256, None, No 65536 conv_4_concat_0[0][0]
__________________________________________________________________________________________________
conv_4_conv_3_bn (BatchNormaliz (None, 256, None, No 1024 conv_4_conv_3[0][0]
__________________________________________________________________________________________________
conv_4_conv_3_mish (LeakyReLU) (None, 256, None, No 0 conv_4_conv_3_bn[0][0]
__________________________________________________________________________________________________
conv_4_concat_1 (Concatenate) (None, 512, None, No 0 conv_4_conv_0_mish[0][0]
conv_4_conv_3_mish[0][0]
__________________________________________________________________________________________________
conv_4_pool_0 (MaxPooling2D) (None, 512, None, No 0 conv_4_concat_1[0][0]
__________________________________________________________________________________________________
conv_5 (Conv2D) (None, 512, None, No 2359296 conv_4_pool_0[0][0]
__________________________________________________________________________________________________
conv_5_bn (BatchNormalization) (None, 512, None, No 2048 conv_5[0][0]
__________________________________________________________________________________________________
conv_5_mish (LeakyReLU) (None, 512, None, No 0 conv_5_bn[0][0]
__________________________________________________________________________________________________
yolo_conv1_1 (Conv2D) (None, 256, None, No 131072 conv_5_mish[0][0]
__________________________________________________________________________________________________
yolo_conv1_1_bn (BatchNormaliza (None, 256, None, No 1024 yolo_conv1_1[0][0]
__________________________________________________________________________________________________
yolo_conv1_1_lrelu (LeakyReLU) (None, 256, None, No 0 yolo_conv1_1_bn[0][0]
__________________________________________________________________________________________________
yolo_conv2 (Conv2D) (None, 128, None, No 32768 yolo_conv1_1_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv2_bn (BatchNormalizati (None, 128, None, No 512 yolo_conv2[0][0]
__________________________________________________________________________________________________
yolo_conv2_lrelu (LeakyReLU) (None, 128, None, No 0 yolo_conv2_bn[0][0]
__________________________________________________________________________________________________
upsample0 (UpSampling2D) (None, 128, None, No 0 yolo_conv2_lrelu[0][0]
__________________________________________________________________________________________________
concatenate_2 (Concatenate) (None, 384, None, No 0 upsample0[0][0]
conv_4_conv_3_mish[0][0]
__________________________________________________________________________________________________
yolo_conv1_6 (Conv2D) (None, 512, None, No 1179648 yolo_conv1_1_lrelu[0][0]
__________________________________________________________________________________________________
yolo_conv3_6 (Conv2D) (None, 256, None, No 884736 concatenate_2[0][0]
__________________________________________________________________________________________________
yolo_conv1_6_bn (BatchNormaliza (None, 512, None, No 2048 yolo_conv1_6[0][0]
__________________________________________________________________________________________________
yolo_conv3_6_bn (BatchNormaliza (None, 256, None, No 1024 yolo_conv3_6[0][0]
__________________________________________________________________________________________________
yolo_conv1_6_lrelu (LeakyReLU) (None, 512, None, No 0 yolo_conv1_6_bn[0][0]
__________________________________________________________________________________________________
yolo_conv3_6_lrelu (LeakyReLU) (None, 256, None, No 0 yolo_conv3_6_bn[0][0]
__________________________________________________________________________________________________
conv_big_object (Conv2D) (None, 18, None, Non 9234 yolo_conv1_6_lrelu[0][0]
__________________________________________________________________________________________________
conv_mid_object (Conv2D) (None, 18, None, Non 4626 yolo_conv3_6_lrelu[0][0]
__________________________________________________________________________________________________
bg_permute (Permute) (None, None, None, 1 0 conv_big_object[0][0]
__________________________________________________________________________________________________
md_permute (Permute) (None, None, None, 1 0 conv_mid_object[0][0]
__________________________________________________________________________________________________
bg_reshape (Reshape) (None, None, 6) 0 bg_permute[0][0]
__________________________________________________________________________________________________
md_reshape (Reshape) (None, None, 6) 0 md_permute[0][0]
__________________________________________________________________________________________________
bg_anchor (YOLOAnchorBox) (None, None, 6) 0 conv_big_object[0][0]
__________________________________________________________________________________________________
bg_bbox_processor (BBoxPostProc (None, None, 6) 0 bg_reshape[0][0]
__________________________________________________________________________________________________
md_anchor (YOLOAnchorBox) (None, None, 6) 0 conv_mid_object[0][0]
__________________________________________________________________________________________________
md_bbox_processor (BBoxPostProc (None, None, 6) 0 md_reshape[0][0]
__________________________________________________________________________________________________
encoded_bg (Concatenate) (None, None, 12) 0 bg_anchor[0][0]
bg_bbox_processor[0][0]
__________________________________________________________________________________________________
encoded_md (Concatenate) (None, None, 12) 0 md_anchor[0][0]
md_bbox_processor[0][0]
__________________________________________________________________________________________________
encoded_detections (Concatenate (None, None, 12) 0 encoded_bg[0][0]
encoded_md[0][0]
==================================================================================================
Total params: 5,880,324
Trainable params: 5,874,116
Non-trainable params: 6,208
__________________________________________________________________________________________________
INFO: Starting Training Loop.
Epoch 1/80
1250/1250 [==============================] - 972s 778ms/step - loss: 18.4968
e9dcc224184e:227:246 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
e9dcc224184e:227:246 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
e9dcc224184e:227:246 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
e9dcc224184e:227:246 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
e9dcc224184e:227:246 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
e9dcc224184e:227:246 [0] NCCL INFO cudaDriverVersion 12080
NCCL version 2.15.5+cuda11.8
e9dcc224184e:227:246 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
e9dcc224184e:227:246 [0] NCCL INFO P2P plugin IBext
e9dcc224184e:227:246 [0] NCCL INFO NET/IB : No device found.
e9dcc224184e:227:246 [0] NCCL INFO NET/IB : No device found.
e9dcc224184e:227:246 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
e9dcc224184e:227:246 [0] NCCL INFO Using network Socket
e9dcc224184e:227:246 [0] NCCL INFO Channel 00/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 01/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 02/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 03/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 04/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 05/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 06/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 07/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 08/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 09/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 10/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 11/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 12/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 13/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 14/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 15/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 16/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 17/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 18/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 19/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 20/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 21/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 22/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 23/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 24/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 25/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 26/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 27/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 28/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 29/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 30/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Channel 31/32 : 0
e9dcc224184e:227:246 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
e9dcc224184e:227:246 [0] NCCL INFO Connected all rings
e9dcc224184e:227:246 [0] NCCL INFO Connected all trees
e9dcc224184e:227:246 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
e9dcc224184e:227:246 [0] NCCL INFO comm 0x7ae2dc220dc0 rank 0 nranks 1 cudaDev 0 busId 1e0 - Init COMPLETE
INFO: Training loop in progress
Epoch 2/80
1250/1250 [==============================] - 751s 601ms/step - loss: 8.2688
INFO: Training loop in progress
Epoch 3/80
1250/1250 [==============================] - 679s 543ms/step - loss: 6.5264
INFO: Training loop in progress
Epoch 4/80
1250/1250 [==============================] - 620s 496ms/step - loss: 5.7025
INFO: Training loop in progress
Epoch 5/80
1249/1250 [============================>.] - ETA: 0s - loss: 4.5735Killed
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>
Execution status: FAIL
I tried it twice, and it failed after the 5th epoch both times.
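
For context on what I suspect is happening: the bare "Killed" at the end of epoch 5, with no Python traceback, usually means the process received SIGKILL from outside, most often from the kernel OOM killer when system RAM is exhausted, and memory pressure tends to peak when the end-of-epoch validation/checkpoint pass runs on top of training. A minimal check, assuming access to the host kernel log (exact message wording varies by distro), would be to run something like this right after the failure:

# check the kernel log for an OOM kill of the training process
dmesg -T | grep -i -E "out of memory|killed process"

If that confirms an OOM kill, lowering batch_size_per_gpu in the training spec (it is 20 here) is a common first mitigation, though I have not verified yet that memory is actually the cause.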