Yolo_v4 getting stuck while training

• Hardware: T4
• Network Type: Yolo_v4
• TLT Version :
Configuration of the TLT Instance
dockers: ['nvidia/tlt-streamanalytics', 'nvidia/tlt-pytorch']
format_version: 1.0
tlt_version: 3.0
published_date: 04/16/2021
• Training spec file:
random_seed: 42
yolov4_config {
big_anchor_shape: "[(90.05, 188.05),(165.34, 131.00),(235.00, 278.53)]"
mid_anchor_shape: "[(76.00, 52.00),(46.50, 113.00),(118.00, 69.00)]"
small_anchor_shape: "[(28.00, 19.00),(54.02, 33.00),(29.17, 68.00)]"
box_matching_iou: 0.5
arch: "resnet"
nlayers: 18
arch_conv_blocks: 2
loss_loc_weight: 0.8
loss_neg_obj_weights: 100.0
loss_class_weights: 0.5
label_smoothing: 0.0
big_grid_xy_extend: 0.05
mid_grid_xy_extend: 0.1
small_grid_xy_extend: 0.2
freeze_bn: false
#freeze_blocks: 0
force_relu: false
}
training_config {
batch_size_per_gpu: 24
num_epochs: 80
enable_qat: false
checkpoint_interval: 10
learning_rate {
soft_start_cosine_annealing_schedule {
min_learning_rate: 1e-7
max_learning_rate: 1e-4
soft_start: 0.3
}
}
regularizer {
type: L1
weight: 3e-5
}
optimizer {
adam {
epsilon: 1e-7
beta1: 0.9
beta2: 0.999
amsgrad: false
}
}
pretrain_model_path: "/workspace/tlt-experiments/yolo_v4/pretrained_resnet18/tlt_pretrained_object_detection_vresnet18/resnet_18.hdf5"
}
eval_config {
average_precision_mode: SAMPLE
batch_size: 8
matching_iou_threshold: 0.5
}
nms_config {
confidence_threshold: 0.001
clustering_iou_threshold: 0.6
top_k: 200
}
augmentation_config {
hue: 0.1
saturation: 1.5
exposure: 1.5
vertical_flip: 0
horizontal_flip: 0.5
jitter: 0.3
output_width: 960
output_height: 544
output_channel: 3
randomize_input_shape_period: 0
mosaic_prob: 0.5
mosaic_min_ratio: 0.2
}
dataset_config {
data_sources: {
label_directory_path: "/workspace/tlt-experiments/data/training/labels"
image_directory_path: "/workspace/tlt-experiments/data/training/images"
}
include_difficult_in_training: true
target_class_mapping {
key: "car"
value: "car"
}
target_class_mapping {
key: "pedestrian"
value: "pedestrian"
}
target_class_mapping {
key: "two_wheels"
value: "bike"
}
target_class_mapping {
key: "person"
value: "pedestrian"
}
validation_data_sources: {
label_directory_path: "/workspace/tlt-experiments/data/val/label"
image_directory_path: "/workspace/tlt-experiments/data/val/image"
}
}

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
2021-07-29 16:48:15,927 [INFO] root: Registry: ['nvcr.io']
2021-07-29 16:48:16,273 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

2021-07-29 16:48:24,935 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2021-07-29 16:48:24,935 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py:52: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

2021-07-29 16:48:25,075 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py:52: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py:55: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2021-07-29 16:48:25,076 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py:55: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

2021-07-29 16:48:25,587 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

2021-07-29 16:48:25,590 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

2021-07-29 16:48:25,614 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

2021-07-29 16:48:26,246 [WARNING] tensorflow: From /opt/nvidia/third_party/keras/tensorflow_backend.py:183: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

2021-07-29 16:48:26,508 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

2021-07-29 16:48:29,049 [WARNING] tensorflow: From /opt/nvidia/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

2021-07-29 16:48:30,004 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

2021-07-29 16:48:30,005 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

2021-07-29 16:48:30,842 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

2021-07-29 16:48:31,858 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3295: The name tf.log is deprecated. Please use tf.math.log instead.

2021-07-29 16:48:31,862 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3295: The name tf.log is deprecated. Please use tf.math.log instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

2021-07-29 16:48:32,650 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

2021-07-29 16:48:32,826 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.


Layer (type) Output Shape Param # Connected to

Input (InputLayer) (None, 3, 544, 960) 0


conv1 (Conv2D) (None, 64, 272, 480) 9408 Input[0][0]


bn_conv1 (BatchNormalization) (None, 64, 272, 480) 256 conv1[0][0]


activation_2 (Activation) (None, 64, 272, 480) 0 bn_conv1[0][0]


block_1a_conv_1 (Conv2D) (None, 64, 136, 240) 36864 activation_2[0][0]


block_1a_bn_1 (BatchNormalizati (None, 64, 136, 240) 256 block_1a_conv_1[0][0]


block_1a_relu_1 (Activation) (None, 64, 136, 240) 0 block_1a_bn_1[0][0]


block_1a_conv_2 (Conv2D) (None, 64, 136, 240) 36864 block_1a_relu_1[0][0]


block_1a_conv_shortcut (Conv2D) (None, 64, 136, 240) 4096 activation_2[0][0]


block_1a_bn_2 (BatchNormalizati (None, 64, 136, 240) 256 block_1a_conv_2[0][0]


block_1a_bn_shortcut (BatchNorm (None, 64, 136, 240) 256 block_1a_conv_shortcut[0][0]


add_9 (Add) (None, 64, 136, 240) 0 block_1a_bn_2[0][0]
block_1a_bn_shortcut[0][0]


block_1a_relu (Activation) (None, 64, 136, 240) 0 add_9[0][0]


block_1b_conv_1 (Conv2D) (None, 64, 136, 240) 36864 block_1a_relu[0][0]


block_1b_bn_1 (BatchNormalizati (None, 64, 136, 240) 256 block_1b_conv_1[0][0]


block_1b_relu_1 (Activation) (None, 64, 136, 240) 0 block_1b_bn_1[0][0]


block_1b_conv_2 (Conv2D) (None, 64, 136, 240) 36864 block_1b_relu_1[0][0]


block_1b_conv_shortcut (Conv2D) (None, 64, 136, 240) 4096 block_1a_relu[0][0]


block_1b_bn_2 (BatchNormalizati (None, 64, 136, 240) 256 block_1b_conv_2[0][0]


block_1b_bn_shortcut (BatchNorm (None, 64, 136, 240) 256 block_1b_conv_shortcut[0][0]


add_10 (Add) (None, 64, 136, 240) 0 block_1b_bn_2[0][0]
block_1b_bn_shortcut[0][0]


block_1b_relu (Activation) (None, 64, 136, 240) 0 add_10[0][0]


block_2a_conv_1 (Conv2D) (None, 128, 68, 120) 73728 block_1b_relu[0][0]


block_2a_bn_1 (BatchNormalizati (None, 128, 68, 120) 512 block_2a_conv_1[0][0]


block_2a_relu_1 (Activation) (None, 128, 68, 120) 0 block_2a_bn_1[0][0]


block_2a_conv_2 (Conv2D) (None, 128, 68, 120) 147456 block_2a_relu_1[0][0]


block_2a_conv_shortcut (Conv2D) (None, 128, 68, 120) 8192 block_1b_relu[0][0]


block_2a_bn_2 (BatchNormalizati (None, 128, 68, 120) 512 block_2a_conv_2[0][0]


block_2a_bn_shortcut (BatchNorm (None, 128, 68, 120) 512 block_2a_conv_shortcut[0][0]


add_11 (Add) (None, 128, 68, 120) 0 block_2a_bn_2[0][0]
block_2a_bn_shortcut[0][0]


block_2a_relu (Activation) (None, 128, 68, 120) 0 add_11[0][0]


block_2b_conv_1 (Conv2D) (None, 128, 68, 120) 147456 block_2a_relu[0][0]


block_2b_bn_1 (BatchNormalizati (None, 128, 68, 120) 512 block_2b_conv_1[0][0]


block_2b_relu_1 (Activation) (None, 128, 68, 120) 0 block_2b_bn_1[0][0]


block_2b_conv_2 (Conv2D) (None, 128, 68, 120) 147456 block_2b_relu_1[0][0]


block_2b_conv_shortcut (Conv2D) (None, 128, 68, 120) 16384 block_2a_relu[0][0]


block_2b_bn_2 (BatchNormalizati (None, 128, 68, 120) 512 block_2b_conv_2[0][0]


block_2b_bn_shortcut (BatchNorm (None, 128, 68, 120) 512 block_2b_conv_shortcut[0][0]


add_12 (Add) (None, 128, 68, 120) 0 block_2b_bn_2[0][0]
block_2b_bn_shortcut[0][0]


block_2b_relu (Activation) (None, 128, 68, 120) 0 add_12[0][0]


block_3a_conv_1 (Conv2D) (None, 256, 34, 60) 294912 block_2b_relu[0][0]


block_3a_bn_1 (BatchNormalizati (None, 256, 34, 60) 1024 block_3a_conv_1[0][0]


block_3a_relu_1 (Activation) (None, 256, 34, 60) 0 block_3a_bn_1[0][0]


block_3a_conv_2 (Conv2D) (None, 256, 34, 60) 589824 block_3a_relu_1[0][0]


block_3a_conv_shortcut (Conv2D) (None, 256, 34, 60) 32768 block_2b_relu[0][0]


block_3a_bn_2 (BatchNormalizati (None, 256, 34, 60) 1024 block_3a_conv_2[0][0]


block_3a_bn_shortcut (BatchNorm (None, 256, 34, 60) 1024 block_3a_conv_shortcut[0][0]


add_13 (Add) (None, 256, 34, 60) 0 block_3a_bn_2[0][0]
block_3a_bn_shortcut[0][0]


block_3a_relu (Activation) (None, 256, 34, 60) 0 add_13[0][0]


block_3b_conv_1 (Conv2D) (None, 256, 34, 60) 589824 block_3a_relu[0][0]


block_3b_bn_1 (BatchNormalizati (None, 256, 34, 60) 1024 block_3b_conv_1[0][0]


block_3b_relu_1 (Activation) (None, 256, 34, 60) 0 block_3b_bn_1[0][0]


block_3b_conv_2 (Conv2D) (None, 256, 34, 60) 589824 block_3b_relu_1[0][0]


block_3b_conv_shortcut (Conv2D) (None, 256, 34, 60) 65536 block_3a_relu[0][0]


block_3b_bn_2 (BatchNormalizati (None, 256, 34, 60) 1024 block_3b_conv_2[0][0]


block_3b_bn_shortcut (BatchNorm (None, 256, 34, 60) 1024 block_3b_conv_shortcut[0][0]


add_14 (Add) (None, 256, 34, 60) 0 block_3b_bn_2[0][0]
block_3b_bn_shortcut[0][0]


block_3b_relu (Activation) (None, 256, 34, 60) 0 add_14[0][0]


block_4a_conv_1 (Conv2D) (None, 512, 34, 60) 1179648 block_3b_relu[0][0]


block_4a_bn_1 (BatchNormalizati (None, 512, 34, 60) 2048 block_4a_conv_1[0][0]


block_4a_relu_1 (Activation) (None, 512, 34, 60) 0 block_4a_bn_1[0][0]


block_4a_conv_2 (Conv2D) (None, 512, 34, 60) 2359296 block_4a_relu_1[0][0]


block_4a_conv_shortcut (Conv2D) (None, 512, 34, 60) 131072 block_3b_relu[0][0]


block_4a_bn_2 (BatchNormalizati (None, 512, 34, 60) 2048 block_4a_conv_2[0][0]


block_4a_bn_shortcut (BatchNorm (None, 512, 34, 60) 2048 block_4a_conv_shortcut[0][0]


add_15 (Add) (None, 512, 34, 60) 0 block_4a_bn_2[0][0]
block_4a_bn_shortcut[0][0]


block_4a_relu (Activation) (None, 512, 34, 60) 0 add_15[0][0]


block_4b_conv_1 (Conv2D) (None, 512, 34, 60) 2359296 block_4a_relu[0][0]


block_4b_bn_1 (BatchNormalizati (None, 512, 34, 60) 2048 block_4b_conv_1[0][0]


block_4b_relu_1 (Activation) (None, 512, 34, 60) 0 block_4b_bn_1[0][0]


block_4b_conv_2 (Conv2D) (None, 512, 34, 60) 2359296 block_4b_relu_1[0][0]


block_4b_conv_shortcut (Conv2D) (None, 512, 34, 60) 262144 block_4a_relu[0][0]


block_4b_bn_2 (BatchNormalizati (None, 512, 34, 60) 2048 block_4b_conv_2[0][0]


block_4b_bn_shortcut (BatchNorm (None, 512, 34, 60) 2048 block_4b_conv_shortcut[0][0]


add_16 (Add) (None, 512, 34, 60) 0 block_4b_bn_2[0][0]
block_4b_bn_shortcut[0][0]


block_4b_relu (Activation) (None, 512, 34, 60) 0 add_16[0][0]


yolo_spp_pool_1 (MaxPooling2D) (None, 512, 34, 60) 0 block_4b_relu[0][0]


yolo_spp_pool_2 (MaxPooling2D) (None, 512, 34, 60) 0 block_4b_relu[0][0]


yolo_spp_pool_3 (MaxPooling2D) (None, 512, 34, 60) 0 block_4b_relu[0][0]


yolo_spp_concat (Concatenate) (None, 2048, 34, 60) 0 yolo_spp_pool_1[0][0]
yolo_spp_pool_2[0][0]
yolo_spp_pool_3[0][0]
block_4b_relu[0][0]


yolo_spp_conv (Conv2D) (None, 512, 34, 60) 1048576 yolo_spp_concat[0][0]


yolo_spp_conv_bn (BatchNormaliz (None, 512, 34, 60) 2048 yolo_spp_conv[0][0]


yolo_spp_conv_lrelu (LeakyReLU) (None, 512, 34, 60) 0 yolo_spp_conv_bn[0][0]


yolo_expand_conv1 (Conv2D) (None, 512, 17, 30) 2359296 yolo_spp_conv_lrelu[0][0]


yolo_expand_conv1_bn (BatchNorm (None, 512, 17, 30) 2048 yolo_expand_conv1[0][0]


yolo_expand_conv1_lrelu (LeakyR (None, 512, 17, 30) 0 yolo_expand_conv1_bn[0][0]


yolo_conv1_1 (Conv2D) (None, 256, 17, 30) 131072 yolo_expand_conv1_lrelu[0][0]


yolo_conv1_1_bn (BatchNormaliza (None, 256, 17, 30) 1024 yolo_conv1_1[0][0]


yolo_conv1_1_lrelu (LeakyReLU) (None, 256, 17, 30) 0 yolo_conv1_1_bn[0][0]


yolo_conv1_2 (Conv2D) (None, 512, 17, 30) 1179648 yolo_conv1_1_lrelu[0][0]


yolo_conv1_2_bn (BatchNormaliza (None, 512, 17, 30) 2048 yolo_conv1_2[0][0]


yolo_conv1_2_lrelu (LeakyReLU) (None, 512, 17, 30) 0 yolo_conv1_2_bn[0][0]


yolo_conv1_3 (Conv2D) (None, 256, 17, 30) 131072 yolo_conv1_2_lrelu[0][0]


yolo_conv1_3_bn (BatchNormaliza (None, 256, 17, 30) 1024 yolo_conv1_3[0][0]


yolo_conv1_3_lrelu (LeakyReLU) (None, 256, 17, 30) 0 yolo_conv1_3_bn[0][0]


yolo_conv1_4 (Conv2D) (None, 512, 17, 30) 1179648 yolo_conv1_3_lrelu[0][0]


yolo_conv1_4_bn (BatchNormaliza (None, 512, 17, 30) 2048 yolo_conv1_4[0][0]


yolo_conv1_4_lrelu (LeakyReLU) (None, 512, 17, 30) 0 yolo_conv1_4_bn[0][0]


yolo_conv1_5 (Conv2D) (None, 256, 17, 30) 131072 yolo_conv1_4_lrelu[0][0]


yolo_conv1_5_bn (BatchNormaliza (None, 256, 17, 30) 1024 yolo_conv1_5[0][0]


yolo_conv1_5_lrelu (LeakyReLU) (None, 256, 17, 30) 0 yolo_conv1_5_bn[0][0]


yolo_conv2 (Conv2D) (None, 128, 17, 30) 32768 yolo_conv1_5_lrelu[0][0]


yolo_conv2_bn (BatchNormalizati (None, 128, 17, 30) 512 yolo_conv2[0][0]


yolo_conv2_lrelu (LeakyReLU) (None, 128, 17, 30) 0 yolo_conv2_bn[0][0]


upsample0 (UpSampling2D) (None, 128, 34, 60) 0 yolo_conv2_lrelu[0][0]


concatenate_3 (Concatenate) (None, 384, 34, 60) 0 upsample0[0][0]
block_3b_relu[0][0]


yolo_conv3_1 (Conv2D) (None, 128, 34, 60) 49152 concatenate_3[0][0]


yolo_conv3_1_bn (BatchNormaliza (None, 128, 34, 60) 512 yolo_conv3_1[0][0]


yolo_conv3_1_lrelu (LeakyReLU) (None, 128, 34, 60) 0 yolo_conv3_1_bn[0][0]


yolo_conv3_2 (Conv2D) (None, 256, 34, 60) 294912 yolo_conv3_1_lrelu[0][0]


yolo_conv3_2_bn (BatchNormaliza (None, 256, 34, 60) 1024 yolo_conv3_2[0][0]


yolo_conv3_2_lrelu (LeakyReLU) (None, 256, 34, 60) 0 yolo_conv3_2_bn[0][0]


yolo_conv3_3 (Conv2D) (None, 128, 34, 60) 32768 yolo_conv3_2_lrelu[0][0]


yolo_conv3_3_bn (BatchNormaliza (None, 128, 34, 60) 512 yolo_conv3_3[0][0]


yolo_conv3_3_lrelu (LeakyReLU) (None, 128, 34, 60) 0 yolo_conv3_3_bn[0][0]


yolo_conv3_4 (Conv2D) (None, 256, 34, 60) 294912 yolo_conv3_3_lrelu[0][0]


yolo_conv3_4_bn (BatchNormaliza (None, 256, 34, 60) 1024 yolo_conv3_4[0][0]


yolo_conv3_4_lrelu (LeakyReLU) (None, 256, 34, 60) 0 yolo_conv3_4_bn[0][0]


yolo_conv3_5 (Conv2D) (None, 128, 34, 60) 32768 yolo_conv3_4_lrelu[0][0]


yolo_conv3_5_bn (BatchNormaliza (None, 128, 34, 60) 512 yolo_conv3_5[0][0]


yolo_conv3_5_lrelu (LeakyReLU) (None, 128, 34, 60) 0 yolo_conv3_5_bn[0][0]


yolo_conv4 (Conv2D) (None, 64, 34, 60) 8192 yolo_conv3_5_lrelu[0][0]


yolo_conv4_bn (BatchNormalizati (None, 64, 34, 60) 256 yolo_conv4[0][0]


yolo_conv4_lrelu (LeakyReLU) (None, 64, 34, 60) 0 yolo_conv4_bn[0][0]


upsample1 (UpSampling2D) (None, 64, 68, 120) 0 yolo_conv4_lrelu[0][0]


concatenate_4 (Concatenate) (None, 192, 68, 120) 0 upsample1[0][0]
block_2b_relu[0][0]


yolo_conv5_1 (Conv2D) (None, 64, 68, 120) 12288 concatenate_4[0][0]


yolo_conv5_1_bn (BatchNormaliza (None, 64, 68, 120) 256 yolo_conv5_1[0][0]


yolo_conv5_1_lrelu (LeakyReLU) (None, 64, 68, 120) 0 yolo_conv5_1_bn[0][0]


yolo_conv5_2 (Conv2D) (None, 128, 68, 120) 73728 yolo_conv5_1_lrelu[0][0]


yolo_conv5_2_bn (BatchNormaliza (None, 128, 68, 120) 512 yolo_conv5_2[0][0]


yolo_conv5_2_lrelu (LeakyReLU) (None, 128, 68, 120) 0 yolo_conv5_2_bn[0][0]


yolo_conv5_3 (Conv2D) (None, 64, 68, 120) 8192 yolo_conv5_2_lrelu[0][0]


yolo_conv5_3_bn (BatchNormaliza (None, 64, 68, 120) 256 yolo_conv5_3[0][0]


yolo_conv5_3_lrelu (LeakyReLU) (None, 64, 68, 120) 0 yolo_conv5_3_bn[0][0]


yolo_conv5_4 (Conv2D) (None, 128, 68, 120) 73728 yolo_conv5_3_lrelu[0][0]


yolo_conv5_4_bn (BatchNormaliza (None, 128, 68, 120) 512 yolo_conv5_4[0][0]


yolo_conv5_4_lrelu (LeakyReLU) (None, 128, 68, 120) 0 yolo_conv5_4_bn[0][0]


yolo_conv5_5 (Conv2D) (None, 64, 68, 120) 8192 yolo_conv5_4_lrelu[0][0]


yolo_conv5_5_bn (BatchNormaliza (None, 64, 68, 120) 256 yolo_conv5_5[0][0]


yolo_conv5_5_lrelu (LeakyReLU) (None, 64, 68, 120) 0 yolo_conv5_5_bn[0][0]


yolo_conv1_6 (Conv2D) (None, 512, 17, 30) 1179648 yolo_conv1_5_lrelu[0][0]


yolo_conv3_6 (Conv2D) (None, 256, 34, 60) 294912 yolo_conv3_5_lrelu[0][0]


yolo_conv5_6 (Conv2D) (None, 128, 68, 120) 73728 yolo_conv5_5_lrelu[0][0]


yolo_conv1_6_bn (BatchNormaliza (None, 512, 17, 30) 2048 yolo_conv1_6[0][0]


yolo_conv3_6_bn (BatchNormaliza (None, 256, 34, 60) 1024 yolo_conv3_6[0][0]


yolo_conv5_6_bn (BatchNormaliza (None, 128, 68, 120) 512 yolo_conv5_6[0][0]


yolo_conv1_6_lrelu (LeakyReLU) (None, 512, 17, 30) 0 yolo_conv1_6_bn[0][0]


yolo_conv3_6_lrelu (LeakyReLU) (None, 256, 34, 60) 0 yolo_conv3_6_bn[0][0]


yolo_conv5_6_lrelu (LeakyReLU) (None, 128, 68, 120) 0 yolo_conv5_6_bn[0][0]


conv_big_object (Conv2D) (None, 24, 17, 30) 12312 yolo_conv1_6_lrelu[0][0]


conv_mid_object (Conv2D) (None, 24, 34, 60) 6168 yolo_conv3_6_lrelu[0][0]


conv_sm_object (Conv2D) (None, 24, 68, 120) 3096 yolo_conv5_6_lrelu[0][0]


bg_permute (Permute) (None, 17, 30, 24) 0 conv_big_object[0][0]


md_permute (Permute) (None, 34, 60, 24) 0 conv_mid_object[0][0]


sm_permute (Permute) (None, 68, 120, 24) 0 conv_sm_object[0][0]


bg_reshape (Reshape) (None, 1530, 8) 0 bg_permute[0][0]


md_reshape (Reshape) (None, 6120, 8) 0 md_permute[0][0]


sm_reshape (Reshape) (None, 24480, 8) 0 sm_permute[0][0]


bg_anchor (YOLOAnchorBox) (None, 1530, 6) 0 conv_big_object[0][0]


bg_bbox_processor (BBoxPostProc (None, 1530, 8) 0 bg_reshape[0][0]


md_anchor (YOLOAnchorBox) (None, 6120, 6) 0 conv_mid_object[0][0]


md_bbox_processor (BBoxPostProc (None, 6120, 8) 0 md_reshape[0][0]


sm_anchor (YOLOAnchorBox) (None, 24480, 6) 0 conv_sm_object[0][0]


sm_bbox_processor (BBoxPostProc (None, 24480, 8) 0 sm_reshape[0][0]


encoded_bg (Concatenate) (None, 1530, 14) 0 bg_anchor[0][0]
bg_bbox_processor[0][0]


encoded_md (Concatenate) (None, 6120, 14) 0 md_anchor[0][0]
md_bbox_processor[0][0]


encoded_sm (Concatenate) (None, 24480, 14) 0 sm_anchor[0][0]
sm_bbox_processor[0][0]


encoded_detections (Concatenate (None, 32130, 14) 0 encoded_bg[0][0]
encoded_md[0][0]
encoded_sm[0][0]

Total params: 20,215,304
Trainable params: 20,193,160
Non-trainable params: 22,144


2021-07-29 16:49:06,324 [INFO] __main__: Number of images in the training dataset: 10235
Epoch 1/80

Whenever I try to train Yolo_v4, it just gets stuck at that point; no errors are reported.
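As a sanity check that the model itself was built as configured, the head shapes in the summary above can be reproduced from the spec file. This is only a sketch: the strides 32/16/8 are an assumption inferred from the feature-map sizes in the summary (17x30, 34x60, 68x120 for a 544x960 input), and the split of each anchor's 8 values into 4 box coordinates, 1 objectness score and 3 class scores is the usual YOLO layout rather than something stated in this thread.

```python
# Values copied from the spec file and the model summary in the log above.
input_h, input_w = 544, 960
strides = {"big": 32, "mid": 16, "small": 8}  # assumption: inferred from feature-map sizes
anchors_per_scale = 3  # three anchor shapes per scale in yolov4_config

# target_class_mapping from the spec: four source labels collapse to three classes.
class_mapping = {
    "car": "car",
    "pedestrian": "pedestrian",
    "two_wheels": "bike",
    "person": "pedestrian",
}
num_classes = len(set(class_mapping.values()))

# Assumed per-anchor layout: 4 box coordinates + 1 objectness + class scores.
per_anchor = 4 + 1 + num_classes
head_channels = anchors_per_scale * per_anchor

# Number of predicted boxes per scale: grid cells times anchors per cell.
boxes = {
    name: (input_h // s) * (input_w // s) * anchors_per_scale
    for name, s in strides.items()
}
total_boxes = sum(boxes.values())

print(head_channels)  # channels of conv_big_object / conv_mid_object / conv_sm_object
print(boxes)          # rows of bg_reshape / md_reshape / sm_reshape
print(total_boxes)    # rows of encoded_detections
```

These values agree with the summary (24 channels; 1530, 6120 and 24480 boxes; 32130 in total), so the hang does not look like a graph-construction problem.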

Running nvidia-smi yields:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Tesla T4     On   | 00000000:AF:00.0 Off |                  Off |
| N/A   69C    P0    29W /  70W |  15706MiB / 16127MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
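The two numbers worth reading from this output are the memory usage and the GPU utilization. A minimal sketch for extracting them from the usage row follows; `parse_smi_row` is a hypothetical helper written for this post, not part of TLT or nvidia-smi, and it assumes the row layout shown here.

```python
import re

# The usage row copied from the nvidia-smi output above (whitespace collapsed).
row = "| N/A 69C P0 29W / 70W | 15706MiB / 16127MiB | 0% Default |"


def parse_smi_row(line):
    """Return (used_mib, total_mib, gpu_util_percent) from one nvidia-smi usage row."""
    mem = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", line)
    util = re.search(r"(\d+)%", line)
    if mem is None or util is None:
        raise ValueError("not an nvidia-smi usage row")
    return int(mem.group(1)), int(mem.group(2)), int(util.group(1))


used, total, util = parse_smi_row(row)
percent_used = round(100 * used / total, 1)
print(f"memory: {used}/{total} MiB ({percent_used}%), GPU util: {util}%")
```

Memory is about 97% allocated while utilization sits at 0%. That pattern suggests the process has claimed the GPU but is not computing, although nothing in this thread establishes why.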

Can you set batch_size_per_gpu to a lower value and retry?
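One way to try several batch sizes without hand-editing the spec each time is to rewrite the field programmatically. This is a minimal sketch: `set_batch_size` is a hypothetical helper, the value 8 is only an example, and the snippet works on an excerpt of the spec text rather than on the real file.

```python
import re


def set_batch_size(spec_text, new_value):
    """Return spec_text with batch_size_per_gpu set to new_value."""
    pattern = r"(batch_size_per_gpu\s*:\s*)\d+"
    # A callable replacement avoids ambiguity between the group reference and the digits.
    updated, count = re.subn(
        pattern, lambda m: f"{m.group(1)}{new_value}", spec_text
    )
    if count != 1:
        raise ValueError(
            f"expected exactly one batch_size_per_gpu entry, found {count}"
        )
    return updated


# Excerpt of the training_config block from the spec file above.
spec = """training_config {
batch_size_per_gpu: 24
num_epochs: 80
}"""

new_spec = set_batch_size(spec, 8)  # 8 is an example value, not a recommendation
print(new_spec)
```

Writing each variant to its own spec file and keeping the matching log makes it easier to see whether the hang correlates with batch size at all.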

I've tried. It sometimes runs and sometimes doesn't, and I can't see any pattern in when it gets stuck. I'm guessing it has something to do with my particular server rather than TLT itself. Thank you.
