Why is training killed when starting with the TAO Toolkit?

2021-09-16 06:18:47,963 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:9: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

Epoch 1/1
Killed
2021-09-16 14:21:17,998 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

My training spec file:
random_seed: 42
yolov4_config {
big_anchor_shape: "[(46.65, 215.37),(89.72, 297.77),(171.49, 468.96)]"
mid_anchor_shape: "[(29.25, 40.61),(24.02, 118.86),(65.14, 74.95)]"
small_anchor_shape: "[(5.13, 8.92),(8.97, 16.21),(16.42, 25.33)]"
box_matching_iou: 0.25
matching_neutral_box_iou: 0.5
arch: "resnet"
nlayers: 50
arch_conv_blocks: 2
loss_loc_weight: 0.8
loss_neg_obj_weights: 100.0
loss_class_weights: 0.5
label_smoothing: 0.0
big_grid_xy_extend: 0.05
mid_grid_xy_extend: 0.1
small_grid_xy_extend: 0.2
freeze_bn: false
#freeze_blocks: 0
force_relu: false
}
training_config {
batch_size_per_gpu: 8
num_epochs: 1
enable_qat: false
checkpoint_interval: 1
learning_rate {
soft_start_cosine_annealing_schedule {
min_learning_rate: 1e-7
max_learning_rate: 1e-4
soft_start: 0.3
}
}
regularizer {
type: L1
weight: 3e-5
}
optimizer {
adam {
epsilon: 1e-7
beta1: 0.9
beta2: 0.999
amsgrad: false
}
}
pretrain_model_path: "/workspace/tao-experiments/yolo_v4/pretrained_resnet50/pretrained_object_detection_vresnet50/resnet_50.hdf5"
}
eval_config {
average_precision_mode: SAMPLE
batch_size: 8
matching_iou_threshold: 0.5
}
nms_config {
confidence_threshold: 0.001
clustering_iou_threshold: 0.5
force_on_cpu: true
top_k: 200
}
augmentation_config {
hue: 0.1
saturation: 1.5
exposure: 1.5
vertical_flip: 0
horizontal_flip: 0.5
jitter: 0.3
output_width: 608
output_height: 608
output_channel: 3
randomize_input_shape_period: 0
mosaic_prob: 0.5
mosaic_min_ratio: 0.2
}
dataset_config {
data_sources: {
tfrecords_path: "/workspace/tao-experiments/data/training/tfrecords/train*"
image_directory_path: "/workspace/tao-experiments/data/training"
}
include_difficult_in_training: true
image_extension: "jpg"
target_class_mapping {
key: "badge"
value: "badge"
}
target_class_mapping {
key: "person"
value: "person"
}
target_class_mapping {
key: "glove"
value: "glove"
}
target_class_mapping {
key: "wrongglove"
value: "wrongglove"
}
target_class_mapping {
key: "operatingbar"
value: "operatingbar"
}
target_class_mapping {
key: "powerchecker"
value: "powerchecker"
}
validation_data_sources: {
tfrecords_path: "/workspace/tao-experiments/data/val/tfrecords/val*"
image_directory_path: "/workspace/tao-experiments/data/val"
}
}

Please check if another application is occupying GPU memory.
$ nvidia-smi

When I change batch_size_per_gpu to 2, it works, but training is very slow. How can I change the config to speed it up?
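For reference, the change that eliminated the Killed error is just the smaller batch size in training_config; a minimal excerpt of the spec above with only that value changed:

training_config {
batch_size_per_gpu: 2  # reduced from 8 so training fits in the available memory
# ... all other training_config fields unchanged
}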
What is the size of the training images?

Is the image resolution too high? e.g. 3268×1632 and 2250×4000

So, the “killed” issue is gone, right?

Yeah, but it is still very slow.

Please refer to YOLOv4 — TAO Toolkit 3.0 documentation

YOLOv4 supports two data formats: the sequence format (KITTI images folder and raw labels folder) and the tfrecords format (KITTI images folder and TFRecords). From our experience, if mosaic augmentation is disabled (mosaic_prob=0), training with TFRecords format is faster. If mosaic augmentation is enabled (mosaic_prob>0), training with sequence format is faster.
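For illustration, a sequence-format data source in dataset_config would look roughly like the sketch below; the field names follow the TAO YOLOv4 dataset_config documentation, and the label/image paths are placeholders rather than paths from this thread:

data_sources: {
label_directory_path: "/workspace/tao-experiments/data/training/labels"
image_directory_path: "/workspace/tao-experiments/data/training/images"
}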

Since you are using the tfrecords format, please try setting mosaic_prob=0.
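For example, with the posted spec that change only touches augmentation_config; every other value stays as it is:

augmentation_config {
hue: 0.1
saturation: 1.5
exposure: 1.5
vertical_flip: 0
horizontal_flip: 0.5
jitter: 0.3
output_width: 608
output_height: 608
output_channel: 3
randomize_input_shape_period: 0
mosaic_prob: 0.0  # disable mosaic so tfrecords training runs faster
mosaic_min_ratio: 0.2
}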

OK, thank you.