Training YOLOv4 with 4 GPUs causes out of memory

Problem:
After some iterations in the first epoch, training becomes extremely slow (low GPU activity) and my server continuously uses a large amount of swap space (100GB).

I ran some tests with detectnet_v2: training starts quickly and without issues. Because of the low mAP with detectnet_v2, I had to move to yolov4, but yolov4 training is extremely slow to start (about 20 minutes before the GPUs are used), and after some iterations all memory is consumed, triggering the OOM killer.

Env Info:

Memory

ubuntu@xxxxx:~$ free -g
              total        used        free      shared  buff/cache   available
Mem:            186           2         182           0           1         182
Swap:           119           0         119

GPU

Fri Jul  1 12:43:38 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         On   | 00000000:00:1B.0 Off |                    0 |
|  0%   35C    P0    62W / 300W |   1144MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G         On   | 00000000:00:1C.0 Off |                    0 |
|  0%   31C    P8    23W / 300W |      2MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G         On   | 00000000:00:1D.0 Off |                    0 |
|  0%   30C    P8    24W / 300W |      2MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   31C    P8    25W / 300W |      2MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    112672      C   /usr/bin/python3.6                245MiB |
|    0   N/A  N/A    112799      C   python3.6                         897MiB |
+-----------------------------------------------------------------------------+

Tao info

Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.22.05
published_date: 05/25/2022

Dataset Size

Number of images in the train/val set. 142273
Number of labels in the train/val set. 142273
Number of images in the test set. 4802

.tao_mounts.json

{
    "Mounts": [
        {
            "source": "/home/ubuntu/nvidia_tao/projects/yolov4",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/ubuntu/nvidia_tao/cv_samples_v1.2.0/yolo_v4/specs",
            "destination": "/workspace/tao-experiments/yolo_v4/specs"
        },
        {
            "source": "/home/ubuntu/nvidia_tao/projects/dataset",
            "destination": "/workspace/tao-experiments/data"
        }
    ],
    "Envs": [
        {
            "variable": "CUDA_DEVICE_ORDER",
            "value": "PCI_BUS_ID"
        }
    ],
    "DockerOptions": {
        "shm_size": "36G",
        "ulimits": {
            "memlock": 0,
            "stack": 67108864
        }
    }
}

Train command

!tao yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti.txt \
                   -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                   -k $KEY \
                   --gpus 4 

yolo_v4_train_resnet18_kitti.txt

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(17.55, 8.00),(31.20, 11.73),(26.00, 19.20)]"
  mid_anchor_shape: "[(42.90, 20.27),(78.97, 25.24),(50.05, 41.07)]"
  small_anchor_shape: "[(104.65, 52.27),(177.45, 79.29),(358.80, 150.93)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.5
  arch: "cspdarknet"
  nlayers: 53
  arch_conv_blocks: 2
  loss_loc_weight: 0.8
  loss_neg_obj_weights: 100.0
  loss_class_weights: 0.5
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.1
  small_grid_xy_extend: 0.2
  freeze_bn: false
  freeze_blocks: 0
  freeze_blocks: 1
  freeze_blocks: 2
  freeze_blocks: 3
  freeze_blocks: 4
  freeze_blocks: 5
  force_relu: true
}

training_config {
  batch_size_per_gpu: 8
  num_epochs: 160
  enable_qat: true
  checkpoint_interval: 10
  n_workers: 8
  use_multiprocessing: true
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-2
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pretrain_model_path: "/workspace/tao-experiments/yolo_v4/pretrained_cspdarknet53/pretrained_object_detection_vcspdarknet53/cspdarknet_53.hdf5"
  visualizer {
    enabled: true
    num_images: 3
  }
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 8
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  force_on_cpu: true
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 1248
  output_height: 384
  output_channel: 3
  randomize_input_shape_period: 100
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
  image_mean {
    key: 'b'
    value: 103.9
  }
  image_mean {
    key: 'g'
    value: 116.8
  }
  image_mean {
    key: 'r'
    value: 123.7
  }
}
dataset_config {
  data_sources: {
      tfrecords_path: "/workspace/tao-experiments/data/training/tfrecords/train*"
      image_directory_path: "/workspace/tao-experiments/data/training/"
  }
  include_difficult_in_training: true
  image_extension: "jpg"
  target_class_mapping {
    key: "carro"
    value: "carro"
  }
  target_class_mapping {
    key: "moto"
    value: "moto"
  }
  target_class_mapping {
    key: "onibus"
    value: "onibus"
  }
  target_class_mapping {
    key: "utilitario"
    value: "utilitario"
  }
  target_class_mapping {
    key: "caminhao"
    value: "caminhao"
  }
  target_class_mapping {
    key: "ciclista"
    value: "ciclista"
  }
  target_class_mapping {
    key: "pedestre"
    value: "pedestre"
  }
  target_class_mapping {
    key: "placa"
    value: "placa"
  }
  validation_data_sources: {
      tfrecords_path: "/workspace/tao-experiments/data/val/tfrecords/val*"
      image_directory_path: "/workspace/tao-experiments/data/val/"
  }
}

Training

INFO: Starting Training Loop.
Epoch 1/160
caf1e99fa719:153:604 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
caf1e99fa719:153:604 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
caf1e99fa719:153:604 [0] NCCL INFO P2P plugin IBext
caf1e99fa719:153:604 [0] NCCL INFO NET/IB : No device found.
caf1e99fa719:153:604 [0] NCCL INFO NET/IB : No device found.
caf1e99fa719:153:604 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
caf1e99fa719:153:604 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
caf1e99fa719:162:606 [3] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
caf1e99fa719:154:601 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
caf1e99fa719:162:606 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
caf1e99fa719:162:606 [3] NCCL INFO P2P plugin IBext
caf1e99fa719:154:601 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
caf1e99fa719:154:601 [1] NCCL INFO P2P plugin IBext
caf1e99fa719:162:606 [3] NCCL INFO NET/IB : No device found.
caf1e99fa719:154:601 [1] NCCL INFO NET/IB : No device found.
caf1e99fa719:162:606 [3] NCCL INFO NET/IB : No device found.
caf1e99fa719:162:606 [3] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
caf1e99fa719:162:606 [3] NCCL INFO Using network Socket
caf1e99fa719:154:601 [1] NCCL INFO NET/IB : No device found.
caf1e99fa719:154:601 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
caf1e99fa719:154:601 [1] NCCL INFO Using network Socket
caf1e99fa719:158:602 [2] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
caf1e99fa719:158:602 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
caf1e99fa719:158:602 [2] NCCL INFO P2P plugin IBext
caf1e99fa719:158:602 [2] NCCL INFO NET/IB : No device found.
caf1e99fa719:158:602 [2] NCCL INFO NET/IB : No device found.
caf1e99fa719:158:602 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
caf1e99fa719:158:602 [2] NCCL INFO Using network Socket
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 1(=1c0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 3(=1e0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 1(=1c0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 3(=1e0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 1(=1c0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 3(=1e0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 1(=1c0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 3(=1e0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 1(=1c0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 3(=1e0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 1(=1c0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 3(=1e0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
caf1e99fa719:158:602 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
caf1e99fa719:153:604 [0] NCCL INFO Channel 00/02 :    0   1   2   3
caf1e99fa719:153:604 [0] NCCL INFO Channel 01/02 :    0   1   2   3
caf1e99fa719:153:604 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
caf1e99fa719:162:606 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:158:602 [2] NCCL INFO Channel 00 : 2[1d0] -> 3[1e0] via direct shared memory
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Channel 00 : 1[1c0] -> 2[1d0] via direct shared memory
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Channel 00 : 3[1e0] -> 0[1b0] via direct shared memory
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:153:604 [0] NCCL INFO Channel 00 : 0[1b0] -> 1[1c0] via direct shared memory
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:158:602 [2] NCCL INFO Channel 01 : 2[1d0] -> 3[1e0] via direct shared memory
caf1e99fa719:154:601 [1] NCCL INFO Channel 01 : 1[1c0] -> 2[1d0] via direct shared memory
caf1e99fa719:162:606 [3] NCCL INFO Channel 01 : 3[1e0] -> 0[1b0] via direct shared memory
caf1e99fa719:153:604 [0] NCCL INFO Channel 01 : 0[1b0] -> 1[1c0] via direct shared memory
caf1e99fa719:162:606 [3] NCCL INFO Connected all rings
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:158:602 [2] NCCL INFO Connected all rings
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Connected all rings
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Channel 00 : 3[1e0] -> 2[1d0] via direct shared memory
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Connected all rings
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:162:606 [3] NCCL INFO Channel 01 : 3[1e0] -> 2[1d0] via direct shared memory
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Channel 00 : 1[1c0] -> 0[1b0] via direct shared memory
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Channel 00 : 2[1d0] -> 1[1c0] via direct shared memory
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Channel 01 : 1[1c0] -> 0[1b0] via direct shared memory
caf1e99fa719:158:602 [2] NCCL INFO Channel 01 : 2[1d0] -> 1[1c0] via direct shared memory
caf1e99fa719:153:604 [0] NCCL INFO Connected all trees
caf1e99fa719:153:604 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
caf1e99fa719:153:604 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
caf1e99fa719:162:606 [3] NCCL INFO Connected all trees
caf1e99fa719:162:606 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
caf1e99fa719:162:606 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
caf1e99fa719:158:602 [2] NCCL INFO Connected all trees
caf1e99fa719:158:602 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
caf1e99fa719:158:602 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
caf1e99fa719:154:601 [1] NCCL INFO Connected all trees
caf1e99fa719:154:601 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
caf1e99fa719:154:601 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
caf1e99fa719:154:601 [1] NCCL INFO comm 0x7f9a947fd020 rank 1 nranks 4 cudaDev 1 busId 1c0 - Init COMPLETE
caf1e99fa719:158:602 [2] NCCL INFO comm 0x7f71b07fdc80 rank 2 nranks 4 cudaDev 2 busId 1d0 - Init COMPLETE
caf1e99fa719:162:606 [3] NCCL INFO comm 0x7f55ec7fd440 rank 3 nranks 4 cudaDev 3 busId 1e0 - Init COMPLETE
caf1e99fa719:153:604 [0] NCCL INFO comm 0x7fad74848700 rank 0 nranks 4 cudaDev 0 busId 1b0 - Init COMPLETE
caf1e99fa719:153:604 [0] NCCL INFO Launch mode Parallel
   1/8003 [..............................] - ETA: 67:03:05 - loss: 1827485.8750
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

5556/8003 [===================>..........] - ETA: 59:34 - loss: 71390.9329

Memory and swap usage during training

ubuntu@ip-xxxxx:~$ free -g
              total        used        free      shared  buff/cache   available
Mem:            186         169           2          14          14           1
Swap:           119          50          69
top - 17:24:06 up 4 days,  6:31,  1 user,  load average: 39.05, 40.18, 54.23
Tasks: 574 total,   1 running, 573 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.3 us,  1.0 sy,  0.0 ni, 39.2 id, 53.4 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem : 191197.9 total,   1291.7 free, 174934.5 used,  14971.7 buff/cache
MiB Swap: 122880.0 total,  71787.7 free,  51092.3 used.    316.2 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 113731 root      20   0  110.4g  38.4g 301076 S 110.0  20.6 997:29.50 python3.6
 113727 root      20   0  139.9g  48.6g 317624 S 105.0  26.0 998:14.29 python3.6
 113735 root      20   0  110.6g  38.6g 285260 S 100.3  20.7 982:12.51 python3.6
 113726 root      20   0  107.9g  37.9g 302560 S  12.3  20.3 974:57.70 python3.6
    309 root      20   0       0      0      0 S  12.0   0.0 104:43.38 kcompactd0
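
To put the `top` output above in perspective, the resident memory (RES) of the four python3.6 training processes can be added up and compared with the total RAM. The following is only a rough sketch: the numbers are copied by hand from the output above, and the names `TOP_ROWS` and `summarize` are made up for this illustration.

```python
# Rows copied by hand from the `top` output: (PID, RES in GiB, %MEM).
TOP_ROWS = [
    (113731, 38.4, 20.6),
    (113727, 48.6, 26.0),
    (113735, 38.6, 20.7),
    (113726, 37.9, 20.3),
]
TOTAL_MEM_MIB = 191197.9  # from the "MiB Mem" line
SWAP_USED_MIB = 51092.3   # from the "MiB Swap" line


def summarize(rows, total_mem_mib):
    """Return (sum of RES in GiB, total RAM in GiB, RES share of RAM)."""
    total_res_gib = sum(res for _, res, _ in rows)
    total_mem_gib = total_mem_mib / 1024
    return total_res_gib, total_mem_gib, total_res_gib / total_mem_gib


res_gib, mem_gib, share = summarize(TOP_ROWS, TOTAL_MEM_MIB)
print(f"resident: {res_gib:.1f} GiB of {mem_gib:.1f} GiB ({share:.1%})")
print(f"swap in use: {SWAP_USED_MIB / 1024:.1f} GiB")
```

The four training processes (one per GPU) hold roughly 163.5 GiB between them, which is close to 88% of physical RAM, so the pressure comes from host memory rather than GPU memory.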

Please set the loss weights to:
loss_loc_weight: 1.0
loss_neg_obj_weights: 1.0
loss_class_weights: 1.0

Also, do you really intend to train a 1248x384 model?

See YOLOv4 - NVIDIA Docs
From our experience, if mosaic augmentation is disabled (mosaic_prob=0), training with the TFRecords format is faster.

Please set randomize_input_shape_period: 0
If K=0, the output width/height will always be the exact base width/height as configured, and training will be much faster.
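
For anyone who prefers to script the changes suggested in this reply rather than edit the spec by hand, here is a minimal sketch. It is not part of TAO: `apply_overrides`, `SPEC`, and `SUGGESTED` are hypothetical names, `SPEC` is an abbreviated excerpt of the spec posted above, and the approach assumes each key appears exactly once on its own `key: value` line.

```python
import re


def apply_overrides(spec_text, overrides):
    """Replace the value on each `key: value` line named in overrides."""
    for key, value in overrides.items():
        # Match the key at the start of a line (after indentation) and
        # capture everything up to the value so it can be kept unchanged.
        pattern = re.compile(
            rf"^([ \t]*{re.escape(key)}[ \t]*:[ \t]*)\S+", re.MULTILINE
        )
        spec_text, count = pattern.subn(
            lambda m, v=value: m.group(1) + str(v), spec_text
        )
        if count != 1:
            raise ValueError(f"expected one '{key}' line, found {count}")
    return spec_text


# Abbreviated excerpt of the spec from the question (illustration only).
SPEC = """\
yolov4_config {
  loss_loc_weight: 0.8
  loss_neg_obj_weights: 100.0
  loss_class_weights: 0.5
}
augmentation_config {
  randomize_input_shape_period: 100
  mosaic_prob: 0.5
}
"""

# Values suggested in the replies.
SUGGESTED = {
    "loss_loc_weight": 1.0,
    "loss_neg_obj_weights": 1.0,
    "loss_class_weights": 1.0,
    "randomize_input_shape_period": 0,
}

patched = apply_overrides(SPEC, SUGGESTED)
print(patched)
```

mosaic_prob is left untouched here because the reply only notes that disabling it makes TFRecords training faster. Add `"mosaic_prob": 0` to `SUGGESTED` if you want to try that as well.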

There has been no update from you for a while, so we assume this is no longer an issue.
We are therefore closing this topic. If you need further support, please open a new one.
Thanks

Thank you, this solved my problem.
