Problem:
After some iterations in the first epoch, training becomes extremely slow (low GPU activity) and my server continuously uses a large amount of swap space (up to 100 GB).
I ran some tests with detectnet_v2: training starts quickly and shows no issues. Because of the low mAP on detectnet_v2, however, I had to move to yolov4. Training with yolov4 is extremely slow to start (about 20 minutes before it begins using the GPU), and after some iterations all memory is used, which triggers the OOM killer.
Env Info:
Memory
ubuntu@xxxxx:~$ free -g
total used free shared buff/cache available
Mem: 186 2 182 0 1 182
Swap: 119 0 119
GPU
Fri Jul 1 12:43:38 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 35C P0 62W / 300W | 1144MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 31C P8 23W / 300W | 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 30C P8 24W / 300W | 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 31C P8 25W / 300W | 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 112672 C /usr/bin/python3.6 245MiB |
| 0 N/A N/A 112799 C python3.6 897MiB |
+-----------------------------------------------------------------------------+
TAO info
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.22.05
published_date: 05/25/2022
Dataset Size
Number of images in the train/val set. 142273
Number of labels in the train/val set. 142273
Number of images in the test set. 4802
.tao_mounts.json
{
"Mounts": [
{
"source": "/home/ubuntu/nvidia_tao/projects/yolov4",
"destination": "/workspace/tao-experiments"
},
{
"source": "/home/ubuntu/nvidia_tao/cv_samples_v1.2.0/yolo_v4/specs",
"destination": "/workspace/tao-experiments/yolo_v4/specs"
},
{
"source": "/home/ubuntu/nvidia_tao/projects/dataset",
"destination": "/workspace/tao-experiments/data"
}
],
"Envs": [
{
"variable": "CUDA_DEVICE_ORDER",
"value": "PCI_BUS_ID"
}
],
"DockerOptions": {
"shm_size": "36G",
"ulimits": {
"memlock": 0,
"stack": 67108864
}
}
}
!tao yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti.txt \
-r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
-k $KEY \
--gpus 4
yolo_v4_train_resnet18_kitti.txt
random_seed: 42
yolov4_config {
big_anchor_shape: "[(17.55, 8.00),(31.20, 11.73),(26.00, 19.20)]"
mid_anchor_shape: "[(42.90, 20.27),(78.97, 25.24),(50.05, 41.07)]"
small_anchor_shape: "[(104.65, 52.27),(177.45, 79.29),(358.80, 150.93)]"
box_matching_iou: 0.25
matching_neutral_box_iou: 0.5
arch: "cspdarknet"
nlayers: 53
arch_conv_blocks: 2
loss_loc_weight: 0.8
loss_neg_obj_weights: 100.0
loss_class_weights: 0.5
label_smoothing: 0.0
big_grid_xy_extend: 0.05
mid_grid_xy_extend: 0.1
small_grid_xy_extend: 0.2
freeze_bn: false
freeze_blocks: 0
freeze_blocks: 1
freeze_blocks: 2
freeze_blocks: 3
freeze_blocks: 4
freeze_blocks: 5
force_relu: true
}
training_config {
batch_size_per_gpu: 8
num_epochs: 160
enable_qat: true
checkpoint_interval: 10
n_workers: 8
use_multiprocessing: true
learning_rate {
soft_start_cosine_annealing_schedule {
min_learning_rate: 1e-7
max_learning_rate: 1e-2
soft_start: 0.3
}
}
regularizer {
type: L1
weight: 3e-5
}
optimizer {
adam {
epsilon: 1e-7
beta1: 0.9
beta2: 0.999
amsgrad: false
}
}
pretrain_model_path: "/workspace/tao-experiments/yolo_v4/pretrained_cspdarknet53/pretrained_object_detection_vcspdarknet53/cspdarknet_53.hdf5"
visualizer {
enabled: true
num_images: 3
}
}
eval_config {
average_precision_mode: SAMPLE
batch_size: 8
matching_iou_threshold: 0.5
}
nms_config {
confidence_threshold: 0.001
clustering_iou_threshold: 0.5
force_on_cpu: true
top_k: 200
}
augmentation_config {
hue: 0.1
saturation: 1.5
exposure: 1.5
vertical_flip: 0
horizontal_flip: 0.5
jitter: 0.3
output_width: 1248
output_height: 384
output_channel: 3
randomize_input_shape_period: 100
mosaic_prob: 0.5
mosaic_min_ratio: 0.2
image_mean {
key: 'b'
value: 103.9
}
image_mean {
key: 'g'
value: 116.8
}
image_mean {
key: 'r'
value: 123.7
}
}
dataset_config {
data_sources: {
tfrecords_path: "/workspace/tao-experiments/data/training/tfrecords/train*"
image_directory_path: "/workspace/tao-experiments/data/training/"
}
include_difficult_in_training: true
image_extension: "jpg"
target_class_mapping {
key: "carro"
value: "carro"
}
target_class_mapping {
key: "moto"
value: "moto"
}
target_class_mapping {
key: "onibus"
value: "onibus"
}
target_class_mapping {
key: "utilitario"
value: "utilitario"
}
target_class_mapping {
key: "caminhao"
value: "caminhao"
}
target_class_mapping {
key: "ciclista"
value: "ciclista"
}
target_class_mapping {
key: "pedestre"
value: "pedestre"
}
target_class_mapping {
key: "placa"
value: "placa"
}
validation_data_sources: {
tfrecords_path: "/workspace/tao-experiments/data/val/tfrecords/val*"
image_directory_path: "/workspace/tao-experiments/data/val/"
}
}
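To put the data-loading settings in the spec above into numbers, here is a minimal Python sketch. It reads `batch_size_per_gpu` and `n_workers` from an excerpt of the spec and multiplies them by the GPU count from `--gpus 4`. The helper `read_int` is hypothetical, and the worker total rests on an assumption the post does not confirm: that each per-GPU training process starts its own `n_workers` loader processes when `use_multiprocessing` is true.

```python
import re

# Excerpt of the training_config block from the spec file above.
SPEC_EXCERPT = """
training_config {
  batch_size_per_gpu: 8
  n_workers: 8
  use_multiprocessing: true
}
"""

NUM_GPUS = 4  # taken from `--gpus 4` in the train command


def read_int(spec: str, key: str) -> int:
    """Return the integer value of `key: <int>` from protobuf-text-style input."""
    pattern = rf"^\s*{re.escape(key)}\s*:\s*(\d+)"
    match = re.search(pattern, spec, flags=re.MULTILINE)
    if match is None:
        raise KeyError(key)
    return int(match.group(1))


batch_per_gpu = read_int(SPEC_EXCERPT, "batch_size_per_gpu")
workers_per_process = read_int(SPEC_EXCERPT, "n_workers")

# Images consumed per optimizer step across all GPUs.
global_batch = batch_per_gpu * NUM_GPUS

# ASSUMPTION: one training process per GPU, each with its own loader workers.
total_loader_workers = workers_per_process * NUM_GPUS

print(f"global batch size: {global_batch}")
print(f"loader worker processes (assumed): {total_loader_workers}")
```

If that assumption holds, lowering `n_workers` would be one variable to test when checking whether host memory growth comes from the data loaders.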
Training
INFO: Starting Training Loop.
Epoch 1/160
caf1e99fa719:153:604 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
caf1e99fa719:153:604 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
caf1e99fa719:153:604 [0] NCCL INFO P2P plugin IBext
caf1e99fa719:153:604 [0] NCCL INFO NET/IB : No device found.
caf1e99fa719:153:604 [0] NCCL INFO NET/IB : No device found.
caf1e99fa719:153:604 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
caf1e99fa719:153:604 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
caf1e99fa719:162:606 [3] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
caf1e99fa719:154:601 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
caf1e99fa719:162:606 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
caf1e99fa719:162:606 [3] NCCL INFO P2P plugin IBext
caf1e99fa719:154:601 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
caf1e99fa719:154:601 [1] NCCL INFO P2P plugin IBext
caf1e99fa719:162:606 [3] NCCL INFO NET/IB : No device found.
caf1e99fa719:154:601 [1] NCCL INFO NET/IB : No device found.
caf1e99fa719:162:606 [3] NCCL INFO NET/IB : No device found.
caf1e99fa719:162:606 [3] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
caf1e99fa719:162:606 [3] NCCL INFO Using network Socket
caf1e99fa719:154:601 [1] NCCL INFO NET/IB : No device found.
caf1e99fa719:154:601 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
caf1e99fa719:154:601 [1] NCCL INFO Using network Socket
caf1e99fa719:158:602 [2] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
caf1e99fa719:158:602 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
caf1e99fa719:158:602 [2] NCCL INFO P2P plugin IBext
caf1e99fa719:158:602 [2] NCCL INFO NET/IB : No device found.
caf1e99fa719:158:602 [2] NCCL INFO NET/IB : No device found.
caf1e99fa719:158:602 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
caf1e99fa719:158:602 [2] NCCL INFO Using network Socket
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 1(=1c0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 3(=1e0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 1(=1c0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 3(=1e0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 1(=1c0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 3(=1e0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 1(=1c0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 3(=1e0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 1(=1c0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 3(=1e0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 1(=1c0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 3(=1e0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
caf1e99fa719:158:602 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
caf1e99fa719:153:604 [0] NCCL INFO Channel 00/02 : 0 1 2 3
caf1e99fa719:153:604 [0] NCCL INFO Channel 01/02 : 0 1 2 3
caf1e99fa719:153:604 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
caf1e99fa719:162:606 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 3(=1e0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:158:602 [2] NCCL INFO Channel 00 : 2[1d0] -> 3[1e0] via direct shared memory
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Channel 00 : 1[1c0] -> 2[1d0] via direct shared memory
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Channel 00 : 3[1e0] -> 0[1b0] via direct shared memory
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 0(=1b0)
caf1e99fa719:153:604 [0] NCCL INFO Channel 00 : 0[1b0] -> 1[1c0] via direct shared memory
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:158:602 [2] NCCL INFO Channel 01 : 2[1d0] -> 3[1e0] via direct shared memory
caf1e99fa719:154:601 [1] NCCL INFO Channel 01 : 1[1c0] -> 2[1d0] via direct shared memory
caf1e99fa719:162:606 [3] NCCL INFO Channel 01 : 3[1e0] -> 0[1b0] via direct shared memory
caf1e99fa719:153:604 [0] NCCL INFO Channel 01 : 0[1b0] -> 1[1c0] via direct shared memory
caf1e99fa719:162:606 [3] NCCL INFO Connected all rings
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:158:602 [2] NCCL INFO Connected all rings
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Connected all rings
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:162:606 [3] NCCL INFO Channel 00 : 3[1e0] -> 2[1d0] via direct shared memory
caf1e99fa719:162:606 [3] NCCL INFO Could not enable P2P between dev 3(=1e0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Connected all rings
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:162:606 [3] NCCL INFO Channel 01 : 3[1e0] -> 2[1d0] via direct shared memory
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 3(=1e0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
caf1e99fa719:153:604 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Channel 00 : 1[1c0] -> 0[1b0] via direct shared memory
caf1e99fa719:154:601 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
caf1e99fa719:158:602 [2] NCCL INFO Channel 00 : 2[1d0] -> 1[1c0] via direct shared memory
caf1e99fa719:158:602 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
caf1e99fa719:154:601 [1] NCCL INFO Channel 01 : 1[1c0] -> 0[1b0] via direct shared memory
caf1e99fa719:158:602 [2] NCCL INFO Channel 01 : 2[1d0] -> 1[1c0] via direct shared memory
caf1e99fa719:153:604 [0] NCCL INFO Connected all trees
caf1e99fa719:153:604 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
caf1e99fa719:153:604 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
caf1e99fa719:162:606 [3] NCCL INFO Connected all trees
caf1e99fa719:162:606 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
caf1e99fa719:162:606 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
caf1e99fa719:158:602 [2] NCCL INFO Connected all trees
caf1e99fa719:158:602 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
caf1e99fa719:158:602 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
caf1e99fa719:154:601 [1] NCCL INFO Connected all trees
caf1e99fa719:154:601 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
caf1e99fa719:154:601 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
caf1e99fa719:154:601 [1] NCCL INFO comm 0x7f9a947fd020 rank 1 nranks 4 cudaDev 1 busId 1c0 - Init COMPLETE
caf1e99fa719:158:602 [2] NCCL INFO comm 0x7f71b07fdc80 rank 2 nranks 4 cudaDev 2 busId 1d0 - Init COMPLETE
caf1e99fa719:162:606 [3] NCCL INFO comm 0x7f55ec7fd440 rank 3 nranks 4 cudaDev 3 busId 1e0 - Init COMPLETE
caf1e99fa719:153:604 [0] NCCL INFO comm 0x7fad74848700 rank 0 nranks 4 cudaDev 0 busId 1b0 - Init COMPLETE
caf1e99fa719:153:604 [0] NCCL INFO Launch mode Parallel
1/8003 [..............................] - ETA: 67:03:05 - loss: 1827485.8750
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.
WARNING: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.
5556/8003 [===================>..........] - ETA: 59:34 - loss: 71390.9329
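The two progress lines above can be turned into an approximate time per step. The sketch below assumes the progress bar behaves like the Keras one, where the ETA is the average step time so far multiplied by the remaining steps; this is an assumption about TAO's output, and the function names are illustrative.

```python
def hms_to_seconds(text: str) -> int:
    """Convert 'H:MM:SS' or 'MM:SS' to a number of seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def seconds_per_step(current: int, total: int, eta: str) -> float:
    """Average step time implied by an ETA, assuming ETA = avg * remaining steps."""
    remaining = total - current
    return hms_to_seconds(eta) / remaining


# Values copied from the two progress lines in the log above.
first = seconds_per_step(1, 8003, "67:03:05")
later = seconds_per_step(5556, 8003, "59:34")

print(f"implied time per step at step 1:    {first:.2f} s")
print(f"implied time per step at step 5556: {later:.2f} s")
```

Under that assumption the first step took about 30 s (it includes startup), while the average by step 5556 is about 1.5 s. Because this is an average over the whole run so far, it would hide a recent slowdown; timing individual steps would be needed to see one.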
Memory and swap usage during training
ubuntu@ip-xxxxx:~$ free -g
total used free shared buff/cache available
Mem: 186 169 2 14 14 1
Swap: 119 50 69
top - 17:24:06 up 4 days, 6:31, 1 user, load average: 39.05, 40.18, 54.23
Tasks: 574 total, 1 running, 573 sleeping, 0 stopped, 0 zombie
%Cpu(s): 6.3 us, 1.0 sy, 0.0 ni, 39.2 id, 53.4 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 191197.9 total, 1291.7 free, 174934.5 used, 14971.7 buff/cache
MiB Swap: 122880.0 total, 71787.7 free, 51092.3 used. 316.2 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
113731 root 20 0 110.4g 38.4g 301076 S 110.0 20.6 997:29.50 python3.6
113727 root 20 0 139.9g 48.6g 317624 S 105.0 26.0 998:14.29 python3.6
113735 root 20 0 110.6g 38.6g 285260 S 100.3 20.7 982:12.51 python3.6
113726 root 20 0 107.9g 37.9g 302560 S 12.3 20.3 974:57.70 python3.6
309 root 20 0 0 0 0 S 12.0 0.0 104:43.38 kcompactd0
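As a quick check on where the memory is going, the resident memory (RES) of the four python3.6 processes in the `top` output above can be added up. This is a minimal Python sketch; the rows are copied from the output, and the helper names are illustrative.

```python
# The four python3.6 rows copied from the `top` output above.
TOP_ROWS = """\
113731 root 20 0 110.4g 38.4g 301076 S 110.0 20.6 997:29.50 python3.6
113727 root 20 0 139.9g 48.6g 317624 S 105.0 26.0 998:14.29 python3.6
113735 root 20 0 110.6g 38.6g 285260 S 100.3 20.7 982:12.51 python3.6
113726 root 20 0 107.9g 37.9g 302560 S 12.3 20.3 974:57.70 python3.6
"""


def parse_size_gib(field: str) -> float:
    """Convert a top size field such as '38.4g' or '301076' (KiB) to GiB."""
    units = {"k": 1 / 1024**2, "m": 1 / 1024, "g": 1.0, "t": 1024.0}
    suffix = field[-1].lower()
    if suffix in units:
        return float(field[:-1]) * units[suffix]
    return float(field) / 1024**2  # plain numbers in top are KiB


def summarize(rows: str):
    """Return (total RES in GiB, total %MEM) over all rows."""
    total_res = 0.0
    total_pct = 0.0
    for line in rows.strip().splitlines():
        cols = line.split()
        total_res += parse_size_gib(cols[5])  # RES column
        total_pct += float(cols[9])           # %MEM column
    return total_res, total_pct


res_gib, mem_pct = summarize(TOP_ROWS)
print(f"python3.6 resident memory: {res_gib:.1f} GiB ({mem_pct:.1f}% of RAM)")
```

The four training processes together hold about 163.5 GiB resident, which accounts for most of the 169 GB that `free -g` reports as used, before counting the 50 GB already in swap.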