Training process is slow, GPU is not fully utilized

On an Ubuntu 16.04 machine with a GTX 1080 8 GB GPU, an i7 CPU, and 32 GB of memory.

I am using the MSD Task03_Liver dataset to train a model based on the nvcr.io/nvidia/clara-train-sdk:v1.0-py3 Docker image.

I have read the related documentation.

The data was converted to NIfTI (.nii) format at 1 mm resolution.

I downloaded the pretrained model (annotation_ct_liver_v1) from NGC and used its commands/train.sh script to start training on a single GPU in a Docker container:

NV_GPU=0 nvidia-docker run --shm-size=4g --ulimit memlock=-1 --ulimit stack=67108864 --runtime=nvidia -it --rm -v /mnt/workspace:/workspace -v /data/MSD_tmp_clara:/workspace/data -v /data/CNNPretrainedModel/kaggle_keras_pretrain/resnet50_weights_tf_dim_ordering_tf_kernels.h5:/var/tmp/resnet50_weights_tf_dim_ordering_tf_kernels.h5 /workspace/annotation_ct_liver_v1/commands/train.sh

In my training run, one epoch takes 653.74 seconds. Is this a normal speed?

The log looks like this:

2019-08-22 04:19:41,322 - LegacyFitter - INFO - get_next_batch time:0.00s
Epoch: 3/2000, Iter: 52/52 [====================] train_loss: 0.2219 time: 0.49s
2019-08-22 04:19:41,809 - LegacyFitter - INFO - This epoch: 501.04s; per epoch: 599.76s; elapsed: 1799.29s; remaining: 1197730.27s; best metric: 0.1330786601238027 at epoch 1

During training, the time per iteration varies widely; the CPU spends too much time and the GPU is poorly utilized.

I guess the preprocessing step costs too much CPU time, but I don't know why.

I tried different "num_workers" and "prefetch_size" values in the config file, but the GPU is still poorly utilized.

Any suggestions, please!

Did you give it a try with /workspace/annotation_ct_liver_v1/commands/train_2gpu.sh?

I think the performance should look good, at least in the single-GPU training case.

I ran the two-GPU training process, and the GPUs are still poorly utilized.
During the second training epoch, there are warnings about a 60-second timeout that seem related to the slow-training problem.

nvidia-docker run --shm-size=4g --ulimit memlock=-1 --ulimit stack=67108864 --runtime=nvidia -it --rm -v /mnt/sp7/DLDataset/docker/clara_model/workspace:/workspace -v /data/MSD_tmp_clara:/workspace/data -v /data/CNNPretrainedModel/kaggle_keras_pretrain/resnet50_weights_tf_dim_ordering_tf_kernels.h5:/var/tmp/resnet50_weights_tf_dim_ordering_tf_kernels.h5 nvcr.io/nvidia/clara-train-sdk:v1.0-py3 /workspace/annotation_ct_liver_v1/commands/train_2gpu.sh

================
== TensorFlow ==
================

NVIDIA Release 19.02 (build 5618942)
TensorFlow Version v1.13.0-rc0

Container image Copyright © 2018, NVIDIA CORPORATION. All rights reserved.
Copyright 2017-2018 The TensorFlow Authors. All rights reserved.

Various files include modifications © NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.

MMAR_ROOT set to /workspace/annotation_ct_liver_v1/commands/…
2019-08-25 01:46:44,716 - ImagePipeline - INFO - Data Property: {'num_label_channels': 1, 'label_format': None, 'num_data_dims': 3, 'crop_size': [96, 96, 96], 'data_format': 'channels_first', 'task': 'segmentation', 'num_channels': 2}
2019-08-25 01:46:44,716 - ImagePipeline - INFO - Data Property: {'task': 'segmentation', 'num_label_channels': 1, 'num_data_dims': 3, 'label_format': None, 'data_format': 'channels_first', 'crop_size': [96, 96, 96], 'num_channels': 2}
2019-08-25 01:46:44,739 - ImagePipeline - INFO - Data Property: {'num_label_channels': 1, 'label_format': None, 'num_data_dims': 3, 'crop_size': [96, 96, 96], 'data_format': 'channels_first', 'task': 'segmentation', 'num_channels': 2}
2019-08-25 01:46:44,739 - ImagePipeline - INFO - Data Property: {'task': 'segmentation', 'num_label_channels': 1, 'num_data_dims': 3, 'label_format': None, 'data_format': 'channels_first', 'crop_size': [96, 96, 96], 'num_channels': 2}
transpose to channels_last. input shape: (?, 96, 96, 96, 2)
net_config: {'final_activation': 'softmax', 'plane': 'z', 'if_from_scratch': False, 'if_use_psp': False, 'pretrain_weight_name': '/var/tmp/resnet50_weights_tf_dim_ordering_tf_kernels.h5'}
transpose to channels_last. input shape: (?, 96, 96, 96, 2)
net_config: {'if_use_psp': False, 'plane': 'z', 'if_from_scratch': False, 'pretrain_weight_name': '/var/tmp/resnet50_weights_tf_dim_ordering_tf_kernels.h5', 'final_activation': 'softmax'}
input is channels_last!
input is channels_last!
Stage 1 (?, 24, 24, 24, 64)
Stage 1 (?, 24, 24, 24, 64)
Stage 2 (?, 24, 24, 24, 256)
Stage 2 (?, 24, 24, 24, 256)
Stage 3 (?, 12, 12, 12, 512)
Stage 3 (?, 12, 12, 12, 512)
Stage 4 (?, 6, 6, 6, 1024)
Stage 4 (?, 6, 6, 6, 1024)
Stage 5 (?, 3, 3, 3, 2048)
Stage 5 (?, 3, 3, 3, 2048)
Stage 6 (?, 6, 6, 6, 1084)
Stage 6 (?, 6, 6, 6, 1084)
Stage 7 (?, 12, 12, 12, 572)
Stage 7 (?, 12, 12, 12, 572)
Stage 8 (?, 24, 24, 24, 316)
Stage 8 (?, 24, 24, 24, 316)
Stage 9 (?, 24, 24, 24, 124)
Stage 9 (?, 24, 24, 24, 124)
Stage 10 (?, 48, 48, 48, 124)
Stage 10 (?, 48, 48, 48, 124)
Final activation softmax
Final (?, 96, 96, 96, 2)
transpose back to channels_first
dice_loss targets [None, 1, 96, 96, 96] predictions [None, 2, 96, 96, 96] targets.dtype <dtype: 'float32'> predictions.dtype <dtype: 'float32'>
dice_loss is_channels_first: True skip_background: False is_onehot_targets False
Make multi-gpu optimizer
Final activation softmax
Final (?, 96, 96, 96, 2)
transpose back to channels_first
dice_loss targets [None, 1, 96, 96, 96] predictions [None, 2, 96, 96, 96] targets.dtype <dtype: 'float32'> predictions.dtype <dtype: 'float32'>
dice_loss is_channels_first: True skip_background: False is_onehot_targets False
Make multi-gpu optimizer
2019-08-25 01:47:10.295399: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3408000000 Hz
2019-08-25 01:47:10.296042: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x2ee2f300 executing computations on platform Host. Devices:
2019-08-25 01:47:10.296093: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): ,
2019-08-25 01:47:10.314621: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3408000000 Hz
2019-08-25 01:47:10.314966: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x2a470ab0 executing computations on platform Host. Devices:
2019-08-25 01:47:10.314997: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): ,
2019-08-25 01:47:10.681981: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-25 01:47:10.681977: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-25 01:47:10.683730: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x2ef0e130 executing computations on platform CUDA. Devices:
2019-08-25 01:47:10.683756: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2019-08-25 01:47:10.683796: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x2a54f8d0 executing computations on platform CUDA. Devices:
2019-08-25 01:47:10.683819: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2019-08-25 01:47:10.683894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7715
pciBusID: 0000:01:00.0
totalMemory: 7.92GiB freeMemory: 7.81GiB
2019-08-25 01:47:10.683915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-08-25 01:47:10.683943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7715
pciBusID: 0000:04:00.0
totalMemory: 7.93GiB freeMemory: 7.82GiB
2019-08-25 01:47:10.683963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 1
2019-08-25 01:47:30.687713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-25 01:47:30.687802: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 1
2019-08-25 01:47:30.687815: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: N
2019-08-25 01:47:30.687946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7537 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080, pci bus id: 0000:04:00.0, compute capability: 6.1)
2019-08-25 01:47:30,689 - LegacyFitter - INFO - CLEAN START (global_variables_initializer)
2019-08-25 01:47:30.722799: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-25 01:47:30.722880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-08-25 01:47:30.722899: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-08-25 01:47:30.723055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7532 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-08-25 01:47:30,724 - LegacyFitter - INFO - CLEAN START (global_variables_initializer)
2019-08-25 01:48:52,402 - LegacyFitter - INFO - Requested train epochs: 2000; iterations: 26
2019-08-25 01:49:13,445 - LegacyFitter - INFO - Requested train epochs: 2000; iterations: 26
/usr/local/lib/python3.5/dist-packages/skimage/transform/_warps.py:105: UserWarning: The default mode, 'constant', will be changed to 'reflect' in skimage 0.15.
warn("The default mode, 'constant', will be changed to 'reflect' in "
/usr/local/lib/python3.5/dist-packages/skimage/transform/_warps.py:110: UserWarning: Anti-aliasing will be enabled by default in skimage 0.15 to avoid aliasing artifacts when down-sampling images.
warn("Anti-aliasing will be enabled by default in skimage 0.15 to "
/usr/local/lib/python3.5/dist-packages/skimage/transform/_warps.py:105: UserWarning: The default mode, 'constant', will be changed to 'reflect' in skimage 0.15.
warn("The default mode, 'constant', will be changed to 'reflect' in "
/usr/local/lib/python3.5/dist-packages/skimage/transform/_warps.py:110: UserWarning: Anti-aliasing will be enabled by default in skimage 0.15 to avoid aliasing artifacts when down-sampling images.
warn("Anti-aliasing will be enabled by default in skimage 0.15 to "
2019-08-25 01:51:34.583854: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-08-25 01:51:37.909085: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-08-25 01:51:56.714283: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-08-25 01:51:59.075839: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-08-25 01:53:59.601785: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-08-25 01:54:02.886355: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-08-25 01:54:17.243781: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-08-25 01:54:23.341263: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-08-25 01:54:25.731433: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-08-25 01:56:10.457738: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2b546ceb130a:49:100 [0] NCCL INFO NET : Using interface eth0:172.17.0.2<0>
2b546ceb130a:49:100 [0] NCCL INFO NET/IB : Using interface eth0 for sideband communication
2b546ceb130a:49:100 [0] NCCL INFO Using internal Network Socket
2b546ceb130a:49:100 [0] NCCL INFO NET : Using interface eth0:172.17.0.2<0>
2b546ceb130a:49:100 [0] NCCL INFO NET/Socket : 1 interfaces found
NCCL version 2.3.7+cuda10.0
2b546ceb130a:49:100 [0] NCCL INFO rank 0 nranks 2
2b546ceb130a:50:99 [1] NCCL INFO NET : Using interface eth0:172.17.0.2<0>
2b546ceb130a:50:99 [1] NCCL INFO NET/IB : Using interface eth0 for sideband communication
2b546ceb130a:50:99 [1] NCCL INFO Using internal Network Socket
2b546ceb130a:50:99 [1] NCCL INFO rank 1 nranks 2
2b546ceb130a:50:99 [1] NCCL INFO comm 0x7f111c04f1d0 rank 1 nranks 2
2b546ceb130a:50:99 [1] NCCL INFO NET : Using interface eth0:172.17.0.2<0>
2b546ceb130a:50:99 [1] NCCL INFO NET/Socket : 1 interfaces found
2b546ceb130a:49:100 [0] NCCL INFO comm 0x7fe7780881f0 rank 0 nranks 2
2b546ceb130a:50:99 [1] NCCL INFO Could not find real path of /sys/class/net/eth0/device
2b546ceb130a:50:99 [1] NCCL INFO CUDA Dev 1, IP Interfaces : eth0(SOC)
2b546ceb130a:49:100 [0] NCCL INFO Could not find real path of /sys/class/net/eth0/device
2b546ceb130a:49:100 [0] NCCL INFO CUDA Dev 0, IP Interfaces : eth0(SOC)
2b546ceb130a:49:100 [0] NCCL INFO Using 256 threads
2b546ceb130a:49:100 [0] NCCL INFO Min Comp Cap 6
2b546ceb130a:49:100 [0] NCCL INFO Ring 00 : 0 1
2b546ceb130a:50:99 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via direct shared memory
2b546ceb130a:49:100 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
2b546ceb130a:49:100 [0] NCCL INFO comm 0x7fe7780881f0 rank 0 nranks 2 - COMPLETE
2b546ceb130a:50:99 [1] NCCL INFO comm 0x7f111c04f1d0 rank 1 nranks 2 - COMPLETE
2b546ceb130a:49:100 [0] NCCL INFO Launch mode Parallel
Epoch: 1/2000, Iter: 1/26 [ ] train_loss: 0.5711 time: 414.04s
Epoch: 1/2000, Iter: 2/26 [= ] train_loss: 0.5726 time: 2.45s
Epoch: 1/2000, Iter: 3/26 [== ] train_loss: 0.5686 time: 0.63s
Epoch: 1/2000, Iter: 4/26 [=== ] train_loss: 0.5673 time: 0.63s
Epoch: 1/2000, Iter: 5/26 [=== ] train_loss: 0.5662 time: 44.72s
Epoch: 1/2000, Iter: 6/26 [==== ] train_loss: 0.5640 time: 66.34s
Epoch: 1/2000, Iter: 7/26 [===== ] train_loss: 0.5629 time: 0.96s
Epoch: 1/2000, Iter: 8/26 [====== ] train_loss: 0.5603 time: 0.58s
Epoch: 1/2000, Iter: 9/26 [====== ] train_loss: 0.5587 time: 0.58s
Epoch: 1/2000, Iter: 10/26 [======= ] train_loss: 0.5579 time: 0.60s
Epoch: 1/2000, Iter: 11/26 [======== ] train_loss: 0.5565 time: 0.58s
Epoch: 1/2000, Iter: 12/26 [========= ] train_loss: 0.5544 time: 7.40s
Epoch: 1/2000, Iter: 13/26 [========== ] train_loss: 0.5538 time: 53.20s
Epoch: 1/2000, Iter: 14/26 [========== ] train_loss: 0.5533 time: 42.11s
Epoch: 1/2000, Iter: 15/26 [=========== ] train_loss: 0.5517 time: 46.14s
Epoch: 1/2000, Iter: 16/26 [============ ] train_loss: 0.5497 time: 0.59s
Epoch: 1/2000, Iter: 17/26 [============= ] train_loss: 0.5472 time: 25.10s
Epoch: 1/2000, Iter: 18/26 [============= ] train_loss: 0.5447 time: 0.58s
Epoch: 1/2000, Iter: 19/26 [============== ] train_loss: 0.5423 time: 10.44s
Epoch: 1/2000, Iter: 20/26 [=============== ] train_loss: 0.5397 time: 24.34s
Epoch: 1/2000, Iter: 21/26 [================ ] train_loss: 0.5370 time: 0.56s
Epoch: 1/2000, Iter: 22/26 [================ ] train_loss: 0.5340 time: 0.59s
Epoch: 1/2000, Iter: 23/26 [================= ] train_loss: 0.5315 time: 0.54s
Epoch: 1/2000, Iter: 24/26 [================== ] train_loss: 0.5276 time: 0.56s
Epoch: 1/2000, Iter: 25/26 [=================== ] train_loss: 0.5242 time: 20.99s
Epoch: 1/2000, Iter: 26/26 [====================] train_loss: 0.5206 time: 0.56s
2019-08-25 02:02:24,630 - LegacyFitter - INFO - This epoch: 791.18s; per epoch: 791.18s; elapsed: 791.18s; remaining: 1581577.07s; best metric: -1000000 at epoch 0
Epoch: 2/2000, Iter: 1/26 [ ] train_loss: 0.4235 time: 99.07s
Epoch: 2/2000, Iter: 2/26 [= ] train_loss: 0.4092 time: 0.58s
Epoch: 2/2000, Iter: 3/26 [== ] train_loss: 0.4019 time: 53.62s
Epoch: 2/2000, Iter: 4/26 [=== ] train_loss: 0.3962 time: 0.54s
Epoch: 2/2000, Iter: 5/26 [=== ] train_loss: 0.3945 time: 29.20s
Epoch: 2/2000, Iter: 6/26 [==== ] train_loss: 0.3891 time: 23.16s
Epoch: 2/2000, Iter: 7/26 [===== ] train_loss: 0.3807 time: 120.78s
Epoch: 2/2000, Iter: 8/26 [====== ] train_loss: 0.3758 time: 0.58s
Epoch: 2/2000, Iter: 9/26 [====== ] train_loss: 0.3694 time: 0.58s
Epoch: 2/2000, Iter: 10/26 [======= ] train_loss: 0.3636 time: 0.57s
Epoch: 2/2000, Iter: 11/26 [======== ] train_loss: 0.3570 time: 0.58s
Epoch: 2/2000, Iter: 12/26 [========= ] train_loss: 0.3509 time: 33.53s
Epoch: 2/2000, Iter: 13/26 [========== ] train_loss: 0.3457 time: 200.16s
Epoch: 2/2000, Iter: 14/26 [========== ] train_loss: 0.3394 time: 0.58s
Epoch: 2/2000, Iter: 15/26 [=========== ] train_loss: 0.3331 time: 0.58s
Epoch: 2/2000, Iter: 16/26 [============ ] train_loss: 0.3271 time: 0.58s
Epoch: 2/2000, Iter: 17/26 [============= ] train_loss: 0.3214 time: 11.01s
Epoch: 2/2000, Iter: 18/26 [============= ] train_loss: 0.3151 time: 83.98s
Epoch: 2/2000, Iter: 19/26 [============== ] train_loss: 0.3093 time: 0.52s
Epoch: 2/2000, Iter: 20/26 [=============== ] train_loss: 0.3042 time: 53.77s
Epoch: 2/2000, Iter: 21/26 [================ ] train_loss: 0.2981 time: 0.57s
Epoch: 2/2000, Iter: 22/26 [================ ] train_loss: 0.2919 time: 0.56s
Epoch: 2/2000, Iter: 23/26 [================= ] train_loss: 0.2875 time: 0.56s
Epoch: 2/2000, Iter: 24/26 [================== ] train_loss: 0.2816 time: 25.62s
Epoch: 2/2000, Iter: 25/26 [=================== ] train_loss: 0.2748 time: 0.55s
Epoch: 2/2000, Iter: 26/26 [====================] train_loss: 0.2682 time: 0.56s
2019-08-25 02:14:47,201 - LegacyFitter - INFO - This epoch: 742.56s; per epoch: 766.87s; elapsed: 1533.75s; remaining: 1532211.50s; best metric: -1000000 at epoch 0
Epoch: 3/2000, Iter: 1/26 [ ] train_loss: 0.1372 time: 96.06s
Epoch: 3/2000, Iter: 2/26 [= ] train_loss: 0.1175 time: 57.75s
Epoch: 3/2000, Iter: 3/26 [== ] train_loss: 0.1098 time: 130.69s
Epoch: 3/2000, Iter: 4/26 [=== ] train_loss: 0.1070 time: 0.52s
Epoch: 3/2000, Iter: 5/26 [=== ] train_loss: 0.1044 time: 14.28s
Epoch: 3/2000, Iter: 6/26 [==== ] train_loss: 0.1038 time: 0.55s
Epoch: 3/2000, Iter: 7/26 [===== ] train_loss: 0.1039 time: 46.21s
Epoch: 3/2000, Iter: 8/26 [====== ] train_loss: 0.1033 time: 0.53s
Epoch: 3/2000, Iter: 9/26 [====== ] train_loss: 0.1002 time: 0.56s
Epoch: 3/2000, Iter: 10/26 [======= ] train_loss: 0.0979 time: 43.84s
Epoch: 3/2000, Iter: 11/26 [======== ] train_loss: 0.1022 time: 52.91s
Epoch: 3/2000, Iter: 12/26 [========= ] train_loss: 0.0992 time: 0.59s
WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ops:
DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_final_dense_block_final_deconv_conv3d_transpose_grad_tuple_control_dependency_0 [missing ranks: 0]
DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_final_dense_block_final_conv_conv3d_Conv3D_grad_tuple_control_dependency_1_0 [missing ranks: 0]
DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_final_dense_block_final_bn_batch_normalization_batchnorm_sub_grad_tuple_control_dependency_0 [missing ranks: 0]
DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_stage10_dense_block_pseudo_3d_2_10b_iter2_bn4_batch_normalization_batchnorm_sub_grad_tuple_control_dependency_0 [missing ranks: 0]
DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_stage10_dense_block_pseudo_3d_2_10b_iter2_bn4_batch_normalization_batchnorm_mul_grad_tuple_control_dependency_1_0 [missing ranks: 0]
DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_stage10_dense_block_pseudo_3d_2_10b_iter2_conv3_conv3d_Conv3D_grad_tuple_control_dependency_1_0 [missing ranks: 0]
DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_stage10_dense_block_pseudo_3d_2_10b_iter2_bn3_batch_normalization_batchnorm_mul_grad_tuple_control_dependency_1_0 [missing ranks: 0]
DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_stage10_dense_block_pseudo_3d_2_10b_iter2_conv2_conv3d_Conv3D_grad_tuple_control_dependency_1_0 [missing ranks: 0]
DistributedAdamOptim

I have heard that GeForce-series GPUs have temperature protection (often around 80 °C). You can check the running temperature with the nvidia-smi command. The GPU will lower its clock speed if the temperature threshold is reached.
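
For example, a query like the following (standard nvidia-smi options, shown only as one possible way to watch for thermal throttling) prints the temperature, SM clock, power draw and utilization once per second; if the SM clock drops while the temperature sits near the threshold, throttling is the likely cause:

    nvidia-smi --query-gpu=temperature.gpu,clocks.sm,power.draw,utilization.gpu --format=csv -l 1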

Thanks for your advice.

I reran the train_2gpu.sh script; the temperature and power usage were both low.

I took snapshots of the nvidia-smi output when the GPUs were busy and when they were idle:

Tue Aug 27 08:40:29 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0 Off |                  N/A |
| 32%   56C    P2   123W / 200W |   7895MiB /  8114MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:04:00.0 Off |                  N/A |
| 29%   51C    P2   120W / 200W |   7901MiB /  8119MiB |     94%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     29696      C   python3                                     7885MiB |
|    1     29697      C   python3                                     7891MiB |
+-----------------------------------------------------------------------------+

#######################################################

Tue Aug 27 08:50:28 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0 Off |                  N/A |
| 29%   54C    P8    15W / 200W |   7895MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:04:00.0 Off |                  N/A |
| 22%   47C    P8     8W / 200W |   7901MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     29696      C   python3                                     7885MiB |
|    1     29697      C   python3                                     7891MiB |
+-----------------------------------------------------------------------------+

If other people's Clara projects work well, maybe the problem is caused by my configuration or Python environment. I will investigate and try some things later.

Hi

To increase performance, you should start with 1 GPU and try:

  1. Utilize all GPU memory. For this you can use nvidia-smi to check how much memory is used, then increase your batch_size in the train config until you are using all of the memory.
  2. Make sure you have enough CPU threads/workers doing prefetch and data augmentation to keep the GPU working at full utilization. For this use
    watch -n 1  nvidia-smi

    Then keep increasing the num_workers and the prefetch_size in the train config section below; you typically want your prefetch_size to be 2x your batch_size.

    "image_pipeline": {
                "name": "ImagePipeline",
                "args": {
                    "task": "segmentation",
                    "data_list_file_path": "{DATASET_JSON}",
                    "data_file_base_dir": "{DATA_ROOT}",
                    "data_list_key": "training",
                    "crop_size": [96, 96, 32],
                    "data_format": "channels_first",
                    "batch_size": 4,
                    "num_channels": 2,
                    "num_workers": 8,
                    "prefetch_size": 10
                }
            }
    

Once you see constant 100% GPU utilization with no drops in utilization, you should move to using 2+ GPUs via the train_2gpu.sh script.

Hope this helps

Thanks for your advice!

After some tests I found the solution. Some exams in Task03_Liver have a very large scan region. After I cropped them to a smaller region that contains only the liver organ with 32 pixels of padding, moved the ScaleIntensityRange step behind the CropForegroundObject step, and removed the NPRandomZoom3D step, the preprocessing time decreased a lot.
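
For reference, here is a minimal sketch of the kind of offline cropping I mean, using nibabel and numpy; the function name, paths and padding handling are only illustrative, not the exact script I used:

    import nibabel as nib
    import numpy as np
    from nibabel.affines import apply_affine

    def crop_to_foreground(image_path, label_path, out_image_path, out_label_path, pad=32):
        """Crop an image/label pair to the bounding box of the non-zero label, plus padding."""
        img = nib.load(image_path)
        lab = nib.load(label_path)
        img_data = img.get_fdata()
        lab_data = lab.get_fdata()

        # Foreground = any non-zero label voxel (liver and tumor in Task03_Liver).
        fg = np.nonzero(lab_data > 0)
        lo = [max(int(idx.min()) - pad, 0) for idx in fg]
        hi = [min(int(idx.max()) + pad + 1, size) for idx, size in zip(fg, lab_data.shape)]
        crop = tuple(slice(a, b) for a, b in zip(lo, hi))

        # Shift the affine origin so the cropped volume keeps its world coordinates;
        # the voxel spacing itself is unchanged by the crop.
        affine = img.affine.copy()
        affine[:3, 3] = apply_affine(img.affine, lo)

        nib.save(nib.Nifti1Image(img_data[crop], affine), out_image_path)
        nib.save(nib.Nifti1Image(lab_data[crop].astype(np.uint8), affine), out_label_path)

The idea is simply that every per-iteration transform afterwards (intensity scaling, cropping, augmentation) then operates on a much smaller volume, so the CPU-side preprocessing no longer starves the GPU.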