I am training with 8 GPUs, and after some number of epochs I randomly get these stall warnings:
[2020-08-03 11:25:36.480844: W horovod/common/operations.cc:588] training_1/SGD/DistributedSGD_Allreduce/HorovodAllreduce_training_1_SGD_gradients_block_1b_bn_2_1_FusedBatchNorm_grad_FusedBatchNormGrad_1 [missing ranks: 4]
[2020-08-03 11:25:36.480864: W horovod/common/operations.cc:588] training_1/SGD/DistributedSGD_Allreduce/HorovodAllreduce_training_1_SGD_gradients_block_1b_bn_2_1_FusedBatchNorm_grad_FusedBatchNormGrad_2 [missing ranks: 4]
[2020-08-03 11:25:36.480884: W horovod/common/operations.cc:588] training_1/SGD/DistributedSGD_Allreduce/HorovodAllreduce_training_1_SGD_gradients_AddN_52_0 [missing ranks: 4]