resnet_18_DetectNet_v2 Training UnknownError

I am trying to train on the Pascal VOC dataset. I prepared the dataset by converting it to KITTI format
and generated all the necessary spec files. After resolving some common errors, training started successfully.
Later this UnknownError occurred, and I am a bit confused now.

Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 10, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "</usr/local/lib/python2.7/dist-packages/decorator.pyc:decorator-gen-2>", line 2, in main
  File "./detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "./detectnet_v2/scripts/train.py", line 632, in main
  File "./detectnet_v2/scripts/train.py", line 556, in run_experiment
  File "./detectnet_v2/scripts/train.py", line 490, in train_gridbox
  File "./detectnet_v2/scripts/train.py", line 136, in run_training_loop
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
     [[node DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_resnet18_nopool_bn_detectnet_v2_bn_conv1_FusedBatchNorm_grad_tuple_control_dependency_2_0 (defined at <string>:78) ]]
     [[node DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_resnet18_nopool_bn_detectnet_v2_block_3a_bn_1_FusedBatchNorm_grad_tuple_control_dependency_2_0 (defined at <string>:78) ]]

Caused by op u'DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_resnet18_nopool_bn_detectnet_v2_bn_conv1_FusedBatchNorm_grad_tuple_control_dependency_2_0', defined at:
  File "/usr/local/bin/tlt-train-g1", line 10, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "</usr/local/lib/python2.7/dist-packages/decorator.pyc:decorator-gen-2>", line 2, in main
  File "./detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "./detectnet_v2/scripts/train.py", line 632, in main
  File "./detectnet_v2/scripts/train.py", line 556, in run_experiment
  File "./detectnet_v2/scripts/train.py", line 466, in train_gridbox
  File "./detectnet_v2/scripts/train.py", line 320, in build_training_graph
  File "./detectnet_v2/model/detectnet_model.py", line 496, in build_training_graph
  File "./detectnet_v2/training/train_op_generator.py", line 60, in get_train_op
  File "./detectnet_v2/training/train_op_generator.py", line 75, in _get_train_op_without_cost_scaling
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 403, in minimize
    grad_loss=grad_loss)
  File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 230, in compute_gradients
    avg_grads = self._allreduce_grads(grads)
  File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 209, in allreduce_grads
    for grad in grads]
  File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 88, in allreduce
    summed_tensor_compressed = _allreduce(tensor_compressed)
  File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/mpi_ops.py", line 91, in _allreduce
    return MPI_LIB.horovod_allreduce(tensor, name=name)
  File "<string>", line 78, in horovod_allreduce
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
     [[node DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_resnet18_nopool_bn_detectnet_v2_bn_conv1_FusedBatchNorm_grad_tuple_control_dependency_2_0 (defined at <string>:78) ]]
     [[node DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_resnet18_nopool_bn_detectnet_v2_block_3a_bn_1_FusedBatchNorm_grad_tuple_control_dependency_2_0 (defined at <string>:78) ]]
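
The Horovod message above says the real exception should appear in the log before the first shutdown message. A minimal sketch for finding it, assuming the full training output was saved to a text file: the function name and the keyword list are illustrative choices, not part of TLT or Horovod.

```python
import re

# Text taken from the error message in the traceback above.
SHUTDOWN_MARKER = "Horovod has been shut down"

# Assumption: the underlying failure mentions one of these words.
# Extend the pattern for your own logs if needed.
SUSPICIOUS = re.compile(r"error|exception|killed|out of memory", re.IGNORECASE)


def suspicious_lines_before_shutdown(log_text):
    """Return (line_number, line) pairs that look like errors and appear
    before the first Horovod shutdown message. Line numbers are 1-indexed.
    Returns an empty list if the shutdown message is not in the log."""
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        if SHUTDOWN_MARKER in line:
            # Only inspect what was logged before the first shutdown message.
            return [
                (number, text.strip())
                for number, text in enumerate(lines[:index], start=1)
                if SUSPICIOUS.search(text)
            ]
    return []
```

To use it, read the saved log (for example a hypothetical train.log) with open("train.log").read() and pass the text to suspicious_lines_before_shutdown. Anything it returns is only a candidate; the lines still need to be read in context.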

Here is my spec_file:

random_seed: 42
model_config {
  pretrained_model_file: "/workspace/pretrained_model/tlt_resnet18_detectnet_v2_v1/resnet18.hdf5"
  num_layers: 18
  freeze_blocks: 0
  arch: "resnet"
  use_batch_norm: true
  objective_set: {
    cov {}
    bbox {
      scale: 35.0
      offset: 0.5
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
}

bbox_rasterizer_config {
  target_class_config {
    key: "car"
    value: {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.4
      cov_radius_y: 0.4
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "bicycle"
    value: {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "person"
    value: {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.67
}

cost_function_config {
  target_classes {
    name: "car"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  target_classes {
    name: "bicycle"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 1.0
    }
  }
  target_classes {
    name: "person"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: True
  max_objective_weight: 0.9999
  min_objective_weight: 0.0001
}

training_config {
  batch_size_per_gpu: 32
  num_epochs: 20
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-6
      max_learning_rate: 5e-4
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-9
  }
  optimizer {
    adam {
      epsilon: 1e-08
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    enabled: False
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
}

augmentation_config {
  preprocessing {
    output_image_width: 480
    output_image_height: 320
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    color_shift_stddev: 0.0
    hue_rotation_max: 25.0
    saturation_shift_max: 0.2
    contrast_scale_max: 0.1
    contrast_center: 0.5
  }
}

postprocessing_config {
  target_class_config {
    key: "car"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.13
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 1
      }
    }
  }
  target_class_config {
    key: "bicycle"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 1
      }
    }
  }
  target_class_config {
    key: "person"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 1
      }
    }
  }
}

dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tf_records/*"
    image_directory_path: "/workspace/dataset/VOCdevkit/VOC2012"
  }
  image_extension: "jpg"
  target_class_mapping {
      key: "car"
      value: "car"
  }
  target_class_mapping {
      key: "person"
      value: "person"
  }
  target_class_mapping {
      key: "bicycle"
      value: "bicycle"
  }
  validation_fold: 0
}

evaluation_config {
  validation_period_during_training: 10
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "car"
    value: 0.7
  }
  minimum_detection_ground_truth_overlap {
    key: "bicycle"
    value: 0.5
  }
  minimum_detection_ground_truth_overlap {
    key: "person"
    value: 0.5
  }
}

Same root cause as https://devtalk.nvidia.com/default/topic/1064911/transfer-learning-toolkit/tlt-first-tutorial-error/ or https://devtalk.nvidia.com/default/topic/1066439/transfer-learning-toolkit/unkown-error-by-horovod/?offset=3#5400841

Hey, since we are dealing with the same error in https://devtalk.nvidia.com/default/topic/1066439/transfer-learning-toolkit/unkown-error-by-horovod/?offset=3#5400841
it’s better that I don’t clutter the forum. I didn’t find any option to delete this topic.
How can I delete it? If you can delete it, please do.

The issue is being tracked at https://devtalk.nvidia.com/default/topic/1066439/transfer-learning-toolkit/unkown-error-by-horovod/?offset=3#5400841