Missing ranks

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
H100
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
LPRnet
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
5.0.0-tf1.15.5
• Training spec file (If you have one, please share it here)

random_seed: 42
lpr_config {
  hidden_units: 512
  max_label_length: 12
  arch: "baseline"
  nlayers: 18 # setting nlayers to 18 selects the baseline18 model
}
training_config {
  batch_size_per_gpu: 2048
  num_epochs: 150
  learning_rate {
  soft_start_annealing_schedule {
    min_learning_rate: 1e-6
    max_learning_rate: 1e-5
    soft_start: 0.001
    annealing: 0.5
  }
  }
  regularizer {
    type: L2
    weight: 5e-4
  }
}
eval_config {
  validation_period_during_training: 5
  batch_size: 1
}
augmentation_config {
  output_width: 100
  output_height: 48
  output_channel: 3
  max_rotate_degree: 5
  rotate_prob: 0.5
  gaussian_kernel_size: 5
  gaussian_kernel_size: 7
  gaussian_kernel_size: 15
  blur_prob: 0.5
  reverse_color_prob: 0.5
  keep_original_prob: 0.3
}
dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tao-training/LPRnet_training/dataset/char/train/labels"
    image_directory_path: "/workspace/tao-training/LPRnet_training/dataset/char/train/images"
  }
  characters_list_file: "/workspace/tao-training/LPRnet_training/us_lp_characters.txt"
  validation_data_sources: {
    label_directory_path: "/workspace/tao-training/LPRnet_training/dataset/char/val/labels"
    image_directory_path: "/workspace/tao-training/LPRnet_training/dataset/char/val/images"
  }
}
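
Before looking at the multi-GPU side, it can be worth ruling out dataset problems. The following is only a minimal sketch, assuming the usual TAO LPRNet layout (one plain-text label file per image, one allowed character per line in characters_list_file) and reusing the paths and max_label_length from the spec above, to flag labels that contain unlisted characters or exceed the configured length:

# Hypothetical label sanity check for the dataset referenced in the spec above.
# Assumes each image has a matching <name>.txt in label_directory_path that
# holds the plate string, and that characters_list_file lists one character per line.
import os
import sys

CHAR_FILE = "/workspace/tao-training/LPRnet_training/us_lp_characters.txt"
LABEL_DIR = "/workspace/tao-training/LPRnet_training/dataset/char/train/labels"
MAX_LABEL_LENGTH = 12  # must match max_label_length in lpr_config

def load_charset(path):
    with open(path, "r") as f:
        return {line.rstrip("\n") for line in f if line.rstrip("\n")}

def main():
    charset = load_charset(CHAR_FILE)
    bad = 0
    for name in sorted(os.listdir(LABEL_DIR)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(LABEL_DIR, name), "r") as f:
            label = f.read().strip()
        unknown = [c for c in label if c not in charset]
        if unknown or len(label) > MAX_LABEL_LENGTH:
            bad += 1
            print(f"{name}: label={label!r} unknown={unknown} len={len(label)}")
    print(f"{bad} problematic label file(s) found")
    sys.exit(1 if bad else 0)

if __name__ == "__main__":
    main()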

• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

Missing ranks:
0: [training/DistributedSGD_Allreduce/cond/HorovodAllreduce_training_DistributedSGD_gradients_gradients_AddN_58_0, training/DistributedSGD_Allreduce/cond_1/HorovodAllreduce_training_DistributedSGD_gradients_gradients_AddN_57_0, training/DistributedSGD_Allreduce/cond_10/HorovodAllreduce_training_DistributedSGD_gradients_gradients_AddN_53_0, training/DistributedSGD_Allreduce/cond_11/HorovodAllreduce_training_DistributedSGD_gradients_gradients_AddN_51_0, training/DistributedSGD_Allreduce/cond_12/HorovodAllreduce_training_DistributedSGD_gradients_gradients_bn2a_branch1_FusedBatchNormV3_grad_FusedBatchNormGradV3_1, training/DistributedSGD_Allreduce/cond_13/HorovodAllreduce_training_DistributedSGD_gradients_gradients_bn2a_branch1_FusedBatchNormV3_grad_FusedBatchNormGradV3_2 ...]

This is the error I am facing during training; it occurs when the model is saved at the fifth epoch.

Does the training hang? Please share the full log.
Is it running with multiple GPUs? If yes, please check whether training works with a single GPU.
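
One quick way to check that is to confirm every Horovod rank can start and complete an allreduce outside of TAO. The snippet below is only a sketch, assuming Horovod with the TF1 backend as shipped in the 5.0.0-tf1.15.5 container; the file name hvd_check.py is arbitrary, and it would be launched with horovodrun -np <num_gpus> python hvd_check.py. Every rank should print one line with the same allreduced value; a rank that never prints corresponds to one of the "Missing ranks" in the error above.

# hvd_check.py -- minimal Horovod sanity check (a sketch, not the TAO trainer itself).
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to one GPU, mirroring what a multi-GPU TAO run does.
config = tf.compat.v1.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

with tf.compat.v1.Session(config=config) as sess:
    # Average a constant across all ranks; every rank should get 1.0 back.
    value = sess.run(hvd.allreduce(tf.constant(1.0)))
    print(f"rank {hvd.rank()}/{hvd.size()} (local {hvd.local_rank()}): allreduce -> {value}")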

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.