Tao pre-trained yolo4tiny - AssertionError: Must have more boxes than clusters

I changed num_shards to 1 and it seems to work. Could you please elaborate on why that helped, and whether that is the correct fix?

The training dataset has only 8 images.
The val dataset has only 1 image.
So, if num_shards is set to 10, there will be zero-size tfrecords for the val dataset.

Please use more images for the train/val datasets.

This is weird, because the training folder contains 90 images and the val folder has 10 images (I'm aware it's a small collection, but for now it's OK).
The files are under /workspace/tao-experiments/data/chimera_ir_training/images, with the corresponding labels alongside.
Also, the output of the dataset creation is:

!python3 generate_val_dataset.py --input_image_dir=$LOCAL_DATA_DIR/chimera_ir_training/images \
                                        --input_label_dir=$LOCAL_DATA_DIR/chimera_ir_training/labels/ \
                                        --output_dir=$LOCAL_DATA_DIR/chimera_ir_val
Total 99 samples in KITTI training dataset
90 for train and 9 for val
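For reference, a minimal sketch of what such a split script might do (hypothetical; the actual generate_val_dataset.py may differ): shuffle the image list and move roughly 10% of the image/label pairs into a separate val directory.

```python
import random
import shutil
from pathlib import Path

def split_kitti_dataset(image_dir, label_dir, output_dir, val_fraction=0.1, seed=42):
    """Move ~val_fraction of the KITTI image/label pairs into output_dir.

    Returns the number of samples moved to the val set.
    """
    images = sorted(Path(image_dir).glob("*.png"))
    random.Random(seed).shuffle(images)
    n_val = int(len(images) * val_fraction)  # 99 images -> 9 for val

    out_images = Path(output_dir) / "image"
    out_labels = Path(output_dir) / "label"
    out_images.mkdir(parents=True, exist_ok=True)
    out_labels.mkdir(parents=True, exist_ok=True)

    for img in images[:n_val]:
        # KITTI convention: label file shares the image's base name.
        label = Path(label_dir) / (img.stem + ".txt")
        shutil.move(str(img), out_images / img.name)
        shutil.move(str(label), out_labels / label.name)
    return n_val
```

With 99 samples and val_fraction=0.1, this yields 90 for train and 9 for val, consistent with the log above.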

The tfrecords config file is:

kitti_config {
  root_directory_path: "/workspace/tao-experiments/data/chimera_ir_training"
  image_dir_name: "images"
  label_dir_name: "labels"
  image_extension: ".png"
  partition_mode: "random"
  num_partitions: 2
  val_split: 14
  num_shards: 10
}
image_directory_path: "/workspace/tao-experiments/data/chimera_ir_training"

So I'm not sure why the training set only has 8 images and the val set contains only one file.

Sorry, from the log you shared above, the validation dataset has 9 images in total.

Can you share your latest spec file?

train:

kitti_config {
  root_directory_path: "/workspace/tao-experiments/data/chimera_ir_training"
  image_dir_name: "images"
  label_dir_name: "labels"
  image_extension: ".png"
  partition_mode: "random"
  num_partitions: 2
  val_split: 14
  num_shards: 1
}
image_directory_path: "/workspace/tao-experiments/data/chimera_ir_training"

val:

kitti_config {
  root_directory_path: "/workspace/tao-experiments/data/chimera_ir_val"
  image_dir_name: "image"
  label_dir_name: "label"
  image_extension: ".png"
  partition_mode: "random"
  num_partitions: 2
  val_split: 14
  num_shards: 1
}
image_directory_path: "/workspace/tao-experiments/data/chimera_ir_val"

Sorry, I meant the training spec file. What you shared is not it.

train spec file:

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(42.00, 117.33), (37.50, 54.67), (21.75, 88.00)]"
  mid_anchor_shape: "[(16.50, 53.33), (21.75, 33.33),  (9.75, 40.00)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.5
  arch: "cspdarknet_tiny"
  loss_loc_weight: 1.0
  loss_neg_obj_weights: 1.0
  loss_class_weights: 1.0
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.05
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 8
  num_epochs: 80
  enable_qat: false
  checkpoint_interval: 10
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pretrain_model_path: "/workspace/tao-experiments/yolo_v4_tiny/pretrained_cspdarknet_tiny/pretrained_object_detection_vcspdarknet_tiny/cspdarknet_tiny.hdf5"
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 8
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  force_on_cpu: true
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 640
  output_height: 480
  output_channel: 3
  randomize_input_shape_period: 10
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
}
dataset_config {
  data_sources: {
      tfrecords_path: "/workspace/tao-experiments/data/chimera_ir_training/tfrecords/train*"
      image_directory_path: "/workspace/tao-experiments/data/chimera_ir_training"
  }
  include_difficult_in_training: true
  image_extension: "png"
  target_class_mapping {
      key: "person"
      value: "person"
  }
  target_class_mapping {
      key: "car"
      value: "car"
  }
  target_class_mapping {
      key: "truck"
      value: "truck"
  }
  target_class_mapping {
      key: "tank"
      value: "tank"
  }
  validation_data_sources: {
      tfrecords_path: "/workspace/tao-experiments/data/chimera_ir_val/tfrecords/val*"
      image_directory_path: "/workspace/tao-experiments/data/chimera_ir_val"
  }
}

retrain spec:

yolov4_config {
  big_anchor_shape: "[(42.00, 117.33), (37.50, 54.67), (21.75, 88.00)]"
  mid_anchor_shape: "[(16.50, 53.33), (21.75, 33.33),  (9.75, 40.00)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.5
  arch: "cspdarknet_tiny"
  loss_loc_weight: 1.0
  loss_neg_obj_weights: 1.0
  loss_class_weights: 1.0
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.05
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 8
  num_epochs: 80
  enable_qat: false
  checkpoint_interval: 10
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: NO_REG
    weight: 3e-9
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pruned_model_path: "/workspace/tao-experiments/yolo_v4_tiny/experiment_dir_pruned/yolov4_cspdarknet_tiny_pruned.tlt"
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 8
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  top_k: 200
  force_on_cpu: true
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 640
  output_height: 480
  output_channel: 3
  randomize_input_shape_period: 10
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
}
dataset_config {
  data_sources: {
      tfrecords_path: "/workspace/tao-experiments/data/chimera_ir_training/tfrecords/train*"
      image_directory_path: "/workspace/tao-experiments/data/chimera_ir_training"
  }
  include_difficult_in_training: true
  image_extension: "png"
  target_class_mapping {
      key: "person"
      value: "person"
  }
  target_class_mapping {
      key: "car"
      value: "car"
  }
  target_class_mapping {
      key: "truck"
      value: "truck"
  }
  target_class_mapping {
      key: "tank"
      value: "tank"
  }
  validation_data_sources: {
      tfrecords_path: "/workspace/tao-experiments/data/chimera_ir_val/tfrecords/val*"
      image_directory_path: "/workspace/tao-experiments/data/chimera_ir_val"
  }
}

Please remove all the zero-size tfrecords files.
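One way to clear them out is a small script like the following (a sketch; the tfrecords path below is an example from this thread, so point it at your actual output folder):

```python
from pathlib import Path

# Example path from this thread -- adjust to your own tfrecords folder.
tfrecords_dir = Path("/workspace/tao-experiments/data/chimera_ir_val/tfrecords")

for shard in tfrecords_dir.glob("*"):
    # Zero-byte shard files hold no records and can be deleted safely.
    if shard.is_file() and shard.stat().st_size == 0:
        print(f"removing empty shard: {shard}")
        shard.unlink()
```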

How do the zero-size files contribute to the error? And how can I avoid creating them in the first place?
Thanks

Please use more images for the validation set.

Previously, you set
val_split: 14
num_shards: 10

So, in the val tfrecords folder, the data is split into two partitions.
One is 9 × (1 − 14%) / 10, so each shard gets less than 1 image.
The other is 9 × 14% / 10, so each shard also gets less than 1 image.

This results in zero-size tfrecord files. They need to be deleted.
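The arithmetic above can be checked directly (assumed semantics; the converter's exact rounding may differ slightly):

```python
# Back-of-the-envelope check of the shard sizes described above.
total_val_images = 9
val_split = 14   # percent
num_shards = 10

part_a = total_val_images * (1 - val_split / 100)  # ~7.74 images
part_b = total_val_images * (val_split / 100)      # ~1.26 images

for name, part in [("part A", part_a), ("part B", part_b)]:
    per_shard = part / num_shards
    print(f"{name}: {per_shard:.3f} images per shard")  # both < 1

# With fewer than one image per shard, most shards receive zero
# records and are written out as 0-byte tfrecord files.
```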

Could you please elaborate on why the validation set is divided into two partitions, and why it is split into shards?

Because you are using the dataset converter.
See the DetectNet_v2 — TAO Toolkit 3.21.11 documentation.


Many thanks for your help.
I'm having a problem with step #10 (model export) in the TAO notebook.
The instructions explain that for Jetson devices, you should download the tao-converter for Jetson from the dev zone link here.
So I downloaded the "Jetson" converter, pulled the .tlt and .bin files that had been trained on the server, and followed the README instructions. So far so good, but executing ./tao-converter -h outputs the following message:
./tao-converter: error while loading shared libraries: libnvinfer.so.7: cannot open shared object file: No such file or directory

Side note:

ii  libnvidia-container0:arm64       0.10.0+jetpack                             arm64        NVIDIA container runtime library
ii  nvidia-container-csv-cuda        10.2.460-1                                 arm64        Jetpack CUDA CSV file
ii  nvidia-container-csv-cudnn       8.2.1.32-1+cuda10.2                        arm64        Jetpack CUDNN CSV file
ii  nvidia-container-csv-tensorrt    8.0.1.6-1+cuda10.2                         arm64        Jetpack TensorRT CSV file

Could you create a new topic for your latest error?
I think we have already fixed the original issue and some other issues.