Extremely slow training and evaluation of yolo_v4_tiny

• Hardware (8 x V100)
• Network Type (Yolo_v4, yolo_v4_tiny)
• Training spec

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(23.06, 27.89), (35.12, 41.84), (39.06, 46.57)]"
  mid_anchor_shape: "[(35.74, 51.05), (44.68, 42.58), (43.01, 48.06)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.5
  arch: "cspdarknet_tiny"
  loss_loc_weight: 1.0
  loss_neg_obj_weights: 1.0
  loss_class_weights: 1.0
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.05
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 8
  num_epochs: 300
  enable_qat: false
  checkpoint_interval: 5
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  max_queue_size: 20
  use_multiprocessing: true
  n_workers: 10
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 32
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.25
  clustering_iou_threshold: 0.5
  top_k: 200
}
augmentation_config {
  hue: 0.5
  saturation: 1.5
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 512
  output_height: 512
  output_channel: 3
  randomize_input_shape_period: 10
  mosaic_prob: 0
}

• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

!tao yolo_v4_tiny train -e configs/yolov4_training_conf.txt \
                       -r /workspace/checkpoints \
                       -k nvidia \
                       --gpus 8

GPU utilisation sits near 0%, with a short spike for each batch.
CPU usage goes to 100% on all 80 cores while each batch is loaded.

As you can see in the spec, I experimented with

training_config {
  max_queue_size: 20
  use_multiprocessing: true
  n_workers: 10
}

augmentation_config {
  mosaic_prob: 0.0
}

That didn’t change much.

I also observed that both the EVAL and TRAIN phases are slow.
Evaluation on 2000 images with batch size 32 (about 63 batches) takes 160 seconds on 8 GPUs, i.e. about 0.08 seconds per image of wall-clock time, or roughly 0.64 GPU-seconds per image.
I use TFRecords for annotations. The original images are 2464 × 2056; the network input is 512 × 512.
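
The evaluation figures above can be checked with a few lines of arithmetic. The sketch below only reuses the numbers from this post; whether the evaluation really spreads across all 8 GPUs is an assumption, so the GPU-seconds figure should be read as an upper bound.

```python
import math

# Numbers reported in the post.
num_images = 2000
batch_size = 32
eval_seconds = 160.0
num_gpus = 8  # assumption: all 8 GPUs take part in evaluation

# 2000 / 32 = 62.5, so the last batch is partial and 63 batches are run.
num_batches = math.ceil(num_images / batch_size)

seconds_per_batch = eval_seconds / num_batches
seconds_per_image = eval_seconds / num_images
gpu_seconds_per_image = seconds_per_image * num_gpus
images_per_second = num_images / eval_seconds

print(f"batches:               {num_batches}")
print(f"seconds per batch:     {seconds_per_batch:.2f}")
print(f"seconds per image:     {seconds_per_image:.3f}")
print(f"GPU-seconds per image: {gpu_seconds_per_image:.2f}")
print(f"images per second:     {images_per_second:.1f}")
```

Roughly 12.5 images per second in total is very low for a tiny backbone at 512 × 512, which fits the picture of the GPUs waiting on data loading.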

Please set randomize_input_shape_period to 0 and retry.

I tried

randomize_input_shape_period: 0

This gave about a 10% improvement, but training is still slow.
The pattern of GPU vs CPU utilization didn’t change.
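
To compare the utilization pattern between runs with numbers rather than by eye, a small sampler along these lines could be run next to the training job. It is only a sketch: it assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` and `--format` options; the function names are hypothetical and not part of TAO.

```python
import subprocess
import time


def parse_gpu_utilization(csv_text):
    """Parse 'index, utilization' CSV rows into {gpu_index: percent}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        usage[int(index)] = int(util)
    return usage


def sample_gpu_utilization():
    """Query nvidia-smi once and return the utilization of every GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_utilization(result.stdout)


def log_utilization(samples=10, interval=1.0):
    """Print one utilization snapshot per interval (hypothetical helper)."""
    for _ in range(samples):
        print(sample_gpu_utilization())
        time.sleep(interval)
```

Calling `log_utilization()` while training runs prints one dictionary per second, so a run that is starved for data shows mostly zeros with occasional spikes.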

Please add force_on_cpu: true in nms_config.

nms_config {
  confidence_threshold: 0.25
  clustering_iou_threshold: 0.5
  top_k: 200
  force_on_cpu: true
}

This didn’t change the throughput or the GPU/CPU utilization pattern.

Thanks for the info. We will check further.

I have the same issue with YoloV4. My hardware is a Supermicro Server 4028gr-tvrt with 8x Tesla V100.

For yolov4, you can use a deeper backbone, for example resnet101.
GPU utilization will then be higher.

Is it normal for the CPU usage to be so high? My cores are all at almost 100%. My GPU memory is also at full capacity. I can’t set my batch size higher than 4, otherwise I’ll run out of memory.

resnet101 is a large network, so it consumes a lot of GPU memory. The augmentation also consumes CPU resources. You can use a different backbone that yolov4 supports, for example resnet50.

We are also working on improving GPU utilization. Hopefully this will be available in the next release.

I tried resnet101 as a backbone. The GPU utilization is only slightly higher. Most of the time all GPUs are at 0%. Training for 1 epoch takes about 2 hours with Resnet18 as well as Resnet101.

I also tried this:

Please disable above augmentation and retry. Thanks.
hue: 0.0
saturation: 1.0

