Extremely slow training and evaluation of yolo_v4_tiny

Please provide the following information when requesting support.

• Hardware (8 x V100)
• Network Type (Yolo_v4, yolo_v4_tiny)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)

tao info --verbose
dockers:                                                                                                                                                                                                   
        nvidia/tao/tao-toolkit:                                                                                                                                                                            
                4.0.0-tf2.9.1:                                                                                                                                                                             
                        docker_registry: nvcr.io                                                                                                                                                           
                        tasks:                                                                                                                                                                             
                                1. classification_tf2                                                                                                                                                      
                                2. efficientdet_tf2                                                                                                                                                        
                4.0.0-tf1.15.5:                                                                                                                                                                            
                        docker_registry: nvcr.io                                                                                                                                                           
                        tasks:                                                                                                                                                                             
                                1. augment
                                2. bpnet
                                3. classification_tf1
                                4. detectnet_v2
                                5. dssd
                                6. emotionnet
                                7. efficientdet_tf1
                                8. faster_rcnn
                                9. fpenet
                                10. gazenet
                                11. gesturenet
                                12. heartratenet
                                13. lprnet
                                14. mask_rcnn
                                15. multitask_classification
                                16. retinanet
                                17. ssd
                                18. unet
                                19. yolo_v3
                                20. yolo_v4
                                21. yolo_v4_tiny
                                22. converter
...
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit']
format_version: 2.0
toolkit_version: 4.0.1
published_date: 03/06/2023

• Training spec

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(23.06, 27.89), (35.12, 41.84), (39.06, 46.57)]"
  mid_anchor_shape: "[(35.74, 51.05), (44.68, 42.58), (43.01, 48.06)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.5
  arch: "cspdarknet_tiny"
  loss_loc_weight: 1.0
  loss_neg_obj_weights: 1.0
  loss_class_weights: 1.0
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.05
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 8
  num_epochs: 300
  enable_qat: false
  checkpoint_interval: 5
  pretrain_model_path: "/workspace/weights/pretrained/pretrained_object_detection_vcspdarknet_tiny/cspdarknet_tiny.hdf5"
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  max_queue_size: 20
  use_multiprocessing: true
  n_workers: 10
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 32
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.25
  clustering_iou_threshold: 0.5
  top_k: 200
}
augmentation_config {
  hue: 0.5
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0.5
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 512
  output_height: 512
  output_channel: 3
  randomize_input_shape_period: 10
  mosaic_prob: 0
  mosaic_min_ratio: 0.2
}

• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

!tao yolo_v4_tiny train -e configs/yolov4_training_conf.txt \
                       -r /workspace/checkpoints \
                       -k nvidia \
                       --gpus 8 \
                       --use_amp

GPU utilization hovers around 0%, with brief spikes once per batch.
CPU usage goes to 100% on all 80 cores while loading each batch.
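(For reference, this pattern is easy to watch with standard monitoring tools; the commands below are only an illustration, not output from TAO itself:)

nvidia-smi dmon -s u    # per-GPU utilization, sampled about once per second
top                     # press 1 to expand the per-core CPU view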

As you can see in the spec above, I have already played with

training_config {  
  ...
  max_queue_size: 20
  use_multiprocessing: true
  n_workers: 10
}

augmentation_config {
  ...
  mosaic_prob: 0.0
  ...
}

That didn't change much.

I also observed that both the EVAL and TRAIN phases are slow.
Evaluation on 2000 images with batch size 32 (~63 batches) takes 160 seconds on 8 GPUs, or about 0.63 seconds per image.
I use TFRecords for the annotations. The original images are 2464 × 2056; the network input is 512x512.
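For reference, the per-image figure works out as follows, assuming the evaluation batches are split evenly across the 8 GPUs:

2000 images / 32 per batch ≈ 63 batches (≈ 2016 images)
2016 images / 8 GPUs ≈ 252 images per GPU
160 s / 252 images ≈ 0.63 s per image per GPU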

Please set randomize_input_shape_period to 0 and retry.

I tried

randomize_input_shape_period: 0

This gave roughly a 10% improvement, which is still slow.
The pattern of GPU vs CPU utilization didn’t change.

Please add force_on_cpu: true in nms_config.

nms_config {
  confidence_threshold: 0.25
  clustering_iou_threshold: 0.5
  force_on_cpu: true
  top_k: 200
}
Setting force_on_cpu: true didn't change the throughput or the GPU/CPU utilization pattern.

Thanks for the info. We will check further.

I have the same issue with YoloV4. My hardware is a Supermicro Server 4028gr-tvrt with 8x Tesla V100.

For yolo_v4, you can use a deeper backbone, for example resnet101.
The GPU utilization will get higher.
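As a sketch, that would mean changing the backbone fields in yolov4_config roughly like this (assuming the resnet arch with nlayers selecting the depth; the anchors and other fields stay as they are):

yolov4_config {
  ...
  arch: "resnet"
  nlayers: 101
  ...
}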

Is it normal for the CPU usage to be so high? All my cores are at almost 100%. My GPU memory is also at full capacity; I can't set my batch size higher than 4, otherwise I run out of memory.

resnet101 is a large network, so it consumes more GPU memory, and the augmentation also consumes CPU resources. You can use other backbones that yolo_v4 supports, for example resnet50.

We are also working on improvements to GPU utilization. Hopefully they will be available in the next release.

I tried resnet101 as a backbone. The GPU utilization is only slightly higher; most of the time all GPUs are at 0%. Training one epoch takes about 2 hours with resnet18 as well as with resnet101.

I also tried this:

There is no update from you for a period, so we are assuming this is not an issue anymore and are closing this topic. If you need further support, please open a new one. Thanks.

Please disable the above augmentation and retry, i.e. set the following. Thanks.
hue: 0.0
saturation: 1.0
exposure: 1.0
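In the spec, that corresponds to the augmentation_config section looking roughly like this (the other fields stay as before; shown only to illustrate where the values go):

augmentation_config {
  hue: 0.0
  saturation: 1.0
  exposure: 1.0
  ...
}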

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.