Extremely slow training and evaluation of yolo_v4_tiny

Please provide the following information when requesting support.

• Hardware (8 x V100)
• Network Type (Yolo_v4, yolo_v4_tiny)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)

tao info --verbose
dockers:                                                                                                                                                                                                   
        nvidia/tao/tao-toolkit:                                                                                                                                                                            
                4.0.0-tf2.9.1:                                                                                                                                                                             
                        docker_registry: nvcr.io                                                                                                                                                           
                        tasks:                                                                                                                                                                             
                                1. classification_tf2                                                                                                                                                      
                                2. efficientdet_tf2                                                                                                                                                        
                4.0.0-tf1.15.5:                                                                                                                                                                            
                        docker_registry: nvcr.io                                                                                                                                                           
                        tasks:                                                                                                                                                                             
                                1. augment
                                2. bpnet
                                3. classification_tf1
                                4. detectnet_v2
                                5. dssd
                                6. emotionnet
                                7. efficientdet_tf1
                                8. faster_rcnn
                                9. fpenet
                                10. gazenet
                                11. gesturenet
                                12. heartratenet
                                13. lprnet
                                14. mask_rcnn
                                15. multitask_classification
                                16. retinanet
                                17. ssd
                                18. unet
                                19. yolo_v3
                                20. yolo_v4
                                21. yolo_v4_tiny
                                22. converter
...
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit']
format_version: 2.0
toolkit_version: 4.0.1
published_date: 03/06/2023

• Training spec

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(23.06, 27.89), (35.12, 41.84), (39.06, 46.57)]"
  mid_anchor_shape: "[(35.74, 51.05), (44.68, 42.58), (43.01, 48.06)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.5
  arch: "cspdarknet_tiny"
  loss_loc_weight: 1.0
  loss_neg_obj_weights: 1.0
  loss_class_weights: 1.0
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.05
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 8
  num_epochs: 300
  enable_qat: false
  checkpoint_interval: 5
  pretrain_model_path: "/workspace/weights/pretrained/pretrained_object_detection_vcspdarknet_tiny/cspdarknet_tiny.hdf5"
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  max_queue_size: 20
  use_multiprocessing: true
  n_workers: 10
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 32
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.25
  clustering_iou_threshold: 0.5
  top_k: 200
}
augmentation_config {
  hue: 0.5
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0.5
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 512
  output_height: 512
  output_channel: 3
  randomize_input_shape_period: 10
  mosaic_prob: 0
  mosaic_min_ratio: 0.2
}

• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

!tao yolo_v4_tiny train -e configs/yolov4_training_conf.txt \
                       -r /workspace/checkpoints \
                       -k nvidia \
                       --gpus 8 \
                       --use_amp

GPU utilization hovers around 0%, with brief spikes once per batch.
CPU usage goes to 100% on all 80 cores while loading each batch.
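(For reference, this pattern is easy to watch with standard monitoring tools; the commands below are only an illustration, not output from TAO itself:)

nvidia-smi dmon -s u    # per-GPU utilization, sampled about once per second
top                     # press 1 to expand the per-core CPU view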

As you can see in the spec above, I have already played with

training_config {  
  ...
  max_queue_size: 20
  use_multiprocessing: true
  n_workers: 10
}

augmentation_config {
  ...
  mosaic_prob: 0.0
  ...
}

That didn't change much.

I also observed that both the EVAL and TRAIN phases are slow.
Evaluation on 2000 images with batch size 32 (~63 batches) takes 160 seconds on 8 GPUs, or about 0.63 seconds per image.
I use TFRecords for the annotations. The original images are 2464 × 2056; the network input is 512x512.
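For reference, the per-image figure works out as follows, assuming the evaluation batches are split evenly across the 8 GPUs:

2000 images / 32 per batch ≈ 63 batches (≈ 2016 images)
2016 images / 8 GPUs ≈ 252 images per GPU
160 s / 252 images ≈ 0.63 s per image per GPU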

Please set randomize_input_shape_period to 0 and retry.

I tried

randomize_input_shape_period: 0

This gave roughly a 10% improvement, which is still slow.
The pattern of GPU vs CPU utilization didn’t change.

Please add force_on_cpu: true in nms_config.

nms_config {
  confidence_threshold: 0.25
  clustering_iou_threshold: 0.5
  force_on_cpu: true
  top_k: 200
}
Setting force_on_cpu: true didn't change the throughput or the GPU/CPU utilization pattern.

Thanks for the info. We will check further.

I have the same issue with YoloV4. My hardware is a Supermicro Server 4028gr-tvrt with 8x Tesla V100.

For yolo_v4, you can use a deeper backbone, for example resnet101.
The GPU utilization will get higher.
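As a sketch, that would mean changing the backbone fields in yolov4_config roughly like this (assuming the resnet arch with nlayers selecting the depth; the anchors and other fields stay as they are):

yolov4_config {
  ...
  arch: "resnet"
  nlayers: 101
  ...
}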

Is it normal for the CPU usage to be so high? All my cores are at almost 100%. My GPU memory is also at full capacity; I can't set my batch size higher than 4, otherwise I run out of memory.

resnet101 is a large network, so it consumes more GPU memory, and the augmentation also consumes CPU resources. You can use other backbones that yolo_v4 supports, for example resnet50.

We are also working on improvements to GPU utilization. Hopefully they will be available in the next release.

I tried resnet101 as a backbone. The GPU utilization is only slightly higher; most of the time all GPUs are at 0%. Training one epoch takes about 2 hours with resnet18 as well as with resnet101.

I also tried this:

There is no update from you for a period, so we are assuming this is not an issue anymore and are closing this topic. If you need further support, please open a new one. Thanks.

Please disable the above augmentation and retry, i.e. set the following. Thanks.
hue: 0.0
saturation: 1.0
exposure: 1.0
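In the spec, that corresponds to the augmentation_config section looking roughly like this (the other fields stay as before; shown only to illustrate where the values go):

augmentation_config {
  hue: 0.0
  saturation: 1.0
  exposure: 1.0
  ...
}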

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.