Please provide the following information when requesting support.
• Hardware (8 x V100)
• Network Type (yolo_v4_tiny)
• TLT Version (output of "tao info --verbose" below):
tao info --verbose
dockers:
  nvidia/tao/tao-toolkit:
    4.0.0-tf2.9.1:
      docker_registry: nvcr.io
      tasks:
        1. classification_tf2
        2. efficientdet_tf2
    4.0.0-tf1.15.5:
      docker_registry: nvcr.io
      tasks:
        1. augment
        2. bpnet
        3. classification_tf1
        4. detectnet_v2
        5. dssd
        6. emotionnet
        7. efficientdet_tf1
        8. faster_rcnn
        9. fpenet
        10. gazenet
        11. gesturenet
        12. heartratenet
        13. lprnet
        14. mask_rcnn
        15. multitask_classification
        16. retinanet
        17. ssd
        18. unet
        19. yolo_v3
        20. yolo_v4
        21. yolo_v4_tiny
        22. converter
...
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit']
format_version: 2.0
toolkit_version: 4.0.1
published_date: 03/06/2023
• Training spec
random_seed: 42
yolov4_config {
  big_anchor_shape: "[(23.06, 27.89), (35.12, 41.84), (39.06, 46.57)]"
  mid_anchor_shape: "[(35.74, 51.05), (44.68, 42.58), (43.01, 48.06)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.5
  arch: "cspdarknet_tiny"
  loss_loc_weight: 1.0
  loss_neg_obj_weights: 1.0
  loss_class_weights: 1.0
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.05
  freeze_bn: false
  # freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 8
  num_epochs: 300
  enable_qat: false
  checkpoint_interval: 5
  pretrain_model_path: "/workspace/weights/pretrained/pretrained_object_detection_vcspdarknet_tiny/cspdarknet_tiny.hdf5"
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  max_queue_size: 20
  use_multiprocessing: true
  n_workers: 10
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 32
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.25
  clustering_iou_threshold: 0.5
  top_k: 200
}
augmentation_config {
  hue: 0.5
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0.5
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 512
  output_height: 512
  output_channel: 3
  randomize_input_shape_period: 10
  mosaic_prob: 0.0
  mosaic_min_ratio: 0.2
}
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)
!tao yolo_v4_tiny train -e configs/yolov4_training_conf.txt \
    -r /workspace/checkpoints \
    -k nvidia \
    --gpus 8 \
    --use_amp
During training, GPU utilisation sits near 0% with only brief spikes as each batch is processed, while CPU usage climbs to 100% on all 80 cores while each batch is being loaded.
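For reference, this is roughly how I took those readings (standard tools, nothing TAO-specific; the one-second sampling interval is just my choice):

# GPU utilisation sampled once per second (the "sm" column is compute utilisation)
nvidia-smi dmon -s u -d 1

# Per-core CPU load once per second (mpstat is part of the sysstat package)
mpstat -P ALL 1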
As you can see above, I have already experimented with the data-loader settings and with disabling mosaic augmentation:
training_config {
  ...
  max_queue_size: 20
  use_multiprocessing: true
  n_workers: 10
}
augmentation_config {
  ...
  mosaic_prob: 0.0
  ...
}
Neither change made a noticeable difference.
I also observed that both the evaluation and the training phases are slow.
Evaluation on 2000 images with batch size 32 (≈63 batches) takes 160 seconds on 8 GPUs, i.e. 160 s × 8 GPUs ÷ ~2016 images ≈ 0.63 GPU-seconds per image.
I use TFRecords for the annotations. The original images are 2464 × 2056, and the network input is 512 × 512.
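Given that gap between the original and the network resolution, my suspicion is that the loader spends most of its time decoding and downscaling the full-resolution images on the CPU. One workaround I am considering (not tried yet; the tool and the paths below are only an example) is downscaling the images offline before regenerating the TFRecords:

# Hypothetical offline downscale with ImageMagick: fit the longer side into 1024 px
# while keeping the aspect ratio; the bounding-box labels would have to be rescaled
# by the same factor before the TFRecords are regenerated.
mogrify -path /workspace/data/images_resized -resize 1024x1024 /workspace/data/images/*.jpg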