Train yolov3

jiyuwang · October 27, 2020, 2:30am

I encountered this warning during training, and later found that the mAP of the model trained with yolov3 is very low. Is it because of this warning? What should I do?

Morganh · October 27, 2020, 2:45am

Hi jiyuwang,
The screenshot is hard to read. Could you please attach the training log as a txt file? Please attach your training spec too. Thanks.

jiyuwang · October 27, 2020, 2:48am

j1027100919_full.log (193.8 KB)

jiyuwang · October 27, 2020, 2:52am

Faster-rcnn and retina-net have already been applied to my dataset, but with yolov3 there are always problems with the mAP and loss. I have changed the learning rate several times.

Morganh · October 27, 2020, 3:14am

I am afraid some parameters need fine-tuning.
Please also refer to the spec in the jupyter notebook (examples/yolo/specs/yolo_train_resnet18_kitti.txt).

Suggestions:

- Set freeze_blocks: 0 and freeze_bn: false
- Set a larger batch_size_per_gpu, for example, 4
- Then finetune the learning rate if there is nan loss

jiyuwang · October 27, 2020, 3:17am

OK, let me try. A larger batch_size_per_gpu is a problem, though: our images are 1920x1080 (more than 4 times larger than the default value). When I set it to 4, OOM occurred.

Morganh · October 27, 2020, 3:19am

If possible, you can use two gpus to train.
Or you can resize the 1920x1080 images/labels to 960x544.
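
On the label side, the resize is a per-file rescale of the box coordinates. A minimal sketch, assuming the labels are KITTI-format text files (2D box in fields 4 to 7: left, top, right, bottom). The function names are only for illustration, and the images themselves would still need to be resized with an image tool:

```python
def scale_kitti_line(line, sx, sy):
    """Scale the 2D box (fields 4-7) of one KITTI-format label line."""
    fields = line.split()
    if len(fields) < 8:
        # Not a full KITTI row (assumption: such lines are left untouched).
        return line
    left, top, right, bottom = (float(v) for v in fields[4:8])
    fields[4:8] = [
        f"{left * sx:.2f}",
        f"{top * sy:.2f}",
        f"{right * sx:.2f}",
        f"{bottom * sy:.2f}",
    ]
    return " ".join(fields)


def scale_kitti_labels(text, src=(1920, 1080), dst=(960, 544)):
    """Rescale every box in a label file's text from src (w, h) to dst (w, h)."""
    sx = dst[0] / src[0]  # horizontal factor, 0.5 for 1920 -> 960
    sy = dst[1] / src[1]  # vertical factor, 544/1080 for 1080 -> 544
    lines = [l for l in text.splitlines() if l.strip()]
    return "\n".join(scale_kitti_line(l, sx, sy) for l in lines)
```

Note that the vertical factor is 544/1080 rather than exactly 0.5, so the aspect ratio changes slightly.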

jiyuwang · October 27, 2020, 3:23am

I used 4 GPUs with 16 GB each. Yes, I have thought about reducing the size, but that may more or less destroy pixels or information in the original images, so I chose not to shrink them. Also, our images are already labeled, so I would need to modify the label boxes again.

Morganh · October 27, 2020, 3:28am

What does your attached screenshot for yolo_v3 show? Do you mean you have already trained a yolo_v3 model successfully?
Also, if you use 4 gpus, I would not expect it to result in OOM.

jiyuwang · October 27, 2020, 3:30am

That yolov3 run uses the default settings, with resnet18 as the backbone network. I am now trying darknet53, and darknet53 always has problems… This troubles me a lot.

jiyuwang · October 27, 2020, 3:31am

In fact, I tried to set batch_size to 4 but an OOM error did occur, so I adjusted it down.

Morganh · October 27, 2020, 3:31am

If possible, could you share some images for me to train?

Morganh · October 27, 2020, 3:35am

Also, could you try darknet53 in the default jupyter notebook to train on the KITTI dataset with your 4 gpus? Is there any error?

jiyuwang · October 27, 2020, 6:22am

Sorry, this is our company’s private data, so I am afraid I can’t give it to you. I used 4 GPUs to train the faster-rcnn and retina-net models without any problems, and they are quite accurate. But yolov3 failed.

Morganh · October 27, 2020, 6:26am

OK, as I mentioned above, please try to set/fine-tune some hyper-parameters (bs, learning_rate, freeze_blocks, freeze_bn, etc.) in the yolo spec.
You can also train on the public KITTI dataset with your spec and 4 gpus, to compare.

jiyuwang · October 27, 2020, 6:31am

OK, let me try again.

Morganh · October 27, 2020, 6:38am

A small tip: to speed up your experiments, try training with a small part of your training data first, in order to settle on your training spec. If the mAP is as expected, then increase the training data.
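
One way to carve out such a subset is sketched below. It assumes one label .txt per image with the same file stem; the function name and the images/labels output folder names are hypothetical, and the training spec would then need to point at the subset folders:

```python
import random
import shutil
from pathlib import Path


def copy_subset(image_dir, label_dir, out_dir, fraction=0.1, seed=0):
    """Copy a random fraction of image/label pairs into out_dir."""
    image_dir, label_dir, out_dir = Path(image_dir), Path(label_dir), Path(out_dir)
    # Keep only images that have a label file with the same stem.
    pairs = sorted(
        (img, label_dir / (img.stem + ".txt"))
        for img in image_dir.iterdir()
        if img.is_file() and (label_dir / (img.stem + ".txt")).exists()
    )
    if not pairs:
        return []
    count = max(1, round(len(pairs) * fraction))
    # A fixed seed makes the subset reproducible between experiments.
    chosen = random.Random(seed).sample(pairs, count)
    # "images" and "labels" are hypothetical output folder names.
    (out_dir / "images").mkdir(parents=True, exist_ok=True)
    (out_dir / "labels").mkdir(parents=True, exist_ok=True)
    for img, lbl in chosen:
        shutil.copy2(img, out_dir / "images" / img.name)
        shutil.copy2(lbl, out_dir / "labels" / lbl.name)
    return [img.stem for img, _ in chosen]
```

Keeping the seed fixed means every spec change is evaluated on the same subset, so differences in mAP come from the spec and not from the sample.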

jiyuwang · October 27, 2020, 6:41am

In fact, we only have 2900 images in total. …

Morganh · October 27, 2020, 6:42am

OK, it is fine.

jiyuwang · October 27, 2020, 7:10am

Will ‘export TF_FORCE_GPU_ALLOW_GROWTH=true’ cause the error?