Train yolov3

I got this warning during training, and later found that the mAP of my yolov3 model stays very low. Is the warning the cause? What should I do?

Hi jiyuwang,
The screenshot is hard to read. Could you please attach the training log as a text file? Please attach your training spec too. Thanks.

j1027100919_full.log (193.8 KB)

faster-rcnn and retina-net now train fine on my dataset, but yolov3 always has problems with mAP and loss. I have changed the learning rate several times.

I am afraid some parameters need fine-tuning.
Please also refer to the spec from the Jupyter notebook (examples/yolo/specs/yolo_train_resnet18_kitti.txt).


  1. Set
    freeze_blocks: 0
    freeze_bn: false
  2. Set a larger batch_size_per_gpu, for example, 4
  3. Then fine-tune the learning rate if the loss becomes NaN

OK, let me try. A larger batch_size_per_gpu is a problem, though: our images are 1920x1080 (more than 4 times larger than the default size). When I set it to 4, OOM occurred.

If possible, I suggest training with two GPUs.
Or you can resize the 1920x1080 images/labels to 960x544.
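
For the label side of that resize, here is a minimal sketch, assuming the labels are in KITTI text format, where fields 4 to 7 of each line hold the box as left, top, right, bottom in pixels. The function name and the example label line are made up for illustration. The images themselves still need to be resized separately with an image tool of your choice. Note that 960x544 is not exactly 16:9, so the x and y scale factors differ slightly.

```python
# Sketch only: rescale the 2D box of a KITTI-format label line.
# Assumption: fields 4..7 are left, top, right, bottom in pixels.

SRC_W, SRC_H = 1920, 1080   # original image size
DST_W, DST_H = 960, 544     # target image size

def scale_kitti_label_line(line, sx, sy):
    """Return the label line with its box coordinates scaled by (sx, sy)."""
    fields = line.split()
    if len(fields) < 8:
        raise ValueError("expected at least 8 fields in a KITTI label line")
    left, top, right, bottom = (float(v) for v in fields[4:8])
    fields[4:8] = [
        f"{left * sx:.2f}",
        f"{top * sy:.2f}",
        f"{right * sx:.2f}",
        f"{bottom * sy:.2f}",
    ]
    return " ".join(fields)

sx = DST_W / SRC_W   # horizontal scale factor
sy = DST_H / SRC_H   # vertical scale factor

# Hypothetical label line, not taken from any real dataset.
example = "car 0.00 0 0.00 100.00 200.00 300.00 400.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00"
scaled = scale_kitti_label_line(example, sx, sy)
print(scaled)
```

Because only the four box fields change, the existing annotations can be converted by script instead of being drawn again.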

I used 4 GPUs with 16 GB each. Yes, I have thought about reducing the size, but that may lose some of the detail in the original images, so I chose not to shrink them. Also, our images are already labeled, and I would need to modify the label boxes again.

What does your attached screenshot for yolo_v3 show? Do you mean you have already trained a yolo_v3 model successfully?
Also, if you use 4 GPUs, I would not expect OOM.

That yolov3 run uses the default settings, with resnet18 as the backbone. I am now trying darknet53, and darknet53 always has problems… This troubles me a lot.

In fact, I tried to set batch_size to 4 but an OOM error did occur, so I adjusted it down.

If possible, could you share some images so that I can try training on them?

Also, could you try training darknet53 on the KITTI dataset with the default Jupyter notebook and your 4 GPUs? Do you get any errors?

Sorry, this is our company’s private data, so I am afraid I can’t share it. I used 4 GPUs to train the faster-rcnn and retina-net models without any problems, and they are quite accurate. But yolov3 failed.

OK, as I mentioned above, please try to set/fine-tune some hyper-parameters (batch size, learning_rate, freeze_blocks, freeze_bn, etc.) in the yolo spec.
You can also compare against the public KITTI dataset, using your spec and 4 GPUs.

OK, let me try again.

A small tip: to speed up your experiments, try training with a small part of your training data first, so you can settle on a training spec. Once the mAP is as expected, increase the amount of training data.
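
One way to pick such a subset is sketched below, assuming each sample is identified by a file stem shared by its image and its label file. The function name, the 10% fraction, and the list of 1000 stems are placeholders for illustration. A fixed seed keeps the subset the same across runs, so results from different specs stay comparable.

```python
import random

def pick_subset(names, fraction, seed=0):
    """Return a reproducible random subset of the sample names."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(names) * fraction))
    rng = random.Random(seed)   # local generator, does not touch global state
    # Sort first so the result does not depend on the input order.
    return sorted(rng.sample(sorted(names), k))

# Hypothetical file stems; in practice, list the stems of your image files.
all_names = [f"{i:06d}" for i in range(1000)]
subset = pick_subset(all_names, 0.1)
print(len(subset), subset[:3])
```

The chosen images and labels can then be copied or linked into a separate folder and the dataset conversion run on that folder only.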

In fact, we only have 2900 images in total. …

OK, it is fine.

Will ‘export TF_FORCE_GPU_ALLOW_GROWTH=true’ cause the error?