Reproducing YoloV4 COCO mAP

Hardware Platform (Jetson / GPU)
GPU

DeepStream Version
nvcr.io/nvidia/deepstream:6.0-devel

NVIDIA GPU Driver Version (valid for GPU only)
NVIDIA-SMI 495.46 Driver Version: 495.46 CUDA Version: 11.5

Issue Type (questions, new requirements, bugs)
I have been trying to replicate the Darknet YoloV4 results on the COCO dataset with TAO. I really like the TAO workflow, but I have been unable to match Darknet's accuracy (mAP): my results are consistently lower.

Given the resources at your disposal, are you able to produce a training spec (with bonus points for an official model on NVIDIA NGC) that produces per-class accuracy similar to the figures below? These were calculated by running the official Darknet yolov4.weights and yolov4.cfg against the COCO2017 validation set (5000 images). I am sure this would be very helpful as a starting point for training custom YoloV4 models.

class_id = 0, name = person, ap = 79.29%   	 (TP = 7956, FP = 3157) 
class_id = 1, name = bicycle, ap = 60.28%   	 (TP = 173, FP = 94) 
class_id = 2, name = car, ap = 68.93%   	 (TP = 1290, FP = 703) 
class_id = 3, name = motorcycle, ap = 74.49%   	 (TP = 266, FP = 134) 
class_id = 4, name = airplane, ap = 90.64%   	 (TP = 124, FP = 26) 
class_id = 5, name = bus, ap = 84.81%   	 (TP = 221, FP = 56) 
class_id = 6, name = train, ap = 93.01%   	 (TP = 168, FP = 37) 
class_id = 7, name = truck, ap = 61.90%   	 (TP = 253, FP = 214) 
class_id = 8, name = boat, ap = 54.23%   	 (TP = 223, FP = 132) 
class_id = 9, name = traffic light, ap = 55.11%   	 (TP = 371, FP = 216) 
class_id = 10, name = fire hydrant, ap = 89.23%   	 (TP = 86, FP = 11) 
class_id = 11, name = stop sign, ap = 77.69%   	 (TP = 56, FP = 19) 
class_id = 12, name = parking meter, ap = 68.42%   	 (TP = 38, FP = 14) 
class_id = 13, name = bench, ap = 43.16%   	 (TP = 178, FP = 195) 
class_id = 14, name = bird, ap = 53.50%   	 (TP = 223, FP = 102) 
class_id = 15, name = cat, ap = 90.56%   	 (TP = 167, FP = 52) 
class_id = 16, name = dog, ap = 82.53%   	 (TP = 178, FP = 65) 
class_id = 17, name = horse, ap = 85.73%   	 (TP = 226, FP = 70) 
class_id = 18, name = sheep, ap = 78.52%   	 (TP = 287, FP = 136) 
class_id = 19, name = cow, ap = 80.75%   	 (TP = 287, FP = 93) 
class_id = 20, name = elephant, ap = 87.41%   	 (TP = 228, FP = 64) 
class_id = 21, name = bear, ap = 92.45%   	 (TP = 62, FP = 5) 
class_id = 22, name = zebra, ap = 91.89%   	 (TP = 226, FP = 41) 
class_id = 23, name = giraffe, ap = 93.04%   	 (TP = 206, FP = 33) 
class_id = 24, name = backpack, ap = 33.61%   	 (TP = 132, FP = 189) 
class_id = 25, name = umbrella, ap = 69.18%   	 (TP = 283, FP = 163) 
class_id = 26, name = handbag, ap = 33.49%   	 (TP = 196, FP = 262) 
class_id = 27, name = tie, ap = 57.87%   	 (TP = 140, FP = 74) 
class_id = 28, name = suitcase, ap = 71.06%   	 (TP = 201, FP = 112) 
class_id = 29, name = frisbee, ap = 88.07%   	 (TP = 99, FP = 34) 
class_id = 30, name = skis, ap = 51.67%   	 (TP = 118, FP = 73) 
class_id = 31, name = snowboard, ap = 56.31%   	 (TP = 39, FP = 23) 
class_id = 32, name = sports ball, ap = 62.70%   	 (TP = 168, FP = 87) 
class_id = 33, name = kite, ap = 67.57%   	 (TP = 218, FP = 135) 
class_id = 34, name = baseball bat, ap = 61.75%   	 (TP = 83, FP = 36) 
class_id = 35, name = baseball glove, ap = 65.70%   	 (TP = 95, FP = 44) 
class_id = 36, name = skateboard, ap = 79.67%   	 (TP = 142, FP = 33) 
class_id = 37, name = surfboard, ap = 63.34%   	 (TP = 163, FP = 83) 
class_id = 38, name = tennis racket, ap = 85.23%   	 (TP = 188, FP = 64) 
class_id = 39, name = bottle, ap = 58.25%   	 (TP = 583, FP = 424) 
class_id = 40, name = wine glass, ap = 58.73%   	 (TP = 180, FP = 112) 
class_id = 41, name = cup, ap = 64.70%   	 (TP = 567, FP = 425) 
class_id = 42, name = fork, ap = 59.49%   	 (TP = 117, FP = 93) 
class_id = 43, name = knife, ap = 35.51%   	 (TP = 107, FP = 113) 
class_id = 44, name = spoon, ap = 36.94%   	 (TP = 89, FP = 139) 
class_id = 45, name = bowl, ap = 61.86%   	 (TP = 382, FP = 320) 
class_id = 46, name = banana, ap = 43.44%   	 (TP = 152, FP = 144) 
class_id = 47, name = apple, ap = 29.17%   	 (TP = 83, FP = 115) 
class_id = 48, name = sandwich, ap = 57.77%   	 (TP = 97, FP = 84) 
class_id = 49, name = orange, ap = 40.90%   	 (TP = 139, FP = 173) 
class_id = 50, name = broccoli, ap = 45.10%   	 (TP = 139, FP = 156) 
class_id = 51, name = carrot, ap = 35.05%   	 (TP = 162, FP = 275) 
class_id = 52, name = hot dog, ap = 54.20%   	 (TP = 60, FP = 36) 
class_id = 53, name = pizza, ap = 73.71%   	 (TP = 207, FP = 95) 
class_id = 54, name = donut, ap = 62.85%   	 (TP = 222, FP = 154) 
class_id = 55, name = cake, ap = 62.36%   	 (TP = 188, FP = 126) 
class_id = 56, name = chair, ap = 56.48%   	 (TP = 998, FP = 835) 
class_id = 57, name = couch, ap = 65.76%   	 (TP = 165, FP = 125) 
class_id = 58, name = potted plant, ap = 52.67%   	 (TP = 192, FP = 198) 
class_id = 59, name = bed, ap = 72.57%   	 (TP = 113, FP = 52) 
class_id = 60, name = dining table, ap = 47.17%   	 (TP = 368, FP = 401) 
class_id = 61, name = toilet, ap = 85.77%   	 (TP = 150, FP = 42) 
class_id = 62, name = tv, ap = 83.08%   	 (TP = 230, FP = 82) 
class_id = 63, name = laptop, ap = 80.98%   	 (TP = 180, FP = 74) 
class_id = 64, name = mouse, ap = 82.85%   	 (TP = 85, FP = 34) 
class_id = 65, name = remote, ap = 60.85%   	 (TP = 166, FP = 115) 
class_id = 66, name = keyboard, ap = 76.71%   	 (TP = 115, FP = 70) 
class_id = 67, name = cell phone, ap = 62.18%   	 (TP = 165, FP = 97) 
class_id = 68, name = microwave, ap = 77.63%   	 (TP = 44, FP = 22) 
class_id = 69, name = oven, ap = 65.43%   	 (TP = 90, FP = 65) 
class_id = 70, name = toaster, ap = 60.70%   	 (TP = 5, FP = 5) 
class_id = 71, name = sink, ap = 65.99%   	 (TP = 148, FP = 80) 
class_id = 72, name = refrigerator, ap = 81.52%   	 (TP = 100, FP = 47) 
class_id = 73, name = book, ap = 26.10%   	 (TP = 298, FP = 378) 
class_id = 74, name = clock, ap = 73.27%   	 (TP = 200, FP = 77) 
class_id = 75, name = vase, ap = 58.27%   	 (TP = 175, FP = 153) 
class_id = 76, name = scissors, ap = 51.90%   	 (TP = 17, FP = 8) 
class_id = 77, name = teddy bear, ap = 71.03%   	 (TP = 134, FP = 63) 
class_id = 78, name = hair drier, ap = 7.12%   	 (TP = 1, FP = 3) 
class_id = 79, name = toothbrush, ap = 40.25%   	 (TP = 27, FP = 28)

 for conf_thresh = 0.25, precision = 0.69, recall = 0.65, F1-score = 0.67 
 for conf_thresh = 0.25, TP = 24077, FP = 10831, FN = 12704, average IoU = 56.97 % 

 IoU threshold = 50 %, used Area-Under-Curve for each unique Recall 
 mean average precision (mAP@0.50) = 0.703672, or 70.37 % 
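
For anyone comparing these numbers against TAO's evaluation output, here is a minimal Python sketch that parses the per-class lines above and recomputes the summary metrics from the reported TP/FP/FN counts. The function names and the regular expression are my own, written against the output format quoted in this post; they are not part of Darknet or TAO.

```python
import re

# Matches one per-class line of the `darknet detector map` output quoted above.
LINE_RE = re.compile(
    r"class_id = (\d+), name = (.+?), ap = ([\d.]+)%\s+\(TP = (\d+), FP = (\d+)\)"
)


def parse_class_line(line):
    """Return (class_id, name, ap_percent, tp, fp), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return (int(m.group(1)), m.group(2), float(m.group(3)),
            int(m.group(4)), int(m.group(5)))


def precision_recall_f1(tp, fp, fn):
    """Standard detection metrics at a fixed confidence threshold."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Totals reported above for conf_thresh = 0.25.
    p, r, f1 = precision_recall_f1(tp=24077, fp=10831, fn=12704)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

With the totals above this gives precision 0.69, recall 0.65 and F1 0.67, which agrees with the summary lines. Note that the per-class TP/FP counts are taken at conf_thresh = 0.25, while the AP values are computed over the whole precision/recall curve, so the two should not be expected to line up directly.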

May I know how you trained to get the result above?

Hi.
This was produced by running the command below as per docs on the Darknet repository: https://github.com/AlexeyAB/darknet

The yolov4.cfg is the official configuration file from here: https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/yolov4.cfg

The yolov4.weights file is the official weights file for YoloV4 from here: https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.weights

The .cfg file plus .weights are the official files used to produce the current #14 place on the COCO real-time object detection benchmark on Papers With Code.

The images for running this validation are the 5000 validation images available from the Coco Downloads page: http://images.cocodataset.org/zips/val2017.zip

darknet detector map ./obj.data yolov4.cfg yolov4.weights
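
For reference, a minimal sketch of how the obj.data file passed to the command above could be generated. The keys (classes, train, valid, names, backup) follow the data-file format described in the Darknet README; every path and file name in the usage comments is a placeholder for illustration and not the exact layout used in this thread.

```python
from pathlib import Path


def build_obj_data(num_classes, valid_list, names_file,
                   train_list=None, backup_dir="backup/"):
    """Return the text of a Darknet .data file."""
    lines = [f"classes = {num_classes}"]
    if train_list is not None:
        lines.append(f"train = {train_list}")
    lines += [
        f"valid = {valid_list}",
        f"names = {names_file}",
        f"backup = {backup_dir}",
    ]
    return "\n".join(lines) + "\n"


def write_image_list(image_dir, list_path):
    """Write one absolute .jpg path per line and return the number of images listed."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    Path(list_path).write_text("".join(f"{p.resolve()}\n" for p in images))
    return len(images)


# Example usage (placeholder paths):
#   write_image_list("coco/images/val2017", "val2017.txt")
#   Path("obj.data").write_text(build_obj_data(80, "val2017.txt", "coco.names"))
```

As far as I understand, Darknet also needs a YOLO-format label .txt file for each image listed under valid, so the COCO JSON annotations have to be converted before the map command will report meaningful numbers.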

I am hopeful that you are able to reproduce the same results with TAO.

Thanks for the info. I will check further.
Also, for the yolov3 SOTA experiment, please refer to

Thank you @Morganh. I have taken the ‘SOTA’ Yolov3 specs and adapted them for Yolov4 (with a subset of classes). I will report back once training is done, with a comparison to Darknet.

Please note that for the TAO 21.11 version, in yolov4:
loss_loc_weight: 1.0
loss_neg_obj_weights: 1.0
loss_class_weights: 1.0

More settings can be found in the YOLOv4 section of the TAO Toolkit 3.21.11 documentation.

Also, before training YOLOv4, it is necessary to first train a classification network on the ImageNet dataset to get a good pretrained model. See Prepare state of the art models for classification and object detection with TAO.

Thanks @Morganh. Do the pretrained models from NVIDIA such as nvidia/tao/pretrained_object_detection:cspdarknet53 meet your definition of a ‘good pretrained model’?

No, it does not. That model was trained on the Open Images dataset.
Please follow the NVIDIA Developer Blog post "Preparing State-of-the-Art Models for Classification and Object Detection with NVIDIA TAO Toolkit" to train a classification model on the ImageNet dataset. Due to copyright issues, we can’t provide the ImageNet dataset or any ImageNet-pretrained models in TAO Toolkit.

To train a classification model with the cspdarknet53 backbone, please change
arch: "darknet"
to
arch: "cspdarknet"

in darknet53.txt on the release/tao3.0 branch of the NVIDIA-AI-IOT/deepstream_tao_apps GitHub repository.

Thank you @Morganh. I have this training and it should complete in the next week or two :D

Please leave this topic open so I can update once this step is complete.

Also, please note that the official Darknet GitHub repository uses COCO 2014, so we should use COCO 2014 instead of COCO 2017.
The official Darknet PTM (pretrained model) is trained on COCO 2014.

Below are the expected numbers of training and validation images.

trainval2014: 117264 images
val5k: 4954 images

See the details in issue #5751 of the AlexeyAB/darknet GitHub repository, "Inconsistent splits between COCO 2014 and COCO 2017?",
and https://developer.nvidia.com/blog/preparing-state-of-the-art-models-for-classification-and-object-detection-with-tao-toolkit/

Download the COCO 2014 dataset from the COCO website. To compare with the SOTA model, do the training/testing split the same way as the original author. Also, the author’s training/validation split is different from the COCO 2014 official training/validation split and can be reproduced by the get_coco_dataset.sh bash file.

Using the bash file, get 5k.txt and no5k.txt. Those are the file names for validation and training images/labels. After preparing the data following the COCO 2014 data preparation section, merge the original training/validation set and re-split it according to those two files.
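
The re-split described above could be sketched in Python as follows. This assumes that 5k.txt and no5k.txt contain one image path per line and that matching on the file name alone is sufficient; the function names are hypothetical and the sketch only decides which split each image belongs to, leaving the actual copying of images and labels to you.

```python
from pathlib import PurePosixPath


def read_split_names(list_text):
    """Return the set of image file names (basename only) listed in a split file."""
    return {
        PurePosixPath(line.strip()).name
        for line in list_text.splitlines()
        if line.strip()
    }


def assign_split(image_names, val_names, train_names):
    """Partition image_names into (train, val, unlisted) lists, each sorted by name."""
    train, val, unlisted = [], [], []
    for name in sorted(image_names):
        if name in val_names:
            val.append(name)
        elif name in train_names:
            train.append(name)
        else:
            unlisted.append(name)
    return train, val, unlisted


# Example usage (placeholder paths):
#   from pathlib import Path
#   val_names = read_split_names(Path("5k.txt").read_text())
#   train_names = read_split_names(Path("no5k.txt").read_text())
#   merged = [p.name for p in Path("coco2014/merged_images").glob("*.jpg")]
#   train, val, unlisted = assign_split(merged, val_names, train_names)
```

A useful sanity check after the split is to compare len(train) and len(val) against the image counts quoted earlier in this thread, and to confirm that the unlisted list is empty.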

Just to provide an update, I have successfully trained a cspdarknet53 backbone from scratch using ImageNet2012. After 200 epochs it reached a Top-1 accuracy of 77.19%, which is in line with expected results. I think there may be some gains to be had by trying activation: mish, which I may do.

I will now try to train a COCO model with this backbone and the Darknet COCO 2014 split to confirm TAO produces similar results.

May I know whether that is test accuracy?

That was validation set accuracy. Training accuracy finished at 74.09%. I feel that it may improve with a few more epochs still.

Sharing the results of two kinds of models trained internally.
1st: default activation, 200 epochs. Training accuracy: 0.769, val_accuracy: 0.7813
2nd: activation: mish, 300 epochs. Training accuracy: 0.79998, val_accuracy: 0.7883575

Great, this is very useful information. I did try multi-GPU training for part of the process (which I believe impacts training), so I will try to continue the run until we reach your results.
