I am training a custom object detection model (ResNet-10 with DetectNet_v2) for 6 classes using the VOC/COCO datasets. I have converted these datasets to the KITTI data format, created the TFRecords, and edited the spec file for a multi-class detector. However, when I evaluate the trained model after 50 epochs, I do not get reliable average precision figures.
I am getting the following mAP results:
class name average precision (in %)
------------ --------------------------
bicycle 0
bus 2.48739
car 0
motorbike 0.42388
person 6.92905
truck 0.442265
During training and evaluation, I got the following message:
target/truncation is not updated to match the crop area if the dataset contains target/truncation.
During evaluation, I got the following message:
One or more metadata field(s) are missing from ground_truth batch_data, and will be replaced with defaults: ['frame/camera_location']
Following is the statistic for number of data samples:
Number of images in the trainval set: 319492
Number of labels in the trainval set: 319492
Number of images in the test set: 7518
Kindly check a sample of the data format used for TFRecord conversion below:
I have replaced all the remaining KITTI fields with the value -1. From the documentation, I understand that only the class name and bounding box corners (xmin, ymin, xmax, ymax) need to be provided.
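To make the placeholder scheme concrete, here is a small Python sketch (mine, not from the original post) that builds a 15-field KITTI label line in which only the class name and bbox corners carry real values and everything else is -1, matching the approach described above:

```python
# Sketch of a KITTI label line: only the class name and bbox corners carry
# real values; all other fields are -1 placeholders. Field order follows the
# KITTI object detection label spec (15 whitespace-separated fields).
def kitti_label_line(cls, xmin, ymin, xmax, ymax):
    # type truncated occluded alpha xmin ymin xmax ymax h w l x y z rot_y
    head = [-1] * 3           # truncated, occluded, alpha
    tail = [-1] * 7           # dimensions (h, w, l), location (x, y, z), rotation_y
    fields = [cls] + head + [xmin, ymin, xmax, ymax] + tail
    return " ".join(str(f) for f in fields)

print(kitti_label_line("car", 100.0, 120.5, 300.0, 250.0))
```

Each image gets one `.txt` label file, with one such line per object.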
Also, the training and validation data are combined together in the trainval set. Does that mean training itself might have issues because of this?
Kindly help me out if other ground truth data is required.
For DetectNet_v2, the tlt-train tool does not support training on images of multiple resolutions, or resizing images during training. All of the images must be resized offline to the final training size, and the corresponding bounding boxes must be scaled accordingly.
Thanks for your help. I checked the same-image-size requirement against the sample dataset used for running the DetectNet_v2 example. The images do not all seem to be of the same size. For example,
How has the resizing been done, if at all? Since we provide these images directly for TFRecord generation, it implies that images of different resolutions are being passed to the training tool.
Hi noephyte1,
The KITTI dataset (1242x375, 1238x374, 1224x370, 1241x376) almost matches the spec (1248, 384), but not exactly. During training, there is a crop step to crop the images to the same size. If the original image is smaller than the model input size, the crop becomes padding.
But you mentioned that your datasets are VOC and COCO (640x480). That is far from (1248, 384).
So for DetectNet_v2, please resize them offline to the final training size.
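A hedged sketch of the bounding-box side of that offline resize (pure Python box math; the images themselves could be resized with, e.g., Pillow's `Image.resize` — my assumption, not something the thread prescribes):

```python
# When resizing images offline to the final training size, the KITTI bbox
# coordinates must be scaled by the same width/height factors, or the labels
# will no longer match the pixels.
def scale_bbox(xmin, ymin, xmax, ymax, orig_w, orig_h, new_w, new_h):
    sx = new_w / orig_w   # horizontal scale factor
    sy = new_h / orig_h   # vertical scale factor
    return (xmin * sx, ymin * sy, xmax * sx, ymax * sy)

# e.g. a box in a VOC 640x480 image, after resizing the image to 480x480
print(scale_bbox(100, 120, 300, 240, 640, 480, 480, 480))
# → (75.0, 120.0, 225.0, 240.0)
```

Note that a non-uniform resize like 640x480 → 480x480 distorts the aspect ratio, which is exactly the concern raised later in this thread.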
Where do we specify the final training size? Can we change (1248, 384) to (480, 480), for example? As advised, I am resizing all my images to 480x480 and then feeding them in for training.
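For reference, in a DetectNet_v2 training spec the final training size is typically set under augmentation_config. The fragment below is an illustrative sketch for 480x480 (field names per the TLT documentation; exact values and defaults may differ by version):

```
augmentation_config {
  preprocessing {
    output_image_width: 480
    output_image_height: 480
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
}
```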
I made the corresponding change in the train config file.
Thanks for your help. After training the model for 50 epochs, I get bizarre results: only for the car class do I get significantly low precision. Please find the evaluation results below:
Validation cost: 0.000277
Mean average_precision (in %): 27.2255
class name average precision (in %)
------------ --------------------------
bicycle 18.8004
bus 45.8717
car 6.98904
motorbike 28.2466
person 44.9052
truck 18.54
Median Inference Time: 0.004990
How should the parameters of the training config file be changed based on the training input image size? Is there any documentation explaining how to customize the parameters? That is probably the reason for the poor performance of the detector on some classes.
Hi neophyte1,
Thanks for the info. Is it possible to narrow down the low mAP for car via more experiments?
1) Could you retrain with batch size 4 and 120 epochs? Your current batch size is 16.
or 2) Train only 3 classes: person/bicycle/car
or 3) Change (480, 480) to another resolution.
Also, could you check the correctness of all the labels, and make sure the data and labels are matched?
I am currently using multiple GPUs — 4 of them with batch size 16 per GPU. I have tried with batch size 6 per GPU as well; however, the results did not improve. Should I try batch size 4 per GPU with 4 GPUs, or an overall batch size of 4?
Yesterday, I clubbed motorbike and bicycle class to cyclist as given in the example config and car, bus and truck to car using class mapping in the config. I used the same config parameters as given in the sample. However, the performance deteriorated. Please note that with sample KITTI dataset, the mean average precision is quite high for all 3 classes. I will try just using bicycle, person and car without clubbing and will let you know the results.
Should I try (640, 480) or (720, 480), since in the sample the size is (1248, 384)? Maybe I should not feed a square input size?
I will recheck correctness of all labels. However, I have visualized multiple times to make sure the data and labels are matched. I can upload some sample images and labels if you wish to cross check. Please let me know.
Hi neophyte1,
More pointers are as below. You can do more experiments via one pointer or several.
Could you check the raw image sizes from your VOC/COCO datasets, and calculate the raw image aspect ratio? If the training spec's width/height differs too much from the raw size and aspect ratio, the results will not be good.
Does the dataset have a lot of small cars and trucks? If the targets are small, expect a small AP.
In your spec, the car class_weight is too small. Try increasing that weight.
Person: class_weight 4, bbox weight 10
Bicycle: class_weight 8, bbox weight 1
Car: class_weight 1, bbox weight 10
Motorbike: class_weight 8, bbox weight 1
Bus: class_weight 8, bbox weight 1
Truck: class_weight 8, bbox weight 1
minimum_bounding_box_height: 20
Could you reduce it to 10? If there are a lot of small targets, this setting filters them out.
minimum_detection_ground_truth_overlap {
  key: "car"
  value: 0.699999988079
}
The car IoU threshold is 0.7 during evaluation, while all other classes use 0.5.
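A minimal Python sketch of the IoU metric behind these thresholds (illustrative, not TLT's actual implementation), showing how the same prediction can count as a hit at 0.5 but a miss at 0.7:

```python
# Intersection-over-union of two axis-aligned boxes, each given as
# (xmin, ymin, xmax, ymax).
def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

gt = (100, 100, 200, 200)
pred = (120, 100, 220, 200)   # same size, shifted 20 px right
print(iou(gt, pred))          # ~0.667: a true positive at 0.5, a miss at 0.7
```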
Also, we have no explicit guidance in the docs about how to tune hyper-parameters; more experiments are expected. The "load_graph" flag is set to false by default in the training config file, but for a pruned model, please remember to set this parameter to true. See the TLT docs for details.
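As a rough illustration of the load_graph point (an assumed sketch; field names follow the DetectNet_v2 retrain spec, and the model path is a placeholder):

```
model_config {
  load_graph: true
  pretrained_model_file: "/workspace/pruned/resnet_pruned.tlt"  # placeholder path
}
```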
Thanks for the pointers. Many of them worked. However, I still have doubts regarding the batch size parameter. Following are some results from experiments I ran on just the VOC dataset, for 3 classes, for 22 epochs, with the default configuration:
No. of GPUs : 4
Batch Size per GPU : 4
Average Precision (%):
bicycle : 3.61833
car : 0
person : 12.6573
No. of GPUs : 1
Batch Size per GPU : 4
Average Precision (%):
bicycle : 30
car : 0.5
person : 28
Please do not focus on the precision of the car class, as I did not tweak the parameters you suggested for "car" in this experiment. I have since fixed the precision of the car class by adding more data from the COCO dataset and tweaking the parameters as you suggested.
Kindly let me know if these observations seem correct. If the results are to be believed, then I have the following queries:
How do I make the training work with multiple GPUs?
How do I make the training work with a greater batch size per GPU?
Please let me know and thanks for the pointers again.
Hi neophyte1,
1) See Integrating TAO Models into DeepStream — TAO Toolkit 3.22.05 documentation:
The tlt-train command supports multi-GPU training. You can invoke a multi-GPU training session by using the --gpus N option, where N is the number of GPUs you want to use. N must be less than the number of GPUs available in the given node for training.
2) batch_size_per_gpu can be configured in the spec file.
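For example, a hedged sketch of the relevant training_config fragment (values illustrative, other fields omitted):

```
training_config {
  batch_size_per_gpu: 4
  num_epochs: 120
}
```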
Let me update you on my progress. I am first trying to achieve accuracy, so I opted for the ResNet-18 backbone. After following your guidelines and training, I got really impressive results. However, I do not understand how to prune and retrain the ResNet-18 backbone for my dataset. Pruning and retraining were successful with the ResNet-10 backbone using the prune-threshold parameter set in the example, but when I use the same value for pruning and retraining the ResNet-18 model, I get terrible results. Following are the results of pruning and retraining using pth = 5.2e-6.
Results of training :
Validation cost: 0.002584
Mean average_precision (in %): 30.4075
class name average precision (in %)
------------ --------------------------
bicycle 10.2874
car 32.4107
person 48.5244
Median Inference Time: 0.007108
Results after pruning and retraining:
class name average precision (in %)
------------ --------------------------
bicycle 0
car 1.67
person 25.30