I am using the PeopleNet model on my custom dataset. When I run tlt-train, it prints model.summary, which shows a ResNet34 architecture followed by two layers: output_bbox and output_cov. As far as I know, output_bbox is for the bounding box and output_cov is for coverage or confidence.
But the documentation says PeopleNet is an object detection network built on NVIDIA’s detectnet_v2 architecture with ResNet34 as the backbone feature extractor, and I didn’t see any layers after the ResNet34 layers.
So, which layers come after the ResNet34 layers? Or is ResNet34 used directly, with the two output layers (bbox, cov) created from it?
Detectnet_v2 is an object detection network. It can use ResNet as its backbone feature extractor. ResNet is widely used in industry and is on par with MobileNet and InceptionNet (two common backbone models for feature extraction).
For PeopleNet, there are no layers after the ResNet34 layers.
The ground truth generator for DetectNet_v2 generates two tensors, namely cov and bbox. The image is divided into grid cells of 16x16 pixels. The cov tensor (short for coverage tensor) defines the grid cells that are covered by an object. The bbox tensor defines the normalized image coordinates of the object, (x1, y1) top left and (x2, y2) bottom right, with respect to the grid cell.
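To make the grid idea concrete, here is a minimal Python sketch of how cov and bbox targets could be laid out on a grid of 16x16-pixel cells. This is only an illustration and not the actual TLT ground truth generator: the function names `grid_shape` and `toy_ground_truth` are hypothetical, marking a cell as covered when its center falls inside the box is a simplification, and dividing the offsets by the cell size is an assumed normalization.

```python
STRIDE = 16  # grid cell size in pixels, matching the 16x16 cells described above


def grid_shape(image_w, image_h, stride=STRIDE):
    """Return (columns, rows) of the grid for an image of the given size."""
    return image_w // stride, image_h // stride


def toy_ground_truth(image_w, image_h, box, stride=STRIDE):
    """Build toy cov and bbox grids for one box (x1, y1, x2, y2) in pixels.

    Hypothetical helper: a cell counts as covered if its center lies inside
    the box, and bbox holds the box corners relative to the cell center,
    divided by the stride (an assumed normalization, not TLT's exact one).
    """
    cols, rows = grid_shape(image_w, image_h, stride)
    cov = [[0.0] * cols for _ in range(rows)]
    bbox = [[None] * cols for _ in range(rows)]
    x1, y1, x2, y2 = box
    for r in range(rows):
        for c in range(cols):
            cx = (c + 0.5) * stride  # cell center, x
            cy = (r + 0.5) * stride  # cell center, y
            if x1 <= cx <= x2 and y1 <= cy <= y2:
                cov[r][c] = 1.0
                bbox[r][c] = (
                    (x1 - cx) / stride,
                    (y1 - cy) / stride,
                    (x2 - cx) / stride,
                    (y2 - cy) / stride,
                )
    return cov, bbox


if __name__ == "__main__":
    # Example image size chosen for illustration only.
    print(grid_shape(960, 544))
    cov, bbox = toy_ground_truth(960, 544, (100, 100, 190, 190))
    print(sum(map(sum, cov)), bbox[6][6])
```

The point of the sketch is that both tensors live on the same coarse grid, which is why the network only needs two output layers on top of the backbone: one value per cell for coverage and four values per cell for the box.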
Thanks @Morganh for the reply!