I am currently trying to use the two-class DetectNet (https://github.com/NVIDIA/caffe/blob/caffe-0.15/examples/kitti/detectnet_network-2classes.prototxt) in the DriveWorks environment using a modified DriveNet sample. However, I have run into a weird issue.
The documentation states that the coverage blob should have one channel per class and the bbox blob four channels per class. DetectNet does provide one coverage channel per class, but the bbox blob always has exactly 4 channels, regardless of the number of classes.
Interestingly, detection of the second class (pedestrians) does work, but only when the image is fed to the network twice, as done in the DriveNet sample; the ROI also appears to be ignored for the second instance of the image. When feeding the image only once, only the first class (cars) is detected. The PX2 can run DetectNet at an acceptable framerate, but only when a single image is supplied per frame.
Is this behaviour documented anywhere? And is there a way to get multiclass detection working with a single pass through the network, or to otherwise improve the performance?
I greatly appreciate any help!