The latest TensorRT version features support for grouped convolutions (of which the depthwise convolutions used by MobileNet are a special case), which makes it possible to convert MobileNet into a TensorRT execution plan without using plugin layers.
MobileNet is way faster than VGG nets, so I hoped to get a substantial speedup for my model by switching to it. However, my experiments show that under TensorRT the timings for MobileNet are actually much higher than for VGGNet.
I am comparing the Caffe-SSD VGG16 model (https://github.com/weiliu89/caffe/tree/ssd) to the MobileNet version of the same model (the architecture is defined here: https://github.com/chuanqi305/MobileNet-SSD/blob/master/MobileNetSSD_deploy.prototxt).
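For reference, here is a minimal sketch of how such a conversion can be done with the TensorRT 3 Caffe parser. The file names follow the linked repo; the marked output blob name and the workspace size are my assumptions, and the SSD-specific detection layers may still need separate handling:

```cpp
#include "NvInfer.h"
#include "NvCaffeParser.h"
#include <iostream>

using namespace nvinfer1;
using namespace nvcaffeparser1;

// Minimal logger required by the TensorRT builder.
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity != Severity::kINFO)
            std::cout << msg << std::endl;
    }
} gLogger;

ICudaEngine* buildEngine()
{
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
    ICaffeParser* parser = createCaffeParser();

    // Parse the deploy prototxt plus weights. The grouped/depthwise
    // convolutions are handled natively, so no plugin factory is
    // needed for the convolutional part of the network.
    const IBlobNameToTensor* blobs = parser->parse(
        "MobileNetSSD_deploy.prototxt",
        "MobileNetSSD_deploy.caffemodel",
        *network, DataType::kFLOAT);

    // Assumption: "detection_out" is the final blob in the linked prototxt.
    network->markOutput(*blobs->find("detection_out"));

    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(1 << 28); // 256 MB, an arbitrary choice

    ICudaEngine* engine = builder->buildCudaEngine(*network);
    network->destroy();
    parser->destroy();
    builder->destroy();
    return engine;
}
```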
On a GTX 1080 the forward-pass times are the following:
             VGGNet    MobileNet
Caffe        22.3 ms   20.3 ms
TensorRT 3   10 ms     29 ms
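For reference, forward-pass timings of this kind can be taken with CUDA events; a sketch, assuming the engine from the builder sketch above and pre-allocated device buffers:

```cpp
#include <cuda_runtime.h>

// Average the forward-pass time over several iterations. `buffers`
// holds the pre-allocated input/output device bindings (elided here).
float timeForwardPass(ICudaEngine* engine, void** buffers, int iterations = 100)
{
    IExecutionContext* context = engine->createExecutionContext();

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up so lazy initialization does not skew the numbers.
    for (int i = 0; i < 10; ++i)
        context->enqueue(1, buffers, stream, nullptr);

    cudaEventRecord(start, stream);
    for (int i = 0; i < iterations; ++i)
        context->enqueue(1, buffers, stream, nullptr);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float totalMs = 0.f;
    cudaEventElapsedTime(&totalMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    context->destroy();
    return totalMs / iterations; // average ms per forward pass
}
```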
I also compared the encoder-only parts (up to pool5 for VGG, up to conv11 for MobileNet): the VGG encoder takes 7.6 ms for a forward pass, while its MobileNet counterpart takes 19 ms.
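The encoder-only variant only changes a couple of lines in the builder sketch above: the deploy prototxt is truncated at the cut point and that blob is marked as the network output (the truncated file name here is hypothetical):

```cpp
// Variation on the builder sketch for the encoder-only timing.
// "MobileNetSSD_encoder.prototxt" is a hypothetical copy of the deploy
// file truncated after conv11 (pool5 in the VGG case).
const IBlobNameToTensor* blobs = parser->parse(
    "MobileNetSSD_encoder.prototxt",
    "MobileNetSSD_deploy.caffemodel",
    *network, DataType::kFLOAT);
network->markOutput(*blobs->find("conv11"));
```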
Given that MobileNet's computational cost is much lower than VGGNet's (896.54M MACs vs 27.82B MACs, roughly 31 times fewer operations), these results look very strange. Is this due to the fact that the current TensorRT is just a release candidate, or should we expect the same performance in the release version?