TensorRT 3 RC and grouped convolutions


The latest TensorRT release candidate adds support for grouped convolutions (of which the depthwise convolution is a special case), which makes it possible to convert MobileNet into a TRT execution plan without using plugin layers.

MobileNet is much faster than the VGG nets, so I hoped to get a substantial speedup for my model by switching to it. However, my experiments show that the TensorRT timings for MobileNet are actually much higher than for VGGNet.

I compared the Caffe-SSD VGG16 (GitHub - weiliu89/caffe at ssd) to the MobileNet version of the same model (the architecture is defined here: https://github.com/chuanqi305/MobileNet-SSD/blob/master/MobileNetSSD_deploy.prototxt).

On a GTX 1080, the forward-pass times are as follows:

VGGNet    MobileNet
22.3 ms   20.3 ms  <- Caffe
10 ms     29 ms    <- TensorRT3

I also compared the encoder-only parts (up to pool5 for VGG, up to conv11 for MobileNet): the VGG encoder takes 7.6 ms for the forward pass, while its MobileNet counterpart takes 19 ms.

Given that MobileNet's computation cost is much lower than VGGNet's (896.54M MACCs vs. 27.82B MACCs), these results look very strange. Is this because the current TRT is just a release candidate, or should we expect the same performance in the release version?

Is there no response here? I'm very interested in this issue, too.

After some research, I found that the reason is how depthwise separable convolutions are implemented under the hood. I believe that, in order to keep the implementation general, grouped convolutions are translated into a sequence of default (and slow) convolution kernels. According to the NVIDIA profiler, this results in several thousand calls to small convolutions instead of treating the depthwise convolution as a single operation.

I overcame this issue by implementing the depthwise convolution myself using the Plugin layer mechanism and got a 5x speed improvement compared to VGG16. However, neither half precision nor 8-bit quantization is supported for custom layers.


Glad to see your work.

Can you share the implementation of depthwise convolution layers in TensorRT?


I wish I could :( But since I did it for my company, I would need written permission, which is unlikely…
I can only say it is inspired by the TensorFlow implementation.


That’s ok.

I tried running TensorFlow before; its running time is about 3 ms on a GTX 1080.

And the Caffe implementation is about 8 ms. (See GitHub - yonghenglh6/DepthwiseConvolution: A personal depthwise convolution layer implementation on caffe by liuhao.(only GPU))

How about the running time in TensorRT with the IPlugin layer?

Can you share how you did it?
I have a Caffe model trained on my custom dataset with MobileNet-SSD, and I want to deploy it with TensorRT. Please kindly tell me how you did it. I’m new to deep learning.