TensorRT 3 RC and grouped convolutions

dmytro.prylipko · October 6, 2017, 12:25pm

Hi,

The latest TensorRT version features support for grouped convolutions (of which the depthwise convolutions in MobileNet are a special case), which makes it possible to convert MobileNet into a TRT execution plan without using plugin layers.

MobileNet is supposed to be much faster than the VGG nets, so I hoped to get a substantial speedup for my model by switching to it. However, my experiments show that under TensorRT, MobileNet is in fact much slower than VGGNet.

I compared the Caffe-SSD VGG16 (weiliu89/caffe on GitHub, ssd branch) to the MobileNet version of the same model (the architecture is defined here: https://github.com/chuanqi305/MobileNet-SSD/blob/master/MobileNetSSD_deploy.prototxt).

On a GTX 1080, the forward pass times are as follows:

             VGGNet    MobileNet
Caffe        22.3 ms   20.3 ms
TensorRT 3   10 ms     29 ms

I also compared the encoder-only part (up to pool5 for VGG, up to conv11 for MobileNet). The VGG encoder takes 7.6 ms for the forward pass, while its MobileNet counterpart takes 19 ms.

Given that MobileNet's computation cost is much lower than VGGNet's (896.54M MACC vs 27.82B MACC), these results look very strange. Is it due to the fact that the current TRT is just a release candidate, or should we expect the same performance in the release version?
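
For anyone wondering where a MACC gap of that size comes from, here is a rough sketch of the per-layer arithmetic. The layer shape (3x3 kernel, 512 input and output channels, 19x19 output) is made up for illustration and is not taken from either prototxt, and `conv_macs` is just a helper name I chose.

```python
def conv_macs(out_h, out_w, kernel, c_in, c_out, groups=1):
    """Multiply-accumulates for one convolution layer (bias ignored).

    Each output element needs kernel * kernel * (c_in / groups) MACs,
    and there are out_h * out_w * c_out output elements.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    return out_h * out_w * kernel * kernel * (c_in // groups) * c_out


def separable_macs(out_h, out_w, kernel, c_in, c_out):
    """Depthwise (groups == c_in) followed by a 1x1 pointwise convolution."""
    depthwise = conv_macs(out_h, out_w, kernel, c_in, c_in, groups=c_in)
    pointwise = conv_macs(out_h, out_w, 1, c_in, c_out)
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer shape, chosen only to show the ratio.
    standard = conv_macs(19, 19, 3, 512, 512)
    separable = separable_macs(19, 19, 3, 512, 512)
    print(f"standard:  {standard:,} MACs")
    print(f"separable: {separable:,} MACs")
    print(f"ratio:     {standard / separable:.2f}x")
```

For a 3x3 kernel the ratio approaches 9x as the channel count grows, so a large drop in MACC is expected on paper. It says nothing about how efficiently a given runtime executes the depthwise part.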

jbm.park · January 3, 2018, 6:42am

Is there no response here? I’m very interested in this issue, too.

dmytro.prylipko · January 3, 2018, 9:32am

After some research I found that the cause is how depthwise-separable convolutions are implemented under the hood. I believe that, to keep the implementation general, grouped convolutions are translated into a sequence of default (and slow) convolution kernels. According to the NVIDIA profiler, this results in several thousand calls to small convolutions instead of the depthwise convolution being treated as a single operation.
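
To illustrate the difference in call pattern, here is a minimal pure-Python sketch. It is not TensorRT's actual code path and not the plugin discussed in this thread; the function names and the "launch" counter are made up. It only contrasts one generic convolution call per channel with a single pass in which every output element reads its own channel.

```python
def conv2d_single(channel, kernel):
    """Valid 2D cross-correlation of one H x W channel with one K x K filter."""
    k = len(kernel)
    out_h = len(channel) - k + 1
    out_w = len(channel[0]) - k + 1
    return [
        [
            sum(channel[y + i][x + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def depthwise_per_group(inputs, kernels):
    """One generic convolution call per channel (the slow pattern)."""
    outputs, launches = [], 0
    for channel, kernel in zip(inputs, kernels):
        outputs.append(conv2d_single(channel, kernel))
        launches += 1  # stands in for one small kernel launch
    return outputs, launches


def depthwise_fused(inputs, kernels):
    """All channels in one pass; one loop iteration per output element."""
    c, k = len(inputs), len(kernels[0])
    out_h = len(inputs[0]) - k + 1
    out_w = len(inputs[0][0]) - k + 1
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(c)]
    for idx in range(c * out_h * out_w):
        # Decode the flat index the way a GPU thread index would be decoded.
        ch, rem = divmod(idx, out_h * out_w)
        y, x = divmod(rem, out_w)
        acc = 0.0
        for i in range(k):
            for j in range(k):
                acc += inputs[ch][y + i][x + j] * kernels[ch][i][j]
        out[ch][y][x] = acc
    return out, 1  # a single launch covers every channel


if __name__ == "__main__":
    # Tiny made-up input: 3 channels of 4 x 4, one 3 x 3 filter per channel.
    inputs = [[[float(c + 4 * y + x) for x in range(4)] for y in range(4)]
              for c in range(3)]
    kernels = [[[float(c + 1)] * 3 for _ in range(3)] for c in range(3)]
    _, slow_launches = depthwise_per_group(inputs, kernels)
    _, fast_launches = depthwise_fused(inputs, kernels)
    print("per-group launches:", slow_launches)
    print("fused launches:", fast_launches)
```

Both functions compute the same result. The per-group version needs one call per channel per layer, which adds up quickly across the depthwise layers of a MobileNet.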

I worked around this issue by implementing the depthwise-separable convolution myself using the plugin layer mechanism, and got a 5x speed improvement compared to VGG16. However, neither half precision nor 8-bit quantization is supported for custom layers.

OnePieceOfDeepLearning · January 3, 2018, 2:14pm

Hi,

Glad to see your work.

Can you share the implementation of depthwise convolution layers in TensorRT?

Thanks

dmytro.prylipko · January 3, 2018, 2:24pm

I wish I could :( But since I did it for my company, I would need written permission, which is unlikely…
I can only say that it is inspired by the TensorFlow implementation.

OnePieceOfDeepLearning · January 3, 2018, 4:16pm

Hi,

That’s OK.

I tried running the TensorFlow implementation before; its running time is about 3 ms on a GTX 1080.

The Caffe implementation takes about 8 ms (see yonghenglh6/DepthwiseConvolution on GitHub, a GPU-only depthwise convolution layer for Caffe by liuhao).

What is the running time in TensorRT with the IPlugin layer?

piyalgeorge · October 30, 2018, 6:28am

Can you share how you did it?
I have a Caffe MobileNet-SSD model trained on my custom dataset, and I want to run it in TensorRT. Could you please tell me how you did it? I’m new to deep learning.
