Question on 1x1 conv acceleration in TensorRT

I converted MobileNet v1/v2 TensorFlow models with trt.create_inference_graph() and then ran them through the TensorFlow-TensorRT integration.
However, the performance gain is small: inference time improves by only ~10% (FLOPs/second: 104.72B).
(For other networks such as Inception v2, the improvement is larger.)

Then I did some investigation:
Most of the computation in MobileNet v1/v2 is in 1x1 convolutions, and a 1x1 conv is memory-friendly with the NHWC data format, but TensorRT supports the NCHW data format.
In "TensorRT-Developer-Guide 5.pdf", the listed ops include Conv2D and DepthwiseConv2dNative. I guess a 1x1 conv is treated as an ordinary Conv2D and is not specifically optimized.

Since 1x1 convs are widely used in current network models, could TensorRT add some optimization for them?
For example, supporting the NHWC data format might help.
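
To illustrate the layout point above, here is a minimal sketch in plain Python. It computes where the input channels of a single pixel sit in a dense buffer under each layout. The tensor shape and the helper names (`flat_offset`, `channel_offsets`) are hypothetical and chosen only for illustration; nothing here reflects how TensorRT lays out memory internally.

```python
def flat_offset(layout, shape, n, c, h, w):
    """Offset of element (n, c, h, w) in a dense buffer stored in `layout`."""
    N, C, H, W = shape
    if layout == "NCHW":
        return ((n * C + c) * H + h) * W + w
    if layout == "NHWC":
        return ((n * H + h) * W + w) * C + c
    raise ValueError("unknown layout: " + layout)


def channel_offsets(layout, shape, h, w):
    """Offsets of every channel at pixel (h, w) of the first image in the batch."""
    channels = shape[1]
    return [flat_offset(layout, shape, 0, c, h, w) for c in range(channels)]


# Hypothetical tiny tensor: batch 1, 4 channels, 3x3 spatial size.
shape = (1, 4, 3, 3)

# A 1x1 conv reads all input channels of one pixel to produce that output pixel.
nhwc = channel_offsets("NHWC", shape, 1, 2)
nchw = channel_offsets("NCHW", shape, 1, 2)

print("NHWC offsets:", nhwc)  # consecutive addresses (stride 1)
print("NCHW offsets:", nchw)  # addresses H*W elements apart
```

In NHWC the channel vector of a pixel is contiguous, so a 1x1 conv reduces to a matrix multiply over rows of length C; in NCHW the same values are H*W elements apart. Whether this matters in practice depends on the kernel implementation.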

Thank you for your feedback. Will bring this up with our engineering team.

regards,
NVES

Is there any update? Thanks.

Hello,

The engineering team is reviewing this enhancement request. I don’t have additional info to share publicly. Please stay tuned for future release announcements.

I analyzed the profiling data and found that the depthwise conv in MobileNet takes much more time.
In the MobileNet bottleneck structure, the 1x1 ops involve much more computation than the depthwise 3x3 conv, yet the depthwise 3x3 conv takes much more time.
It seems the depthwise conv needs optimization in TensorRT.
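
The computation claim above can be checked with a rough multiply-accumulate (MAC) count. This is only a sketch: the layer dimensions below (14x14 feature map, 64 input channels, expansion factor 6, stride 1) are hypothetical, since the thread does not give the actual shapes of expanded_conv_5.

```python
def conv1x1_macs(h, w, c_in, c_out):
    """MACs of a 1x1 convolution: one dot product over c_in per output value."""
    return h * w * c_in * c_out


def depthwise_macs(h, w, channels, k=3):
    """MACs of a k x k depthwise convolution: one k*k filter per channel."""
    return h * w * channels * k * k


# Hypothetical bottleneck dimensions (not taken from the thread).
H, W = 14, 14
C_IN, EXPANSION = 64, 6
C_MID = C_IN * EXPANSION  # channels after the expand 1x1 conv

expand = conv1x1_macs(H, W, C_IN, C_MID)
depthwise = depthwise_macs(H, W, C_MID)
project = conv1x1_macs(H, W, C_MID, C_IN)

print("expand    1x1:", expand)
print("depthwise 3x3:", depthwise)
print("project   1x1:", project)
print("expand / depthwise ratio: %.2f" % (expand / depthwise))
```

At stride 1 the expand-to-depthwise ratio is simply C_IN / 9, so the 1x1 conv does more arithmetic whenever the block has more than 9 input channels.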

Another question: what does ‘depthwise input reformatter’ mean?

expand

expanded_conv_5/expand/Conv2D                      :   0.073ms(0.191%)   >>84.892%<<
expanded_conv_5/expand/Relu6                       :   0.058ms(0.152%)   >>89.569%<<
expanded_conv_5/expand/Relu6/relu1                 :   0.055ms(0.144%)   >>91.052%<<
expanded_conv_5/expand/Relu6/sub1 + FeatureExtractor/MobilenetV2/expanded_conv_5/expand/Relu6/relu2:   0.076ms(0.199%)   >>84.115%<<

depthwise

expanded_conv_5/depthwise/depthwise                :   0.313ms(0.820%)   >>50.349%<<
expanded_conv_5/depthwise/depthwise input reformatter 0:   0.125ms(0.326%)   >>76.855%<<
expanded_conv_5/depthwise/depthwise output reformatter 0:  0.130ms(0.340%)   >>75.521%<<
expanded_conv_5/depthwise/Relu6                    :   0.058ms(0.151%)   >>90.174%<<
expanded_conv_5/depthwise/Relu6/relu1              :   0.050ms(0.130%)   >>92.406%<<
expanded_conv_5/depthwise/Relu6/sub1 + FeatureExtractor/MobilenetV2/expanded_conv_5/depthwise/Relu6/relu2:   0.076ms(0.200%)   >>83.716%<<

project

expanded_conv_5/project/Conv2D + FeatureExtractor/MobilenetV2/expanded_conv_5/add:   0.070ms(0.184%)   >>85.457%<<
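
The per-stage totals in the profile above can be summed with a short script. This is a sketch that assumes every profile line has the shape `name: <time>ms(<percent>%)` as shown, and that the second path component of the layer name identifies the stage; the names `PROFILE`, `LINE_RE`, and `stage_totals` are made up for this example.

```python
import re
from collections import defaultdict

# Profile lines copied from the post above.
PROFILE = """
expanded_conv_5/expand/Conv2D                      :   0.073ms(0.191%)   >>84.892%<<
expanded_conv_5/expand/Relu6                       :   0.058ms(0.152%)   >>89.569%<<
expanded_conv_5/expand/Relu6/relu1                 :   0.055ms(0.144%)   >>91.052%<<
expanded_conv_5/expand/Relu6/sub1 + FeatureExtractor/MobilenetV2/expanded_conv_5/expand/Relu6/relu2:   0.076ms(0.199%)   >>84.115%<<
expanded_conv_5/depthwise/depthwise                :   0.313ms(0.820%)   >>50.349%<<
expanded_conv_5/depthwise/depthwise input reformatter 0:   0.125ms(0.326%)   >>76.855%<<
expanded_conv_5/depthwise/depthwise output reformatter 0:  0.130ms(0.340%)   >>75.521%<<
expanded_conv_5/depthwise/Relu6                    :   0.058ms(0.151%)   >>90.174%<<
expanded_conv_5/depthwise/Relu6/relu1              :   0.050ms(0.130%)   >>92.406%<<
expanded_conv_5/depthwise/Relu6/sub1 + FeatureExtractor/MobilenetV2/expanded_conv_5/depthwise/Relu6/relu2:   0.076ms(0.200%)   >>83.716%<<
expanded_conv_5/project/Conv2D + FeatureExtractor/MobilenetV2/expanded_conv_5/add:   0.070ms(0.184%)   >>85.457%<<
"""

# Layer name, then the time in ms and its share of the total run.
LINE_RE = re.compile(r"^(.+?)\s*:\s*([\d.]+)ms\(([\d.]+)%\)")


def stage_totals(text):
    """Sum time per stage, and separately the time spent in reformatter layers."""
    totals = defaultdict(float)
    reformat_ms = 0.0
    for line in text.strip().splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a profile line
        name, ms = match.group(1), float(match.group(2))
        stage = name.split("/")[1]  # "expand", "depthwise" or "project"
        totals[stage] += ms
        if "reformatter" in name:
            reformat_ms += ms
    return dict(totals), reformat_ms


totals, reformat_ms = stage_totals(PROFILE)
for stage, ms in totals.items():
    print("%-10s %.3f ms" % (stage, ms))
print("reformatters alone: %.3f ms" % reformat_ms)
```

With these lines, the depthwise stage totals about 0.752 ms against about 0.262 ms for expand and 0.070 ms for project, and the two reformatter entries account for about 0.255 ms of the depthwise total.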

Hello,

Starting in the next TRT release (tentatively version 6), we have optimized 1x1 conv (gemm_as_1x1conv kernels) and added support for different input formats.