Deconvolution Layer runs super slow in TensorRT


I ran into a problem porting a Caffe model to TensorRT. All the other layers work great, but the last deconvolution layer runs super slow. Here’s the profiling data of the model running with TensorRT.

conv1                                    0.088ms
relu_conv1                               0.081ms
conv2                                    0.275ms
relu_conv2                               0.082ms
conv3_1                                  0.274ms
relu_conv3_1                             0.061ms
conv3_2                                  0.272ms
relu_conv3_2                             0.041ms
conv3_3                                  0.198ms
relu_conv3_3                             0.081ms
slice1                                   0.084ms
conv3_4                                  0.231ms
relu_conv3_4                             0.082ms
conv3_5                                  0.274ms
relu_conv3_5                             0.076ms
conv3_6                                  0.373ms
relu_conv3_6                             0.101ms
conv2 copy                               0.082ms
slice1_1 copy                            0.024ms
sum1                                     0.149ms
down1                                    0.115ms
relu_down1                               0.080ms
conv4_1                                  0.272ms
relu_conv4_1                             0.061ms
conv4_2                                  0.270ms
relu_conv4_2                             0.041ms
conv4_3                                  0.201ms
relu_conv4_3                             0.081ms
slice2                                   0.084ms
conv4_4                                  0.231ms
relu_conv4_4                             0.108ms
conv4_5                                  0.297ms
relu_conv4_5                             0.061ms
conv4_6                                  0.323ms
relu_conv4_6                             0.101ms
down1 copy                               0.082ms
slice2_1 copy                            0.023ms
sum2                                     0.150ms
down2                                    0.115ms
relu_down2                               0.080ms
conv5_1                                  0.275ms
relu_conv5_1                             0.061ms
conv5_2                                  0.270ms
relu_conv5_2                             0.041ms
conv5_3                                  0.201ms
relu_conv5_3                             0.081ms
slice3                                   0.084ms
conv5_4                                  0.229ms
relu_conv5_4                             0.081ms
conv5_5                                  0.273ms
relu_conv5_5                             0.061ms
conv5_6                                  0.324ms
relu_conv5_6                             0.101ms
down2 copy                               0.082ms
slice3_1 copy                            0.023ms
sum3                                     0.150ms
down3                                    0.114ms
relu_down3                               0.080ms
conv6_1                                  0.274ms
relu_conv6_1                             0.061ms
conv6_2                                  0.315ms
relu_conv6_2                             0.072ms
conv6_3                                  0.200ms
relu_conv6_3                             0.081ms
slice4                                   0.083ms
conv6_4                                  0.230ms
relu_conv6_4                             0.081ms
conv6_5                                  0.273ms
relu_conv6_5                             0.061ms
conv6_6                                  0.324ms
relu_conv6_6                             0.101ms
down3 copy                               0.082ms
slice4_1 copy                            0.023ms
sum4                                     0.150ms
down4                                    0.115ms
relu_down4                               0.080ms
upsample                                 224.982ms
Time over all layers: 235.824

Here’s the Caffe prototxt of this last deconv layer.

layer {
    name: "upsample"
    type: "Deconvolution"
    bottom: "down4"
    top: "upsample"
    convolution_param {
        kernel_size: 17
        stride: 2
        num_output: 1
        pad: 8
    }
}

I’m using an NVIDIA Titan Xp GPU. The whole model takes around 16ms running with Caffe (without TensorRT acceleration), so there’s no reason for this layer to add more than 200ms of latency. Can anyone help me with this issue? Thanks.
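For a rough sense of how much work this layer should be, here is a minimal sketch of the standard transposed-convolution size formula (dilation 1) applied to the prototxt values above, plus a multiply-accumulate count. The 64x64 input and 32 input channels are made-up placeholders, since the thread does not give the shape of down4; the function names are hypothetical too.

```python
def deconv_output_size(in_size: int, kernel: int, stride: int, pad: int) -> int:
    """Output length along one spatial axis of a Deconvolution layer (dilation 1)."""
    return stride * (in_size - 1) + kernel - 2 * pad


def deconv_macs(in_h: int, in_w: int, in_channels: int, num_output: int, kernel: int) -> int:
    """Multiply-accumulates: every input element is scattered through a
    kernel x kernel window once per output channel."""
    return in_h * in_w * in_channels * num_output * kernel * kernel


# Values from the prototxt: kernel_size 17, stride 2, pad 8, num_output 1.
# ASSUMPTION: a 64x64 input with 32 channels (not stated in the thread).
out_size = deconv_output_size(64, kernel=17, stride=2, pad=8)
macs = deconv_macs(64, 64, in_channels=32, num_output=1, kernel=17)
print(out_size, macs)
```

With kernel 17, stride 2 and pad 8 the formula reduces to 2 * in_size - 1, so the layer roughly doubles each spatial dimension. Plugging in the real shape of down4 gives a work estimate to compare against the conv layers in the profile.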

With the current release (June 2018), it seems they fixed the slow inference time of deconvolution layers (check the attached picture or this link; I think we don’t have previews):

I haven’t tried it yet, but I’ll come back to this question. I had so many issues getting deconvolutions to work on my deployment.

EDIT: Never mind, it seems the times are still super slow. Tested with TensorRT 4.1 on a 1080 Ti with FCN-AlexNet and a 1280x720 image, I got:

[TRT]  layer shift - 0.919552 ms
[TRT]  layer conv1 + relu1 - 1.481728 ms
[TRT]  layer pool1 - 0.116832 ms
[TRT]  layer norm1 - 0.056224 ms
[TRT]  layer conv2 + relu2 - 1.145856 ms
[TRT]  layer pool2 - 0.087520 ms
[TRT]  layer norm2 - 0.097824 ms
[TRT]  layer conv3 + relu3 - 0.546816 ms
[TRT]  layer conv4 + relu4 - 0.425984 ms
[TRT]  layer conv5 + relu5 - 0.280576 ms
[TRT]  layer pool5 - 0.027648 ms
[TRT]  layer fc6 + relu6 - 8.728576 ms
[TRT]  layer fc7 + relu7 - 3.900512 ms
[TRT]  layer score_fr - 0.097184 ms
[TRT]  layer upscore - 3986.665527 ms
[TRT]  layer network time - 4004.578369 ms
[TRT]  segNet::Overlay -- s_w 1343  s_h 767  s_c 21  s_x 1.049219  s_y 1.065278
[TRT]  segNet::Overlay -- ignoring class 'void' id=0
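To put a number on how much of the run the deconvolution takes, here is a minimal sketch that parses "[TRT]  layer NAME - X ms" lines like the ones above and reports each layer's share of the summed layer time. The regex and the function name layer_shares are my own assumptions about the log format, not part of TensorRT or jetson-inference, and only a few lines of the log are reproduced as sample input.

```python
import re

# Matches lines such as "[TRT]  layer conv1 + relu1 - 1.481728 ms".
LINE = re.compile(r"\[TRT\]\s+layer (.+?) - ([0-9.]+) ms")


def layer_shares(log: str) -> dict:
    """Return each layer's fraction of the summed per-layer time."""
    times = {}
    for match in LINE.finditer(log):
        name, ms = match.group(1), float(match.group(2))
        if name != "network time":  # skip the summary line, it is not a layer
            times[name] = ms
    total = sum(times.values())
    return {name: ms / total for name, ms in times.items()}


# A few lines copied from the log above.
SAMPLE = """\
[TRT]  layer fc6 + relu6 - 8.728576 ms
[TRT]  layer score_fr - 0.097184 ms
[TRT]  layer upscore - 3986.665527 ms
[TRT]  layer network time - 4004.578369 ms
"""

shares = layer_shares(SAMPLE)
slowest = max(shares, key=shares.get)
print(slowest, f"{shares[slowest]:.1%}")
```

Feeding it the full log instead of the sample gives the share over all layers; either way upscore dominates the run.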