TensorRT 2x slower than cuDNN for single Conv2D (156 ms vs. 74 ms)

Description

I’m trying to use TensorRT to optimize the performance of a TensorFlow Mask R-CNN model from TF’s model zoo: mask_rcnn_resnet50_atrous_coco, from this page.

I noticed that performance with TensorRT was much worse than with cuDNN, and I traced the drop down to 4 Conv2D ops.

Looking at just the first Conv2D, which has input shape [100, 2048, 33, 33], kernel shape [256, 2048, 3, 3], and padding [1, 1], I gathered data from the cuDNN and TensorRT verbose logs and from nvprof to find out which kernel each one chooses.
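For a rough sense of the workload (my back-of-the-envelope arithmetic, not taken from the logs), this op is on the order of 1 TFLOP per forward pass, so the two timings correspond to very different effective throughputs:

# Direct-convolution FLOP count for input [100, 2048, 33, 33], kernel [256, 2048, 3, 3],
# padding 1, stride 1 -> output [100, 256, 33, 33].
N, C, H, W = 100, 2048, 33, 33
K, R, S = 256, 3, 3
flops = 2 * N * K * H * W * C * R * S            # multiply and add counted separately
print(f"{flops / 1e12:.2f} TFLOP per pass")      # ~1.03 TFLOP
print(f"cuDNN    (~74 ms):  {flops / 0.074 / 1e12:.1f} TFLOP/s effective")  # ~13.9 (Winograd does less arithmetic than this direct count)
print(f"TensorRT (~156 ms): {flops / 0.156 / 1e12:.1f} TFLOP/s effective")  # ~6.6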

With cuDNN, I can see that volta_sgemm_128x128_nn is chosen (via CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED), which takes ~74 ms:

CUDNN Found 8 fwd algorithms, choosing CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED
0) CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED - time: 74.7333 ms, Memory: 3085369344
1) CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED - time: 79.6577 ms, Memory: 3085369344
2) CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM - time: 154.622 ms, Memory: 18882048
3) CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM - time: 155.829 ms, Memory: 18882048
4) CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM - time: 201.082 ms, Memory: 0
5) CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD - time: 205.48 ms, Memory: 52429824
6) CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING - time: 261.588 ms, Memory: 4287102976
7) CUDNN_CONVOLUTION_FWD_ALGO_GEMM - time: 294.803 ms, Memory: 8028979200

In the TensorRT verbose output, the fast sgemm kernel doesn’t appear to be considered at all. TensorRT chooses volta_scudnn_128x128_relu_small_nn_v1, which takes about 156 ms.

[TensorRT] VERBOSE: Tactic: 0 time 238.706
[TensorRT] VERBOSE: Tactic: 1 time 156.551
[TensorRT] VERBOSE: Tactic: 2 time 313.44
[TensorRT] VERBOSE: Tactic: 5 time 248.929
[TensorRT] VERBOSE: Tactic: 6 time 210.186
[TensorRT] VERBOSE: Tactic: 56 time 228.554
[TensorRT] VERBOSE: Tactic: 57 time 158.353
[TensorRT] VERBOSE: Tactic: 58 time 314.843
[TensorRT] VERBOSE: Tactic: 61 time 248.857
[TensorRT] VERBOSE: Tactic: 62 time 221.985
[TensorRT] VERBOSE: Fastest Tactic: 1 Time: 156.551

This looks like a bug: TensorRT does not consider the same fast kernel that cuDNN is able to use.

Environment

GPU Type: Tesla T4
Container: nvcr.io/nvidia/tensorrt:20.12-py3
CUDA Version: 10.0, 10.2, 11.1
cuDNN Version: 7.6.3, 8.0.5
TensorRT Version: 6.0.1.5, 7.0.0.11, 7.2.2.3

Relevant Files

Logs from cudnn + nvprof: https://gist.github.com/trevor-m/2fe5f6451a7739bf7493e8b91d2a2e4c#file-cudnn-log

Logs from tensorrt verbose output + nvprof: https://gist.github.com/trevor-m/2fe5f6451a7739bf7493e8b91d2a2e4c#file-tensorrt-log

Steps To Reproduce

This script creates a network containing only the first Conv2D op, for which TRT picks a slow kernel. The verbose output will show the kernel speeds.
TensorRT python script: https://gist.github.com/trevor-m/511d17c496c7f110b9eb4715ef572496
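For readers who don’t want to open the gist, the script is roughly of this shape (this is my own minimal sketch using the TensorRT 7 Python API, not the gist itself; the tensor name "data" and the random weights are just illustrative):

import numpy as np
import tensorrt as trt

# Network holding only the problematic conv:
# input [100, 2048, 33, 33], kernel [256, 2048, 3, 3], padding [1, 1].
logger = trt.Logger(trt.Logger.VERBOSE)   # VERBOSE prints per-tactic timings
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

data = network.add_input("data", trt.float32, (100, 2048, 33, 33))
kernel = np.random.randn(256, 2048, 3, 3).astype(np.float32)
bias = np.zeros(256, dtype=np.float32)
conv = network.add_convolution(data, 256, (3, 3), kernel, bias)
conv.padding = (1, 1)
network.mark_output(conv.get_output(0))

config = builder.create_builder_config()
config.max_workspace_size = 4 << 30       # plenty of workspace for the tactic chooser
engine = builder.build_engine(network, config)   # verbose log lists tactics and their times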

Excerpt from the full network definition containing the relevant ops. The other 3 Conv2D ops are similarly affected: https://gist.github.com/trevor-m/a32ac7010a148045bb26b5ce22566220

Hi, please share the model, script, profiler output, and performance output so that we can help you better.

Alternatively, you can try running your model with the trtexec command,

or view these tips for optimizing performance.

Thanks!

Hi, thanks for the response. I’ve already included links to the profiler output and the repro script above.

@trevmorr,
From the logs we couldn’t identify the TRT or cuDNN version.
Could you please share them?

Thank you

Hi @spolisetty, thanks for the response!

I was able to reproduce this in multiple environments:
CUDA 10.0 + TRT 6.0.1.5 + CUDNN 7.6.3
CUDA 10.2 + TRT 7.0.0.11 + CUDNN 8.0.5
CUDA 10.2 + TRT 7.2.2.3 + CUDNN 8.0.5
NGC TensorRT container nvcr.io/nvidia/tensorrt:20.12-py3: CUDA 11.1 + TRT 7.2.2.1 + CUDNN 8.0.5

Hi @trevmorr,

Could you please provide us with the TVM repo and invoke script (including the model definition, in whichever format), the TVM version, and the TVM runtime executables that you have built? That would be very helpful.

Thank you.

Hi @spolisetty, thanks for the reply.

Here is a TVM script to reproduce the result.

Example output:

Compiling with cudnn.
Mean inference time (std dev): 136.11 ms (0.49 ms)
Compiling with tensorrt.
Mean inference time (std dev): 272.34 ms (0.90 ms)

It will compile and run the subgraph using cuDNN and then using TRT. You can use the latest apache/tvm commit, compiled with USE_CUDA and USE_CUDNN enabled. Optionally, enable USE_TENSORRT_CODEGEN and USE_TENSORRT_RUNTIME to reproduce the TRT results as well; tutorial here (Relay TensorRT Integration — tvm 0.8.dev0 documentation).
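For reference, the skeleton of such a script looks roughly like the following (my own sketch against the TVM 0.8.dev0 Relay APIs, not the actual gist; it assumes a build where tvm.contrib.graph_executor exists, which is named graph_runtime on slightly older commits):

import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor
from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt

# Single conv2d matching the slow op: NCHW [100, 2048, 33, 33] * OIHW [256, 2048, 3, 3], padding 1.
data = relay.var("data", shape=(100, 2048, 33, 33), dtype="float32")
weight = relay.const(np.random.randn(256, 2048, 3, 3).astype("float32"))
out = relay.nn.conv2d(data, weight, padding=(1, 1), channels=256, kernel_size=(3, 3))
mod = tvm.IRModule.from_expr(relay.Function([data], out))

# cuDNN path: offload conv2d to cuDNN via -libs=cudnn.
with tvm.transform.PassContext(opt_level=3):
    lib_cudnn = relay.build(mod, target="cuda -libs=cudnn")

# TensorRT path: partition the graph for the TRT BYOC runtime.
trt_mod, config = partition_for_tensorrt(mod)
with tvm.transform.PassContext(opt_level=3, config={"relay.ext.tensorrt.options": config}):
    lib_trt = relay.build(trt_mod, target="cuda")

# Benchmark each build with the graph executor's time_evaluator.
dev = tvm.gpu(0)
x = np.random.uniform(size=(100, 2048, 33, 33)).astype("float32")
for name, lib in [("cudnn", lib_cudnn), ("tensorrt", lib_trt)]:
    m = graph_executor.GraphModule(lib["default"](dev))
    m.set_input("data", x)
    ftimer = m.module.time_evaluator("run", dev, number=1, repeat=10)
    prof = np.array(ftimer().results) * 1000  # ms
    print(f"{name}: mean {prof.mean():.2f} ms (std {prof.std():.2f} ms)")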
