I noticed that my performance with TensorRT was much worse than with cuDNN, and I traced the drop down to 4 Conv2d ops.
Looking at just the first Conv2d, which has input shape [100, 2048, 33, 33], kernel shape [256, 2048, 3, 3], and padding [1, 1], I gathered data from the cuDNN and TensorRT verbose logs and from nvprof to find out which kernel each library chooses.
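For reference, this is a stride-1 "same" convolution. A quick sketch (the helper name is mine, not from the repro script) confirms the output shape implied by those parameters:

```python
# Hypothetical helper: compute the output spatial size of a standard 2D
# convolution, to sanity-check the layer described above.
def conv2d_out_dim(in_dim, kernel, pad, stride=1, dilation=1):
    eff_k = dilation * (kernel - 1) + 1
    return (in_dim + 2 * pad - eff_k) // stride + 1

# Input [100, 2048, 33, 33], kernel [256, 2048, 3, 3], padding [1, 1]:
n, c_in, h, w = 100, 2048, 33, 33
c_out, _, kh, kw = 256, 2048, 3, 3
out_shape = [n, c_out, conv2d_out_dim(h, kh, 1), conv2d_out_dim(w, kw, 1)]
print(out_shape)  # [100, 256, 33, 33] -- spatial size is preserved
```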
With cuDNN, I can see that volta_sgemm_128x128_nn is chosen, which takes ~74 ms:
CUDNN Found 8 fwd algorithms, choosing CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED
0) CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED - time: 74.7333 ms, Memory: 3085369344
1) CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED - time: 79.6577 ms, Memory: 3085369344
2) CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM - time: 154.622 ms, Memory: 18882048
3) CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM - time: 155.829 ms, Memory: 18882048
4) CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM - time: 201.082 ms, Memory: 0
5) CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD - time: 205.48 ms, Memory: 52429824
6) CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING - time: 261.588 ms, Memory: 4287102976
7) CUDNN_CONVOLUTION_FWD_ALGO_GEMM - time: 294.803 ms, Memory: 8028979200
In the TensorRT verbose output, the fast sgemm kernel doesn’t appear to be considered at all. TensorRT chooses volta_scudnn_128x128_relu_small_nn_v1 instead, which takes about 156 ms:
[TensorRT] VERBOSE: Tactic: 0 time 238.706
[TensorRT] VERBOSE: Tactic: 1 time 156.551
[TensorRT] VERBOSE: Tactic: 2 time 313.44
[TensorRT] VERBOSE: Tactic: 5 time 248.929
[TensorRT] VERBOSE: Tactic: 6 time 210.186
[TensorRT] VERBOSE: Tactic: 56 time 228.554
[TensorRT] VERBOSE: Tactic: 57 time 158.353
[TensorRT] VERBOSE: Tactic: 58 time 314.843
[TensorRT] VERBOSE: Tactic: 61 time 248.857
[TensorRT] VERBOSE: Tactic: 62 time 221.985
[TensorRT] VERBOSE: Fastest Tactic: 1 Time: 156.551
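For anyone comparing longer runs of this output, a small sketch (not part of the repro script; the regex and variable names are mine) that pulls the per-tactic timings out of the verbose log and finds the fastest one:

```python
import re

# Sample of the TensorRT verbose autotuner output quoted above.
log = """\
[TensorRT] VERBOSE: Tactic: 0 time 238.706
[TensorRT] VERBOSE: Tactic: 1 time 156.551
[TensorRT] VERBOSE: Tactic: 2 time 313.44
"""

# Map tactic id -> measured time in ms.
tactics = {int(m.group(1)): float(m.group(2))
           for m in re.finditer(r"Tactic: (\d+) time ([\d.]+)", log)}
fastest = min(tactics, key=tactics.get)
print(fastest, tactics[fastest])  # 1 156.551
```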
This looks like a bug: TensorRT does not consider the same fast kernel that cuDNN can use.
GPU Type: Tesla T4
CUDA version: 10.0, 10.2, 11.1
cuDNN Version: 7.6.3, 8.0.5
TensorRT version: 126.96.36.199, 188.8.131.52, 184.108.40.206
Logs from cuDNN + nvprof: https://gist.github.com/trevor-m/2fe5f6451a7739bf7493e8b91d2a2e4c#file-cudnn-log
Logs from TensorRT verbose output + nvprof: https://gist.github.com/trevor-m/2fe5f6451a7739bf7493e8b91d2a2e4c#file-tensorrt-log
Steps To Reproduce
This script creates a network containing the first conv2d op for which TRT picks a slow kernel. The verbose output will show the kernel timings.
TensorRT python script: https://gist.github.com/trevor-m/511d17c496c7f110b9eb4715ef572496
Excerpt from the full network definition containing the relevant ops; the other 3 conv2d’s are similarly affected by the problem: https://gist.github.com/trevor-m/a32ac7010a148045bb26b5ce22566220