cuDNN 7.3 has poor performance on GeForce RTX 2080

I use a GeForce RTX 2080 + NVIDIA driver 410.57 + CUDA 10.0 + cuDNN 7.3, and my MXNet network runs slower than on a GeForce GTX 1080 + NVIDIA driver 410.57 + CUDA 10.0 + cuDNN 7.3.

Then I tried the cuDNN conv_sample and got:

On GeForce GTX 1080:

Using format CUDNN_TENSOR_NCHW (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)
Testing single precision
====USER DIMENSIONS====
input dims are 1, 32, 4, 4
filter dims are 32, 32, 1, 1
output dims are 1, 32, 4, 4
====PADDING DIMENSIONS====
padded input dims are 1, 32, 4, 4
padded filter dims are 32, 32, 1, 1
padded output dims are 1, 32, 4, 4
Testing conv
^^^^ CUDA : elapsed = 3.60012e-05 sec,
Test PASSED
Testing half precision (math in single precision)
====USER DIMENSIONS====
input dims are 1, 32, 4, 4
filter dims are 32, 32, 1, 1
output dims are 1, 32, 4, 4
====PADDING DIMENSIONS====
padded input dims are 1, 32, 4, 4
padded filter dims are 32, 32, 1, 1
padded output dims are 1, 32, 4, 4
Testing conv
^^^^ CUDA : elapsed = 2.59876e-05 sec,
Test PASSED

On GeForce RTX 2080:

Using format CUDNN_TENSOR_NCHW (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)
Testing single precision
====USER DIMENSIONS====
input dims are 1, 32, 4, 4
filter dims are 32, 32, 1, 1
output dims are 1, 32, 4, 4
====PADDING DIMENSIONS====
padded input dims are 1, 32, 4, 4
padded filter dims are 32, 32, 1, 1
padded output dims are 1, 32, 4, 4
Testing conv
^^^^ CUDA : elapsed = 5.79357e-05 sec,
Test PASSED
Testing half precision (math in single precision)
====USER DIMENSIONS====
input dims are 1, 32, 4, 4
filter dims are 32, 32, 1, 1
output dims are 1, 32, 4, 4
====PADDING DIMENSIONS====
padded input dims are 1, 32, 4, 4
padded filter dims are 32, 32, 1, 1
padded output dims are 1, 32, 4, 4
Testing conv
^^^^ CUDA : elapsed = 4.00543e-05 sec,
Test PASSED

Looking at the “^^^^ CUDA : elapsed” lines, you can see the 2080 takes more time.
I got the same result when I used cudaEvent to measure the GPU time of cudnnConvolutionForward(…).
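For reference, this is roughly how I timed the forward convolution. It is a simplified sketch, not my exact benchmark: the algorithm choice, the 100-iteration loop, and the omitted error checking are assumptions on my part; the tensor shapes match the conv_sample run above.

// Minimal sketch: timing cudnnConvolutionForward with cudaEvent (compile with nvcc, link cudnn).
// Shapes mirror the conv_sample run above; error checking omitted for brevity.
#include <cudnn.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);

    // 1x32x4x4 input, 32x32x1x1 filter, 1x32x4x4 output, single precision, NCHW
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 32, 4, 4);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 32, 32, 1, 1);
    cudnnSetConvolution2dDescriptor(convDesc, 0, 0, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 32, 4, 4);

    float *x, *w, *y;
    cudaMalloc(&x, 1 * 32 * 4 * 4 * sizeof(float));
    cudaMalloc(&w, 32 * 32 * 1 * 1 * sizeof(float));
    cudaMalloc(&y, 1 * 32 * 4 * 4 * sizeof(float));

    // Assumed algorithm; the sample may pick a different one.
    cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
    size_t wsSize = 0;
    cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc, convDesc, yDesc, algo, &wsSize);
    void *ws = nullptr;
    if (wsSize > 0) cudaMalloc(&ws, wsSize);

    const float alpha = 1.0f, beta = 0.0f;
    // Warm-up call so first-launch overhead is not measured.
    cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, w, convDesc,
                            algo, ws, wsSize, &beta, yDesc, y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < 100; ++i) {
        cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, w, convDesc,
                                algo, ws, wsSize, &beta, yDesc, y);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudnnConvolutionForward: %f ms per call\n", ms / 100.0f);
    return 0;
}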

However, when I used cudaEvent to measure the GPU time of the cuBLAS functions cublasSgemm(…) and cublasHgemm(…), I found the 2080 runs faster, and faster still when using Tensor Cores.
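And this is roughly how I timed the cuBLAS GEMM. Again a simplified sketch: the 1024x1024 matrix size and the iteration count are placeholders, not my real workload; cublasSetMathMode with CUBLAS_TENSOR_OP_MATH is what enables the Tensor Core path for cublasHgemm.

// Minimal sketch: timing cublasHgemm with cudaEvent and Tensor Cores enabled
// (compile with nvcc, link cublas). Matrix size is a placeholder; error checking omitted.
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 1024;                        // assumed square matrix size
    cublasHandle_t handle;
    cublasCreate(&handle);

    __half *a, *b, *c;
    cudaMalloc(&a, n * n * sizeof(__half));
    cudaMalloc(&b, n * n * sizeof(__half));
    cudaMalloc(&c, n * n * sizeof(__half));

    const __half alpha = __float2half(1.0f), beta = __float2half(0.0f);

    // Allow Tensor Core math for the half-precision GEMM.
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

    // Warm-up call.
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, a, n, b, n, &beta, c, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < 100; ++i) {
        cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, a, n, b, n, &beta, c, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cublasHgemm: %f ms per call\n", ms / 100.0f);
    return 0;
}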

So why does cudnnConvolutionForward(…) perform worse on the 2080? Is this a weakness of cuDNN 7.3?