I use a GeForce RTX 2080 with NVIDIA driver 410.57, CUDA 10.0, and cuDNN 7.3, and I find my MXNet network runs slower than on a GeForce GTX 1080 with the same driver 410.57, CUDA 10.0, and cuDNN 7.3.

Then I ran the cuDNN conv_sample and got:

On GeForce GTX 1080:

Using format CUDNN_TENSOR_NCHW (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)

Testing single precision

====USER DIMENSIONS====

input dims are 1, 32, 4, 4

filter dims are 32, 32, 1, 1

output dims are 1, 32, 4, 4

====PADDING DIMENSIONS====

padded input dims are 1, 32, 4, 4

padded filter dims are 32, 32, 1, 1

padded output dims are 1, 32, 4, 4

Testing conv

^^^^ CUDA : elapsed = 3.60012e-05 sec,

Test PASSED

Testing half precision (math in single precision)

====USER DIMENSIONS====

input dims are 1, 32, 4, 4

filter dims are 32, 32, 1, 1

output dims are 1, 32, 4, 4

====PADDING DIMENSIONS====

padded input dims are 1, 32, 4, 4

padded filter dims are 32, 32, 1, 1

padded output dims are 1, 32, 4, 4

Testing conv

^^^^ CUDA : elapsed = 2.59876e-05 sec,

Test PASSED

On GeForce RTX 2080:

Using format CUDNN_TENSOR_NCHW (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)

Testing single precision

====USER DIMENSIONS====

input dims are 1, 32, 4, 4

filter dims are 32, 32, 1, 1

output dims are 1, 32, 4, 4

====PADDING DIMENSIONS====

padded input dims are 1, 32, 4, 4

padded filter dims are 32, 32, 1, 1

padded output dims are 1, 32, 4, 4

Testing conv

^^^^ CUDA : elapsed = 5.79357e-05 sec,

Test PASSED

Testing half precision (math in single precision)

====USER DIMENSIONS====

input dims are 1, 32, 4, 4

filter dims are 32, 32, 1, 1

output dims are 1, 32, 4, 4

====PADDING DIMENSIONS====

padded input dims are 1, 32, 4, 4

padded filter dims are 32, 32, 1, 1

padded output dims are 1, 32, 4, 4

Testing conv

^^^^ CUDA : elapsed = 4.00543e-05 sec,

Test PASSED

Pay attention to the "^^^^ CUDA : elapsed" lines: the 2080 takes more time in both cases (5.79e-05 s vs. 3.60e-05 s in single precision, 4.01e-05 s vs. 2.60e-05 s in half precision).

I got the same result when I used cudaEvent to record the time spent on the GPU by the function cudnnConvolutionForward(…).

However, when I used cudaEvent to record the time spent in the cuBLAS functions cublasSgemm(…) and cublasHgemm(…), the 2080 ran faster, and faster still when using Tensor Cores.
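The measurement pattern was roughly the following sketch (a minimal illustration, not the exact harness: descriptor and buffer setup is omitted, and all handles, descriptors, and device buffers passed in are assumed to be created and initialized elsewhere):

```cpp
#include <cuda_runtime.h>
#include <cudnn.h>

// Time one cudnnConvolutionForward call on the GPU timeline using CUDA events.
// Returns the elapsed time in milliseconds.
float timeConvForward(cudnnHandle_t handle,
                      const cudnnTensorDescriptor_t xDesc, const void* x,
                      const cudnnFilterDescriptor_t wDesc, const void* w,
                      const cudnnConvolutionDescriptor_t convDesc,
                      cudnnConvolutionFwdAlgo_t algo,
                      void* workspace, size_t workspaceSize,
                      const cudnnTensorDescriptor_t yDesc, void* y) {
    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up call so one-time costs (kernel loading, etc.) are excluded.
    cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, w, convDesc,
                            algo, workspace, workspaceSize, &beta, yDesc, y);

    cudaEventRecord(start);
    cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, w, convDesc,
                            algo, workspace, workspaceSize, &beta, yDesc, y);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the stop event has occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

The same wrapping around cublasSgemm(…) / cublasHgemm(…) gives the cuBLAS numbers. Note that with problem sizes this small (1x32x4x4 input, 1x1 filters), the elapsed time is dominated by launch overhead rather than arithmetic throughput.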

So why does cudnnConvolutionForward(…) perform worse on the 2080? Is this a weakness of cuDNN 7.3?