I ran some tests on a GeForce RTX 2080 and a GeForce GTX 1080, and found that for small matrix multiplications like [256, 256] * [256, 256], the 2080 takes more time than the 1080. The 2080 seems to remain slower until the matrix size grows beyond [1024, 1024].

You can reproduce this with the cuDNN 7.3 sample (conv_sample), since that sample uses the image shape [1, 32, 4, 4], which is small enough to show the effect. I ran it with NVIDIA driver 410.57 + CUDA 10 + cuDNN 7.3 and got these results:

On GeForce GTX 1080:

Using format CUDNN_TENSOR_NCHW (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)

Testing single precision

====USER DIMENSIONS====

input dims are 1, 32, 4, 4

filter dims are 32, 32, 1, 1

output dims are 1, 32, 4, 4

====PADDING DIMENSIONS====

padded input dims are 1, 32, 4, 4

padded filter dims are 32, 32, 1, 1

padded output dims are 1, 32, 4, 4

Testing conv

^^^^ CUDA : elapsed = 3.60012e-05 sec,

Test PASSED

Testing half precision (math in single precision)

====USER DIMENSIONS====

input dims are 1, 32, 4, 4

filter dims are 32, 32, 1, 1

output dims are 1, 32, 4, 4

====PADDING DIMENSIONS====

padded input dims are 1, 32, 4, 4

padded filter dims are 32, 32, 1, 1

padded output dims are 1, 32, 4, 4

Testing conv

^^^^ CUDA : elapsed = 2.59876e-05 sec,

Test PASSED

On GeForce RTX 2080:

Using format CUDNN_TENSOR_NCHW (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)

Testing single precision

====USER DIMENSIONS====

input dims are 1, 32, 4, 4

filter dims are 32, 32, 1, 1

output dims are 1, 32, 4, 4

====PADDING DIMENSIONS====

padded input dims are 1, 32, 4, 4

padded filter dims are 32, 32, 1, 1

padded output dims are 1, 32, 4, 4

Testing conv

^^^^ CUDA : elapsed = 5.79357e-05 sec,

Test PASSED

Testing half precision (math in single precision)

====USER DIMENSIONS====

input dims are 1, 32, 4, 4

filter dims are 32, 32, 1, 1

output dims are 1, 32, 4, 4

====PADDING DIMENSIONS====

padded input dims are 1, 32, 4, 4

padded filter dims are 32, 32, 1, 1

padded output dims are 1, 32, 4, 4

Testing conv

^^^^ CUDA : elapsed = 4.00543e-05 sec,

Test PASSED

Pay attention to the “^^^^ CUDA : elapsed” lines: the 2080 spends more time than the 1080 in both precisions.

You can also test the cuBLAS functions cublasSgemm(…) or cublasGemmEx(…). The 1080 is faster in all of these small-matrix cases, and even faster than the 1080 Ti.
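For reference, a minimal timing sketch for the cublasSgemm case might look like the following. This is an illustration I wrote, not the exact benchmark from the tests above; it assumes a CUDA toolkit installation (build with something like `nvcc gemm_time.cu -lcublas -o gemm_time`), uses square matrices of a configurable size `n`, and times the steady-state call with CUDA events after a warm-up iteration. Error checking is omitted for brevity.

```cpp
// Sketch: time cublasSgemm on an n x n problem. Requires an NVIDIA GPU.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 256;                 // try 256 vs 1024 and larger
    const float alpha = 1.0f, beta = 0.0f;

    // Fill host matrices with a constant; values don't affect timing.
    std::vector<float> h(n * n, 1.0f);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, h.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, h.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Warm up once so lazy library initialization is not timed.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();

    // Time many iterations with CUDA events and report the per-call average.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const int iters = 100;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("[%d x %d] SGEMM: %.3f us per call\n", n, n, 1000.0f * ms / iters);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Running this on each card with n = 256 versus n = 1024 (or larger) should show where the crossover between the two GPUs occurs.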