I ran some tests on a GeForce RTX 2080 and a GeForce GTX 1080, and found that for small matrix multiplications like [256, 256] * [256, 256], the 2080 takes more time than the 1080. The 2080 seems to remain slower until the matrix size grows beyond [1024, 1024].

You can reproduce this with the cuDNN 7.3 sample (conv_sample), since that sample uses the image shape [1, 32, 4, 4], which is small enough to show the effect. I ran it with NVIDIA driver 410.57 + CUDA 10 + cuDNN 7.3 and got these results:

On GeForce GTX 1080:

Using format CUDNN_TENSOR_NCHW (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)

Testing single precision

====USER DIMENSIONS====

input dims are 1, 32, 4, 4

filter dims are 32, 32, 1, 1

output dims are 1, 32, 4, 4

====PADDING DIMENSIONS====

padded input dims are 1, 32, 4, 4

padded filter dims are 32, 32, 1, 1

padded output dims are 1, 32, 4, 4

Testing conv

^^^^ CUDA : elapsed = 3.60012e-05 sec,

Test PASSED

Testing half precision (math in single precision)

====USER DIMENSIONS====

input dims are 1, 32, 4, 4

filter dims are 32, 32, 1, 1

output dims are 1, 32, 4, 4

====PADDING DIMENSIONS====

padded input dims are 1, 32, 4, 4

padded filter dims are 32, 32, 1, 1

padded output dims are 1, 32, 4, 4

Testing conv

^^^^ CUDA : elapsed = 2.59876e-05 sec,

Test PASSED

On GeForce RTX 2080:

Using format CUDNN_TENSOR_NCHW (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)

Testing single precision

====USER DIMENSIONS====

input dims are 1, 32, 4, 4

filter dims are 32, 32, 1, 1

output dims are 1, 32, 4, 4

====PADDING DIMENSIONS====

padded input dims are 1, 32, 4, 4

padded filter dims are 32, 32, 1, 1

padded output dims are 1, 32, 4, 4

Testing conv

^^^^ CUDA : elapsed = 5.79357e-05 sec,

Test PASSED

Testing half precision (math in single precision)

====USER DIMENSIONS====

input dims are 1, 32, 4, 4

filter dims are 32, 32, 1, 1

output dims are 1, 32, 4, 4

====PADDING DIMENSIONS====

padded input dims are 1, 32, 4, 4

padded filter dims are 32, 32, 1, 1

padded output dims are 1, 32, 4, 4

Testing conv

^^^^ CUDA : elapsed = 4.00543e-05 sec,

Test PASSED

Pay attention to the “^^^^ CUDA : elapsed” lines: the 2080 spends more time than the 1080 in both precisions.

You can also test the cuBLAS functions cublasSgemm(…) or cublasGemmEx(…). The 1080 is faster in all of these small-matrix cases, and even faster than the 1080 Ti.
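For reference, a minimal timing sketch for the cublasSgemm case might look like the following. This is an illustration I wrote, not the exact benchmark from the tests above; it assumes a CUDA toolkit installation (build with something like `nvcc gemm_time.cu -lcublas -o gemm_time`), uses square matrices of a configurable size `n`, and times the steady-state call with CUDA events after a warm-up iteration. Error checking is omitted for brevity.

```cpp
// Sketch: time cublasSgemm on an n x n problem. Requires an NVIDIA GPU.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 256;                 // try 256 vs 1024 and larger
    const float alpha = 1.0f, beta = 0.0f;

    // Fill host matrices with a constant; values don't affect timing.
    std::vector<float> h(n * n, 1.0f);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, h.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, h.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Warm up once so lazy library initialization is not timed.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();

    // Time many iterations with CUDA events and report the per-call average.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const int iters = 100;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("[%d x %d] SGEMM: %.3f us per call\n", n, n, 1000.0f * ms / iters);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Running this on each card with n = 256 versus n = 1024 (or larger) should show where the crossover between the two GPUs occurs.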