I use a GeForce RTX 2080 with NVIDIA driver 410.57, CUDA 10.0, and cuDNN 7.3, and I find my MXNet network runs slower than on a GeForce GTX 1080 with the same driver 410.57, CUDA 10.0, and cuDNN 7.3.

Then I ran the cuDNN conv_sample and got:

On GeForce GTX 1080:

Using format CUDNN_TENSOR_NCHW (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)

Testing single precision

====USER DIMENSIONS====

input dims are 1, 32, 4, 4

filter dims are 32, 32, 1, 1

output dims are 1, 32, 4, 4

====PADDING DIMENSIONS====

padded input dims are 1, 32, 4, 4

padded filter dims are 32, 32, 1, 1

padded output dims are 1, 32, 4, 4

Testing conv

^^^^ CUDA : elapsed = 3.60012e-05 sec,

Test PASSED

Testing half precision (math in single precision)

====USER DIMENSIONS====

input dims are 1, 32, 4, 4

filter dims are 32, 32, 1, 1

output dims are 1, 32, 4, 4

====PADDING DIMENSIONS====

padded input dims are 1, 32, 4, 4

padded filter dims are 32, 32, 1, 1

padded output dims are 1, 32, 4, 4

Testing conv

^^^^ CUDA : elapsed = 2.59876e-05 sec,

Test PASSED

On GeForce RTX 2080:

Using format CUDNN_TENSOR_NCHW (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)

Testing single precision

====USER DIMENSIONS====

input dims are 1, 32, 4, 4

filter dims are 32, 32, 1, 1

output dims are 1, 32, 4, 4

====PADDING DIMENSIONS====

padded input dims are 1, 32, 4, 4

padded filter dims are 32, 32, 1, 1

padded output dims are 1, 32, 4, 4

Testing conv

^^^^ CUDA : elapsed = 5.79357e-05 sec,

Test PASSED

Testing half precision (math in single precision)

====USER DIMENSIONS====

input dims are 1, 32, 4, 4

filter dims are 32, 32, 1, 1

output dims are 1, 32, 4, 4

====PADDING DIMENSIONS====

padded input dims are 1, 32, 4, 4

padded filter dims are 32, 32, 1, 1

padded output dims are 1, 32, 4, 4

Testing conv

^^^^ CUDA : elapsed = 4.00543e-05 sec,

Test PASSED

Pay attention to the "^^^^ CUDA : elapsed" lines: the 2080 takes more time in both cases (5.79e-05 s vs. 3.60e-05 s in single precision, 4.01e-05 s vs. 2.60e-05 s in half precision).

I got the same result when I used cudaEvent to record the time spent on the GPU by the function cudnnConvolutionForward(…).

However, when I used cudaEvent to record the time spent in the cuBLAS functions cublasSgemm(…) and cublasHgemm(…), the 2080 ran faster, and faster still when using Tensor Cores.
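The measurement pattern was roughly the following sketch (a minimal illustration, not the exact harness: descriptor and buffer setup is omitted, and all handles, descriptors, and device buffers passed in are assumed to be created and initialized elsewhere):

```cpp
#include <cuda_runtime.h>
#include <cudnn.h>

// Time one cudnnConvolutionForward call on the GPU timeline using CUDA events.
// Returns the elapsed time in milliseconds.
float timeConvForward(cudnnHandle_t handle,
                      const cudnnTensorDescriptor_t xDesc, const void* x,
                      const cudnnFilterDescriptor_t wDesc, const void* w,
                      const cudnnConvolutionDescriptor_t convDesc,
                      cudnnConvolutionFwdAlgo_t algo,
                      void* workspace, size_t workspaceSize,
                      const cudnnTensorDescriptor_t yDesc, void* y) {
    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up call so one-time costs (kernel loading, etc.) are excluded.
    cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, w, convDesc,
                            algo, workspace, workspaceSize, &beta, yDesc, y);

    cudaEventRecord(start);
    cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, w, convDesc,
                            algo, workspace, workspaceSize, &beta, yDesc, y);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the stop event has occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

The same wrapping around cublasSgemm(…) / cublasHgemm(…) gives the cuBLAS numbers. Note that with problem sizes this small (1x32x4x4 input, 1x1 filters), the elapsed time is dominated by launch overhead rather than arithmetic throughput.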

So why does cudnnConvolutionForward(…) perform worse on the 2080? Is this a weakness of cuDNN 7.3?