Is GeForce RTX 2080 slower than GeForce GTX 1080 on small matrix-matrix multiplication?

I ran some tests on a GeForce RTX 2080 and a GeForce GTX 1080 and found that for small matrix multiplications such as [256, 256] * [256, 256], the 2080 takes more time than the 1080. It seems the 2080 stays slower until the matrix size is larger than [1024, 1024].

You can reproduce this with the cuDNN 7.3 sample (conv_sample), since that sample uses image shape [1, 32, 4, 4], which is not large. I tried it with NVIDIA driver 410.57 + CUDA 10 + cuDNN 7.3 and got these results:

On GeForce GTX 1080:

Using format CUDNN_TENSOR_NCHW (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)
Testing single precision
====USER DIMENSIONS====
input dims are 1, 32, 4, 4
filter dims are 32, 32, 1, 1
output dims are 1, 32, 4, 4
====PADDING DIMENSIONS====
padded input dims are 1, 32, 4, 4
padded filter dims are 32, 32, 1, 1
padded output dims are 1, 32, 4, 4
Testing conv
^^^^ CUDA : elapsed = 3.60012e-05 sec,
Test PASSED
Testing half precision (math in single precision)
====USER DIMENSIONS====
input dims are 1, 32, 4, 4
filter dims are 32, 32, 1, 1
output dims are 1, 32, 4, 4
====PADDING DIMENSIONS====
padded input dims are 1, 32, 4, 4
padded filter dims are 32, 32, 1, 1
padded output dims are 1, 32, 4, 4
Testing conv
^^^^ CUDA : elapsed = 2.59876e-05 sec,
Test PASSED

On GeForce RTX 2080:

Using format CUDNN_TENSOR_NCHW (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)
Testing single precision
====USER DIMENSIONS====
input dims are 1, 32, 4, 4
filter dims are 32, 32, 1, 1
output dims are 1, 32, 4, 4
====PADDING DIMENSIONS====
padded input dims are 1, 32, 4, 4
padded filter dims are 32, 32, 1, 1
padded output dims are 1, 32, 4, 4
Testing conv
^^^^ CUDA : elapsed = 5.79357e-05 sec,
Test PASSED
Testing half precision (math in single precision)
====USER DIMENSIONS====
input dims are 1, 32, 4, 4
filter dims are 32, 32, 1, 1
output dims are 1, 32, 4, 4
====PADDING DIMENSIONS====
padded input dims are 1, 32, 4, 4
padded filter dims are 32, 32, 1, 1
padded output dims are 1, 32, 4, 4
Testing conv
^^^^ CUDA : elapsed = 4.00543e-05 sec,
Test PASSED

Pay attention to the “^^^^ CUDA : elapsed” lines, and you can see that the 2080 spends more time.

You can also test the cuBLAS functions cublasSgemm(…) or cublasGemmEx(…). The 1080 is faster in all these cases, even faster than the 1080 Ti.
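
A minimal sketch of the kind of cuBLAS timing test meant here (not the exact code behind my numbers; it assumes CUDA event timing, one warm-up call, and uninitialized device buffers, since only the runtime matters):

// Hypothetical stand-alone test: time one cublasSgemm call on N x N matrices.
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main()
{
    const int N = 256;                      // matrix size to test: 256, 512, 1024, ...
    const float alpha = 1.0f, beta = 0.0f;

    float *dA, *dB, *dC;
    cudaMalloc(&dA, N * N * sizeof(float));
    cudaMalloc(&dB, N * N * sizeof(float));
    cudaMalloc(&dC, N * N * sizeof(float)); // contents irrelevant for a timing test

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Warm-up call so the kernel is loaded and the clocks have ramped up.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, dA, N, dB, N, &beta, dC, N);
    cudaDeviceSynchronize();

    // Time one call with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, dA, N, dB, N, &beta, dC, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("[%d, %d] * [%d, %d] SGEMM: %.1f us\n", N, N, N, N, 1000.0f * ms);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

Sweeping N on both cards should show where the crossover happens.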

You could use nvprof or nvvp to see how much time is spent in which kernels, and what the achieved occupancy is.

However, be advised that kernels using dynamic parallelism cannot be profiled on Compute 7.x hardware under CUDA 10. In nvprof, not even the aggregate runtime of the calling kernel that uses dynamic parallelism is shown.

I suspect that the matrix operation only executes on a small number of SMs, making this effectively a benchmark for single multiprocessor performance. So the larger number of SMs on Volta/Turing will not allow it to shine here.

I also suspect that the 1080 Ti emulates the operation on more SMs than the 2080 card, which may try to execute it on fewer SMs by using the tensor cores. This might be a reason that the 1080 Ti can outpace the 2080: multiple SMs cooperating on the 1080 Ti could be faster than a single SM doing the same operation on a Volta or Turing device.

Have you attempted to plot runtime vs matrix size for both devices?

Christian

P.S.:
SM = shader multiprocessor

Sorry to be a “besserwisser”, but it’s:

Streaming Multiprocessor (SM)

Although your interpretation is not exactly wrong :-)

@yangruiheng1 : if your total data size is 1*32*4*4 I agree it’s definitely very small.

In fact your measurements were ~

25us VS 36us [us = microseconds]

That’s a very small job and likely approaching the actual measurement accuracy (AFAIK). From experience, when I have kernels that run for a few milliseconds (1000-3000 us), the timings can easily vary by 100-200 us even in nvprof. For example, GPUs normally ramp the clocks up and down to conserve power, so warm-ups are required for accurate measurements.

For reference, just the overhead of launching a kernel used to be 4 us (I believe it is lower nowadays).

You could:

→ try running your kernel 100-1000 times and take an average, as in the sketch below.
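
A minimal sketch of that measurement pattern (the dummy kernel is just a stand-in for whatever operation is being benchmarked):

// Warm up first, then average the time over many launches.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data, int n)   // stand-in for the real workload
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    // Warm-up launches so the clocks have ramped up before timing starts.
    for (int i = 0; i < 10; ++i)
        dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    // Average over many launches to get above the measurement noise.
    const int reps = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average per launch: %.2f us\n", 1000.0f * ms / reps);

    cudaFree(d);
    return 0;
}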

Yes, I have tried to plot runtime vs matrix size for both devices, and I got:
[two plots: runtime vs matrix size for the 1080 and the 2080]
The y-axis is time (ms), and the x-axis is the shape of the matrices (e.g. [128, 128] * [128, 128]).

Have you found a similar phenomenon? The devices I compared are the 2080 and the 1080, not the 2080 and the 1080 Ti. Since some neural networks don’t use large matrix-matrix multiplications (like ImageNet ResNet-18), I want to know whether it’s worth replacing the 1080 with the 2080. And it seems the 2080 isn’t faster than the 1080 for a single convolution layer (I only tested the cuDNN function):

image size [1, 8, x, x]
filter size [32, 8, 8, 8]
pad [0, 0]
convstride[1, 1]
[plot: convolution runtime for both devices vs. image size x]

@Jimmy Pettersson :
I have tried that: I ran cublasSgemm(…) more than 50 times and calculated the average time. The result is that the 2080 performs worse than the 1080 until the matrix size is larger than [1024, 1024].

For the cuDNN function cudnnConvolutionForward(…), the 2080 is sometimes slower than the 1080 (e.g. image size [1, 8, 512, 512], filter size [32, 8, 8, 8], pad [0, 0], convstride [1, 1]).
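
For reference, a minimal sketch of how such a single cudnnConvolutionForward call can be set up and timed with those dimensions (error checking and cleanup omitted; the algorithm chosen below is just one example, not necessarily the one behind the numbers above):

// Set up and time one cudnnConvolutionForward call:
// input [1, 8, 512, 512], filter [32, 8, 8, 8], pad [0, 0], stride [1, 1], FP32.
#include <cstdio>
#include <cudnn.h>
#include <cuda_runtime.h>

int main()
{
    cudnnHandle_t cudnn;
    cudnnCreate(&cudnn);

    cudnnTensorDescriptor_t xDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               1, 8, 512, 512);

    cudnnFilterDescriptor_t wDesc;
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                               32, 8, 8, 8);

    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateConvolutionDescriptor(&convDesc);
    cudnnSetConvolution2dDescriptor(convDesc, 0, 0, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    // Let cuDNN compute the output dimensions.
    int n, c, h, w;
    cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc, &n, &c, &h, &w);

    cudnnTensorDescriptor_t yDesc;
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

    // Device buffers (contents irrelevant for a timing test).
    float *x, *filt, *y;
    cudaMalloc(&x,    (size_t)1 * 8 * 512 * 512 * sizeof(float));
    cudaMalloc(&filt, (size_t)32 * 8 * 8 * 8 * sizeof(float));
    cudaMalloc(&y,    (size_t)n * c * h * w * sizeof(float));

    // One example algorithm choice; a real test would try several.
    cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
    size_t wsSize = 0;
    cudnnGetConvolutionForwardWorkspaceSize(cudnn, xDesc, wDesc, convDesc, yDesc,
                                            algo, &wsSize);
    void *workspace = nullptr;
    if (wsSize > 0) cudaMalloc(&workspace, wsSize);

    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call, then one timed call (average over many calls in a real test).
    cudnnConvolutionForward(cudnn, &alpha, xDesc, x, wDesc, filt, convDesc,
                            algo, workspace, wsSize, &beta, yDesc, y);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudnnConvolutionForward(cudnn, &alpha, xDesc, x, wDesc, filt, convDesc,
                            algo, workspace, wsSize, &beta, yDesc, y);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("conv forward: %.3f ms, output dims %d x %d x %d x %d\n", ms, n, c, h, w);
    return 0;
}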

The FP16 performance of Turing’s tensor cores seems to be half of Volta’s. That is another reason the yellow bars in your last graph are comparatively high (worse than the 1080 cards).

Here’s the related thread talking about the performance gap
https://devtalk.nvidia.com/default/topic/1042948/cuda-programming-and-performance/2080ti-vs-titan-v/

Nice graphs!

I note that for the SGEMM the 2080 is faster already at 512x512 matrix size? (FP16 tensor core).

Looking at the larger matrix sizes your numbers make sense:

The gain for 6K*6K is around 61ms → 45ms ~ 1.3x speedup, which is similar to the compute and bandwidth increases:

1080 VS 2080
352 GB/s VS 448 GB/s
9216 GFLOPS VS 12541 GFLOPS

So the numbers don’t hold for the smaller cases, regardless of whether the tensor cores or FP16 are used. I wonder if the 2080 has trouble hiding latencies for smaller datasets, OR if this is a GDDR6 issue. If so, it might be fixed in future driver releases.

The performance artifacts people reported when comparing the GTX 1080 (GDDR5) with the GTX 1080 Ti (GDDR5X) suggest that the root cause may be increased latencies in the DRAM which cannot be completely overcome by the latency tolerance designed into the GPU. So this would make it not strictly an either-or causation, but rather the result of a combination of design compromises.

As far as I understand, GDDR6 is just a refinement of GDDR5X, so the fundamental trade-offs (presumably favorable bandwidth vs less favorable latency) have not changed. Another way to look at it is that we have reached the “coffin corner” of classical GDDR design. Since the increased bandwidth of GDDR6 benefits the majority of GPU use cases, and it remains relatively cheap compared to next-generation memory technology like HBM2, I suspect it will be with us for quite some time, especially in the consumer space.

Moore’s Law is basically dead for processors, DRAM, and mass storage, so I wouldn’t expect any dramatic improvements from here on out. We have entered the world of incremental refinements, like many other mature industries.

As a consequence, users should not simply treat a new GPU architecture as a “rising tide that lifts all boats”, but rather as a solution benefiting specific application profiles. Whether Pascal, Volta, or Turing is the most appropriate solution is best determined by benchmarking one’s specific use case(s), then making decisions based on the outcome.

Yes, it seems the 2080 isn’t better than the 1080 in all cases. But I have found that some of the problems may come from cuDNN rather than from the 2080 itself.

In the graphs I posted, the 2080 performs poorly on the convolution layer. I used nvprof to check which kernels run on the GPU, and found that when I specify the input as:
image size [1, 8, 512, 512]
filter size [32, 8, 3, 3]
pad [1, 1]
convstride[1, 1]
The kernels that run are:
1080 FP32: maxwell_scudnn_128x32_relu_small_nn 334.86us
2080 FP32: volta_scudnn_128x32_relu_small_nn_v1 237.28us
2080 FP16 tensor core: turing_fp16_s1688cudnn_fp16_256x64_ldg8_relu_f2f_exp_small_nhwc_tn_v1 416.75us

The 2080 FP32 kernel is quicker than the 1080 FP32 kernel in this case, but the 2080 FP16 tensor core kernel is slower. It’s interesting that the 1080 and 2080 FP32 kernels have 128x32 in their names, while the tensor core kernel has 256x64.

I guess NVIDIA should do some optimization work in cuDNN.

@njuffa, the latency argument makes a lot of sense.

I was trying very hard to find out why 2080Ti tensor cores were half as fast as Titan V tensor cores.

The reason is that they can only do FP32 accumulate at half speed. The Titan V tensor cores, and in fact the Quadro RTX tensor cores (!!), run at full speed.

So they did gimp the tensor cores for the consumer models of RTX.
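
For reference, “FP32 accumulate” here is the mode selected by the computeType argument of cublasGemmEx. An illustrative call with FP16 inputs and FP32 accumulation (a sketch only, not the benchmark from the linked thread) looks like this:

// One GEMM with FP16 inputs and FP32 accumulation (computeType CUDA_R_32F).
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main()
{
    const int N = 4096;
    const float alpha = 1.0f, beta = 0.0f;  // FP32 scalars because computeType is FP32

    __half *dA, *dB;
    float *dC;
    cudaMalloc(&dA, (size_t)N * N * sizeof(__half));
    cudaMalloc(&dB, (size_t)N * N * sizeof(__half));
    cudaMalloc(&dC, (size_t)N * N * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // allow tensor core kernels

    // FP16 A and B, FP32 C, FP32 accumulation: the half-rate path on consumer Turing.
    cublasStatus_t stat = cublasGemmEx(
        handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
        &alpha, dA, CUDA_R_16F, N,
                dB, CUDA_R_16F, N,
        &beta,  dC, CUDA_R_32F, N,
        CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    cudaDeviceSynchronize();
    printf("cublasGemmEx status: %d\n", (int)stat);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

Requesting CUDA_R_16F as the computeType (with an FP16 C matrix and __half alpha/beta) selects FP16 accumulation instead, which, according to the whitepaper, is the full-rate mode on the GeForce cards.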


tera mentioned it here: https://devtalk.nvidia.com/default/topic/1042948/cuda-programming-and-performance/2080ti-vs-titan-v/post/5292507/#5292507 but I didn’t notice. NVIDIA has documented this behaviour in their whitepaper.

Thank you! This document helps me a lot!