Is GeForce RTX 2080 slower than GeForce GTX 1080 on small matrix-matrix multiplication?

I ran some tests on a GeForce RTX 2080 and a GeForce GTX 1080 and found that for small matrix multiplications such as [256, 256] * [256, 256], the 2080 takes more time than the 1080. It seems the 2080 stays slower until the matrix size is larger than [1024, 1024].

You can reproduce this with the cuDNN 7.3 sample (conv_sample), since that sample uses image shape [1, 32, 4, 4], which is not large. I tried it with NVIDIA driver 410.57 + CUDA 10 + cuDNN 7.3 and got these results:

On GeForce GTX 1080:

Using format CUDNN_TENSOR_NCHW (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)
Testing single precision
====USER DIMENSIONS====
input dims are 1, 32, 4, 4
filter dims are 32, 32, 1, 1
output dims are 1, 32, 4, 4
====PADDING DIMENSIONS====
padded input dims are 1, 32, 4, 4
padded filter dims are 32, 32, 1, 1
padded output dims are 1, 32, 4, 4
Testing conv
^^^^ CUDA : elapsed = 3.60012e-05 sec,
Test PASSED
Testing half precision (math in single precision)
====USER DIMENSIONS====
input dims are 1, 32, 4, 4
filter dims are 32, 32, 1, 1
output dims are 1, 32, 4, 4
====PADDING DIMENSIONS====
padded input dims are 1, 32, 4, 4
padded filter dims are 32, 32, 1, 1
padded output dims are 1, 32, 4, 4
Testing conv
^^^^ CUDA : elapsed = 2.59876e-05 sec,
Test PASSED

On GeForce RTX 2080:

Using format CUDNN_TENSOR_NCHW (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)
Testing single precision
====USER DIMENSIONS====
input dims are 1, 32, 4, 4
filter dims are 32, 32, 1, 1
output dims are 1, 32, 4, 4
====PADDING DIMENSIONS====
padded input dims are 1, 32, 4, 4
padded filter dims are 32, 32, 1, 1
padded output dims are 1, 32, 4, 4
Testing conv
^^^^ CUDA : elapsed = 5.79357e-05 sec,
Test PASSED
Testing half precision (math in single precision)
====USER DIMENSIONS====
input dims are 1, 32, 4, 4
filter dims are 32, 32, 1, 1
output dims are 1, 32, 4, 4
====PADDING DIMENSIONS====
padded input dims are 1, 32, 4, 4
padded filter dims are 32, 32, 1, 1
padded output dims are 1, 32, 4, 4
Testing conv
^^^^ CUDA : elapsed = 4.00543e-05 sec,
Test PASSED

Pay attention to the “^^^^ CUDA : elapsed” lines, and you can see that the 2080 spends more time.

You can also test the cuBLAS functions cublasSgemm(…) or cublasGemmEx(…). The 1080 is faster in all these cases, even faster than the 1080 Ti.
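
A minimal sketch of the kind of cuBLAS timing test meant here (not the exact code behind my numbers; it assumes CUDA event timing, one warm-up call, and uninitialized device buffers, since only the runtime matters):

// Hypothetical stand-alone test: time one cublasSgemm call on N x N matrices.
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main()
{
    const int N = 256;                      // matrix size to test: 256, 512, 1024, ...
    const float alpha = 1.0f, beta = 0.0f;

    float *dA, *dB, *dC;
    cudaMalloc(&dA, N * N * sizeof(float));
    cudaMalloc(&dB, N * N * sizeof(float));
    cudaMalloc(&dC, N * N * sizeof(float)); // contents irrelevant for a timing test

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Warm-up call so the kernel is loaded and the clocks have ramped up.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, dA, N, dB, N, &beta, dC, N);
    cudaDeviceSynchronize();

    // Time one call with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, dA, N, dB, N, &beta, dC, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("[%d, %d] * [%d, %d] SGEMM: %.1f us\n", N, N, N, N, 1000.0f * ms);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

Sweeping N on both cards should show where the crossover happens.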

You could use nvprof or nvvp to see how much time is spent in which kernels, and what the achieved occupancy is.

However, be advised that kernels using dynamic parallelism cannot be profiled on Compute 7.x hardware under CUDA 10. In nvprof, not even the aggregate runtime of the calling kernel that uses dynamic parallelism is shown.

I suspect that the matrix operation only executes on a small number of SMs, making this effectively a benchmark for single multiprocessor performance. So the larger number of SMs on Volta/Turing will not allow it to shine here.

I also suspect that the 1080 Ti emulates the operation on more SMs than the 2080 card, which may try to execute it on fewer SMs by using the tensor cores. This might be a reason that the 1080 Ti can outpace the 2080: multiple SMs cooperating on the 1080 Ti could be faster than a single SM doing the same operation on a Volta or Turing device.

Have you attempted to plot runtime vs matrix size for both devices?

Christian

P.S.:
SM = shader multiprocessor

Sorry to be a “besserwisser”, but it’s:

Streaming Multiprocessor (SM)

Although your interpretation is not exactly wrong :-)

@yangruiheng1 : if your total data size is 1*32*4*4 I agree it’s definitely very small.

In fact your measurements were ~

25us VS 36us [us = microseconds]

That’s a very small job and likely approaching the actual measurement accuracy (AFAIK). From experience, when I have kernels that run for a few milliseconds (1000-3000 us), the timings can easily vary by 100-200 us even in nvprof. For example, GPUs normally ramp the clocks up and down to conserve power, so warm-ups are required for accurate measurements.

For reference, just the overhead of launching a kernel used to be 4 us (I believe it is lower nowadays).

You could:

→ try running your kernel 100-1000 times and take an average, as in the sketch below.
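
A minimal sketch of that measurement pattern (the dummy kernel is just a stand-in for whatever operation is being benchmarked):

// Warm up first, then average the time over many launches.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data, int n)   // stand-in for the real workload
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    // Warm-up launches so the clocks have ramped up before timing starts.
    for (int i = 0; i < 10; ++i)
        dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    // Average over many launches to get above the measurement noise.
    const int reps = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average per launch: %.2f us\n", 1000.0f * ms / reps);

    cudaFree(d);
    return 0;
}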

Yes, I have tried to plot runtime vs matrix size for both devices, and I got:
[two plots: runtime vs matrix size for the 1080 and the 2080]
The y-axis is time (ms), and the x-axis is the shape of the matrices (e.g. [128, 128] * [128, 128]).

Have you found a similar phenomenon? The devices I compared are the 2080 and the 1080, not the 2080 and the 1080 Ti. Since some neural networks don’t use large matrix-matrix multiplications (like ImageNet ResNet-18), I want to know whether it’s worth replacing the 1080 with the 2080. And it seems the 2080 isn’t faster than the 1080 for a single convolution layer (I only tested the cuDNN function):

image size [1, 8, x, x]
filter size [32, 8, 8, 8]
pad [0, 0]
convstride[1, 1]
[plot: convolution runtime for both devices vs. image size x]

@Jimmy Pettersson :
I have tried that: I ran cublasSgemm(…) more than 50 times and calculated the average time. The result is that the 2080 performs worse than the 1080 until the matrix size is larger than [1024, 1024].

For the cuDNN function cudnnConvolutionForward(…), the 2080 is sometimes slower than the 1080 (e.g. image size [1, 8, 512, 512], filter size [32, 8, 8, 8], pad [0, 0], convstride [1, 1]).
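
For reference, a minimal sketch of how such a single cudnnConvolutionForward call can be set up and timed with those dimensions (error checking and cleanup omitted; the algorithm chosen below is just one example, not necessarily the one behind the numbers above):

// Set up and time one cudnnConvolutionForward call:
// input [1, 8, 512, 512], filter [32, 8, 8, 8], pad [0, 0], stride [1, 1], FP32.
#include <cstdio>
#include <cudnn.h>
#include <cuda_runtime.h>

int main()
{
    cudnnHandle_t cudnn;
    cudnnCreate(&cudnn);

    cudnnTensorDescriptor_t xDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               1, 8, 512, 512);

    cudnnFilterDescriptor_t wDesc;
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                               32, 8, 8, 8);

    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateConvolutionDescriptor(&convDesc);
    cudnnSetConvolution2dDescriptor(convDesc, 0, 0, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    // Let cuDNN compute the output dimensions.
    int n, c, h, w;
    cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc, &n, &c, &h, &w);

    cudnnTensorDescriptor_t yDesc;
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

    // Device buffers (contents irrelevant for a timing test).
    float *x, *filt, *y;
    cudaMalloc(&x,    (size_t)1 * 8 * 512 * 512 * sizeof(float));
    cudaMalloc(&filt, (size_t)32 * 8 * 8 * 8 * sizeof(float));
    cudaMalloc(&y,    (size_t)n * c * h * w * sizeof(float));

    // One example algorithm choice; a real test would try several.
    cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
    size_t wsSize = 0;
    cudnnGetConvolutionForwardWorkspaceSize(cudnn, xDesc, wDesc, convDesc, yDesc,
                                            algo, &wsSize);
    void *workspace = nullptr;
    if (wsSize > 0) cudaMalloc(&workspace, wsSize);

    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call, then one timed call (average over many calls in a real test).
    cudnnConvolutionForward(cudnn, &alpha, xDesc, x, wDesc, filt, convDesc,
                            algo, workspace, wsSize, &beta, yDesc, y);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudnnConvolutionForward(cudnn, &alpha, xDesc, x, wDesc, filt, convDesc,
                            algo, workspace, wsSize, &beta, yDesc, y);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("conv forward: %.3f ms, output dims %d x %d x %d x %d\n", ms, n, c, h, w);
    return 0;
}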

The FP16 performance of Turing’s tensor cores seems to be half of Volta’s. That is another reason the yellow bars in your last graph are comparatively high (worse than the 1080 cards).

Here’s the related thread talking about the performance gap
https://devtalk.nvidia.com/default/topic/1042948/cuda-programming-and-performance/2080ti-vs-titan-v/

Nice graphs!

I note that for the SGEMM the 2080 is faster already at 512x512 matrix size? (FP16 tensor core).

Looking at the larger matrix sizes your numbers make sense:

The gain for 6K*6K is around 61ms → 45ms ~ 1.3x speedup, which is similar to the compute and bandwidth increases:

1080 VS 2080
352 GB/s VS 448 GB/s
9216 GFLOPS VS 12541 GFLOPS

So the numbers don’t hold for the smaller cases, regardless of whether the tensor cores or FP16 are used. I wonder if the 2080 has trouble hiding latencies for smaller datasets, OR if this is a GDDR6 issue. If so, it might be fixed in future driver releases.

The performance artifacts people reported when comparing the GTX 1080 (GDDR5) with the GTX 1080 Ti (GDDR5X) suggest that the root cause may be increased latencies in the DRAM which cannot be completely overcome by the latency tolerance designed into the GPU. So this would make it not strictly an either-or causation, but rather the result of a combination of design compromises.

As far as I understand, GDDR6 is just a refinement of GDDR5X, so the fundamental trade-offs (presumably favorable bandwidth vs less favorable latency) have not changed. Another way to look at it is that we have reached the “coffin corner” of classical GDDR design. Since the increased bandwidth of GDDR6 benefits the majority of GPU use cases, and it remains relatively cheap compared to next-generation memory technology like HBM2, I suspect it will be with us for quite some time, especially in the consumer space.

Moore’s Law is basically dead for processors, DRAM, and mass storage, so I wouldn’t expect any dramatic improvements from here on out. We have entered the world of incremental refinements, like many other mature industries.

As a consequence, users should not simply treat a new GPU architecture as a “rising tide that lifts all boats”, but rather as a solution benefiting specific application profiles. Whether Pascal, Volta, or Turing is the most appropriate solution is best determined by benchmarking one’s specific use case(s), then making decisions based on the outcome.

Yes, it seems the 2080 isn’t better than the 1080 in all cases. But I have found that some of the problems may come from cuDNN rather than from the 2080 itself.

In the graphs I posted, the 2080 performs poorly on the convolution layer. I used nvprof to check which kernels run on the GPU, and found that when I specify the input as:
image size [1, 8, 512, 512]
filter size [32, 8, 3, 3]
pad [1, 1]
convstride[1, 1]
The kernels that run are:
1080 FP32: maxwell_scudnn_128x32_relu_small_nn 334.86us
2080 FP32: volta_scudnn_128x32_relu_small_nn_v1 237.28us
2080 FP16 tensor core: turing_fp16_s1688cudnn_fp16_256x64_ldg8_relu_f2f_exp_small_nhwc_tn_v1 416.75us

The 2080 FP32 kernel is quicker than the 1080 FP32 kernel in this case, but the 2080 FP16 tensor core kernel is slower. It’s interesting that the 1080 and 2080 FP32 kernels have 128x32 in their names, while the tensor core kernel has 256x64.

I guess NVIDIA should do some optimization work in cuDNN.

@njuffa, the latency argument makes a lot of sense.

I was trying very hard to find out why 2080Ti tensor cores were half as fast as Titan V tensor cores.

The reason is that they can only do FP32 accumulate at half speed. The Titan V tensor cores, and in fact the Quadro RTX tensor cores (!!), run at full speed.

So they did gimp the tensor cores for the consumer models of RTX.
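
For reference, “FP32 accumulate” here is the mode selected by the computeType argument of cublasGemmEx. An illustrative call with FP16 inputs and FP32 accumulation (a sketch only, not the benchmark from the linked thread) looks like this:

// One GEMM with FP16 inputs and FP32 accumulation (computeType CUDA_R_32F).
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main()
{
    const int N = 4096;
    const float alpha = 1.0f, beta = 0.0f;  // FP32 scalars because computeType is FP32

    __half *dA, *dB;
    float *dC;
    cudaMalloc(&dA, (size_t)N * N * sizeof(__half));
    cudaMalloc(&dB, (size_t)N * N * sizeof(__half));
    cudaMalloc(&dC, (size_t)N * N * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // allow tensor core kernels

    // FP16 A and B, FP32 C, FP32 accumulation: the half-rate path on consumer Turing.
    cublasStatus_t stat = cublasGemmEx(
        handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
        &alpha, dA, CUDA_R_16F, N,
                dB, CUDA_R_16F, N,
        &beta,  dC, CUDA_R_32F, N,
        CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    cudaDeviceSynchronize();
    printf("cublasGemmEx status: %d\n", (int)stat);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

Requesting CUDA_R_16F as the computeType (with an FP16 C matrix and __half alpha/beta) selects FP16 accumulation instead, which, according to the whitepaper, is the full-rate mode on the GeForce cards.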


tera mentioned it here: https://devtalk.nvidia.com/default/topic/1042948/cuda-programming-and-performance/2080ti-vs-titan-v/post/5292507/#5292507 but I didn’t notice. NVIDIA has documented this behaviour in their whitepaper.

Thank you! This document helps me a lot!