Performance of GF10x GPU

Hello,

I’m having trouble finding DGEMM performance figures for the GF104/106/108.

My question is simple: I get ~55% of peak performance on my CARMA devkit, whereas I get ~80% of peak on a Tesla C2050 or a GeForce GTX 480.

I understand that the GF10x architecture is a little different from GF100 (48 cores/SM vs. 32 cores/SM, …), so I’d like to know whether this difference comes from the architecture or from the ARM-GPU configuration.

Can anyone tell me what DGEMM performance they get on GF104/106/108?

Thanks!

I get over 90% on GF108 in both cublasDgemm('N', 'N', …) and cublasDgemm('N', 'T', …) for matrices over 1024x1024.

More specifically, I get ~23.5 Gflop/s on a GT 440, which has 2 SMs and runs at 1.62 GHz. The 48 cores per SM don’t matter in DGEMM, as all the action happens in the double-precision units, of which there are only 4 per SM. That puts peak at 2 SMs × 4 DP units × 2 flops (FMA) × 1.62 GHz = 25.92 Gflop/s, so ~23.5 Gflop/s is just over 90%.
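
For reference, a minimal timing sketch along these lines, using the cublas_v2 API (the matrix size and iteration count below are arbitrary placeholders, not the exact settings used here):

[code]
// Minimal DGEMM timing sketch (cublas_v2 API); n and niter are arbitrary.
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int n = 2048, niter = 10;
    const double alpha = 1.0, beta = 0.0;
    const size_t bytes = (size_t)n * n * sizeof(double);

    double *A, *B, *C;
    cudaMalloc((void**)&A, bytes);
    cudaMalloc((void**)&B, bytes);
    cudaMalloc((void**)&C, bytes);
    cudaMemset(A, 0, bytes);
    cudaMemset(B, 0, bytes);
    cudaMemset(C, 0, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Warm-up call so the first launch doesn't pollute the timing.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < niter; ++i)
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    const double gflops = 2.0 * n * n * n * niter / (ms * 1e6); // 2*n^3 flops per DGEMM
    printf("DGEMM %dx%d: %.1f Gflop/s\n", n, n, gflops);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
[/code]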

Sorry, my mistake: I was talking about CGEMM, not DGEMM (or SGEMM; I use CGEMM because I’m working with complex numbers in my signal-processing apps).
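
For clarity, a plain CGEMM call with the cublas_v2 API looks roughly like the sketch below (the size is a placeholder). Note that the usual convention counts 8*m*n*k real flops per CGEMM, vs. 2*m*n*k for SGEMM/DGEMM, since one complex multiply-add maps to 4 real FMAs, so CGEMM has the same arithmetic peak as SGEMM.

[code]
// CGEMM call sketch (cublas_v2 API); n is a placeholder, A/B/C are assumed to
// be device pointers. Rate = 8*m*n*k / time, since a complex multiply-add is
// 8 real flops (4 FMAs).
#include <cublas_v2.h>
#include <cuComplex.h>

void cgemm_nn(cublasHandle_t handle, int n,
              const cuComplex *A, const cuComplex *B, cuComplex *C)
{
    const cuComplex alpha = make_cuComplex(1.0f, 0.0f);
    const cuComplex beta  = make_cuComplex(0.0f, 0.0f);
    cublasCgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
}
[/code]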

I get pretty much the same performance as you on DGEMM.

I also get 55% in CGEMM on GT440.

More specifically, this is what I have on a Quadro 1000M (NVIDIA CARMA):

Should we expect 2.5x more performance on a GK107/117 GPU?

I also see some differences between my GF100 and GF108 when using streams to batch multiple kernels; I will share them ASAP.
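
For reference, the kind of streamed batching I mean looks roughly like the sketch below; the stream count, sizes, and buffer layout are placeholders, not my actual setup:

[code]
// Sketch: issuing several independent CGEMMs on separate streams so that small
// kernels can overlap. NSTREAMS, n, and the A/B/C buffers (one set per stream,
// already on the device) are placeholders.
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cuComplex.h>

#define NSTREAMS 4

void streamed_cgemms(cublasHandle_t handle, int n,
                     cuComplex *A[NSTREAMS], cuComplex *B[NSTREAMS], cuComplex *C[NSTREAMS])
{
    const cuComplex alpha = make_cuComplex(1.0f, 0.0f);
    const cuComplex beta  = make_cuComplex(0.0f, 0.0f);
    cudaStream_t streams[NSTREAMS];

    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < NSTREAMS; ++i) {
        cublasSetStream(handle, streams[i]);   // route the next cuBLAS call to stream i
        cublasCgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A[i], n, B[i], n, &beta, C[i], n);
    }

    cudaDeviceSynchronize();                   // wait for all streams to finish

    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamDestroy(streams[i]);
}
[/code]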

I don’t know about GK107, but on GK104 (GTX680) I get up to ~1600 Gflop/s in CGEMM, which is 46% of peak. I get about the same number on GK110 (K20c). SGEMM, for comparison, is almost 2x faster on GK110.

Such a big difference between CGEMM and SGEMM? Register banking effect?
I see why you’d like to have a Kepler assembler…

Thanks for your numbers. I also get 1600 Gflop/s on my K20c:
[url]http://jf.degurse.free.fr/images/benchmarks/BLAS/gflops_GPU_K20.png[/url]

Tim, I think the reason might be more mundane: they possibly didn’t invest as much effort in rewriting CGEMM for sm35 as they did in SGEMM.