Performance of GF10x GPU


I’m having trouble finding the DGEMM performance of GF104/106/108.

My question is simple: I get ~55% of peak performance on my CARMA devkit, whereas I get ~80% of peak on my Tesla C2050 or my GeForce GTX 480.

I understand that the GF10x architecture is a little different from GF100 (48 cores/SM vs 32 cores/SM, …), so I want to know whether this difference comes from the architecture or from the ARM-GPU configuration.

Can anyone tell me what DGEMM performance they get on GF104/106/108?


I get over 90% on GF108 in both cublasDgemm( 'N', 'N', … ) and cublasDgemm( 'N', 'T', … ) for matrices over 1024x1024.

More specifically, I get ~23.5 Gflop/s on a GT 440, which has 2 SMs and runs at 1.62 GHz. The 48 cores don’t matter for DGEMM, as all the action happens in the double-precision units, of which there are 4 per SM.
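For anyone following along, the arithmetic behind that "over 90%" works out as follows; this is a quick sketch assuming each Fermi DP unit retires one FMA (2 flops) per cycle at the shader clock, which is the usual way peak is counted:

```python
# Double-precision peak of a GT 440 (GF108), as described above:
# 2 SMs, 4 DP units per SM, one 2-flop FMA per unit per cycle.
sms = 2
dp_units_per_sm = 4
flops_per_fma = 2
shader_clock_ghz = 1.62

peak_gflops = sms * dp_units_per_sm * flops_per_fma * shader_clock_ghz
print(round(peak_gflops, 2))              # 25.92 Gflop/s DP peak
print(round(23.5 / peak_gflops, 3))       # ~0.907, i.e. the "over 90%" quoted
```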

Sorry, my mistake: I was talking about CGEMM, not DGEMM (or SGEMM; I use CGEMM because I’m working with complex numbers in my signal-processing apps).

I get pretty much the same performance as you on DGEMM.

I also get 55% in CGEMM on GT440.

More specifically, this is what I have on a Quadro 1000M (NVIDIA CARMA):

Should we expect 2.5x more performance on a GK107/117 GPU?

I also see some differences between my GF100 and GF108 when using streams to batch multiple kernels; I will share details ASAP.

I don’t know about GK107, but on GK104 (GTX680) I get up to ~1600 Gflop/s in CGEMM, which is 46% of peak. I get about the same number on GK110 (K20c). SGEMM, for comparison, is almost 2x faster on GK110.
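For reference, Gflop/s figures like the ones above are conventionally computed by counting a complex multiply-add as 8 real flops. A minimal sketch of that bookkeeping (the matrix size and timing below are made-up illustrative values, not measurements from this thread):

```python
def cgemm_gflops(m, n, k, seconds):
    """CGEMM rate in Gflop/s, counting a complex multiply-add as 8 real flops."""
    return 8.0 * m * n * k / seconds / 1e9

# A hypothetical 4096x4096x4096 CGEMM finishing in 0.34 s would land
# in the same ballpark as the GTX 680 number quoted above (~1600 Gflop/s).
print(round(cgemm_gflops(4096, 4096, 4096, 0.34)))
```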

Such a big difference between CGEMM and SGEMM? Register banking effect?
I see why you’d like to have a Kepler assembler…

Thanks for your numbers. I also get ~1600 Gflop/s on my K20c.

Tim, I think the reason might be more mundane: they possibly didn’t invest as much effort in rewriting CGEMM for sm_35 as they did with SGEMM.