I'm having trouble finding DGEMM performance figures for GF104/106/108.
My question is simple: I get ~55% of peak performance on my CARMA devkit, whereas I get ~80% of peak on my Tesla C2050 or on my GeForce GTX 480.
I understand that the GF10x architecture is a little different from GF100 (48 cores/SM vs. 32 cores/SM, …), so I want to know whether this difference comes from the architecture or from the ARM-GPU configuration.
Can anyone tell me what DGEMM performance they get on GF104/106/108?
I get over 90% on GF108 in both cublasDgemm( 'N', 'N', … ) and cublasDgemm( 'N', 'T', … ) for matrices over 1024x1024.
More specifically, I get ~23.5 Gflop/s on a GT 440, which has 2 SMs and runs at 1.62 GHz. The 48 cores don't matter for DGEMM, as all the action happens in the double-precision units, and there are 4 of them per SM.
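For anyone who wants to redo the arithmetic on their own card, here is a rough sketch of how these percentages can be worked out. The function names are mine and are not part of cuBLAS. It assumes the usual 2*m*n*k flop count for a real GEMM, and it assumes each double-precision unit retires one fused multiply-add (2 flops) per clock, which is my assumption but is consistent with the numbers above.

```c
/* Back-of-envelope helpers; names are illustrative, not part of cuBLAS. */

/* Gflop/s for a real GEMM (A is m x k, B is k x n), counting one
   multiply and one add per inner-product term: 2*m*n*k flops.
   'seconds' is whatever you timed around the GEMM call. */
double gemm_gflops(double m, double n, double k, double seconds)
{
    return 2.0 * m * n * k / seconds / 1e9;
}

/* Peak double-precision Gflop/s.
   ASSUMPTION: each DP unit retires one fused multiply-add
   (2 flops) per clock cycle. */
double dp_peak_gflops(int sms, int dp_units_per_sm, double clock_ghz)
{
    return sms * dp_units_per_sm * 2.0 * clock_ghz;
}
```

With the GT 440 figures, dp_peak_gflops(2, 4, 1.62) comes out at 25.92 Gflop/s, so 23.5 Gflop/s is roughly 91% of peak under that assumption.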
I don't know about GK107, but on GK104 (GTX 680) I get up to ~1600 Gflop/s in CGEMM, which is 46% of peak. I get about the same number on GK110 (K20c). SGEMM, for comparison, is almost 2x faster on GK110.