SGEMM performance of current Kepler GPUs?

I’m trying to find some SGEMM benchmark results for current Kepler GPUs and without the PCIe overhead, kind of like this one:

but I guess I don’t know the right keywords to Google. Any suggestions?

This may be of interest, not sure it has exactly what you’re looking for:

http://on-demand.gputechconf.com/gtc/2014/webinar/gtc-express-cuda6-performance-webinar.pdf

That’s precisely what I wanted. Thanks!

Another interesting data point is the newly optimized SGEMM in the cuBLAS library for Maxwell in CUDA 6.5. I’m getting >1500 GFLOPS on a 750 Ti (N=1280). That’s 1/5 the price for half the performance of a 780 Ti Kepler (and substantially less power).
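For anyone who wants to reproduce this kind of number, here’s a minimal sketch of how such a measurement is typically set up with the cuBLAS API. The matrix size, the 100-launch loop, and the event-based timing follow what’s described later in this thread; error checking is omitted and the inputs are left uninitialized, since only the timing matters here:

```cpp
// Sketch: timing cublasSgemm with CUDA events (cuBLAS >= 6.5).
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int N = 4096, launches = 100;

    // Device buffers; contents are irrelevant for a timing run.
    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * N * N);
    cudaMalloc(&B, sizeof(float) * N * N);
    cudaMalloc(&C, sizeof(float) * N * N);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Queue up 100 back-to-back SGEMMs between the two events.
    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, A, N, B, N, &beta, C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // 2*N^3 flops per N x N matrix product.
    double gflops = launches * 2.0 * N * N * N / (ms * 1e6);
    printf("%.0f GFLOPS\n", gflops);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```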

That is interesting. My guess is that the performance gap will widen though, if you either increase N or do several matrix multiplications at the same time?

Also, if you price the whole system: 1 CPU + 2 GPUs, the price ratio will be about 2:1 (more if you don’t need server-grade RAM, less if you insist on Xeon CPUs)

… I’d be curious to see some test results though.

Make that 1600 Gflops at N=4096 and that’s with queuing up 100 of these kernels.

   N: GFLOPS
 512: 1200
 768: 1250
 896: 1370
1024: 1458
1152: 1502
1280: 1587
4096: 1600

The new kernel is entirely compute bound, provided you feed it enough data that the inner loop runs long enough to wash out the initialization and finalization costs. The inner loop is 90% FFMA instructions, almost none of which have any execution dependency (not even register bank conflicts). So it basically runs at >90% of the theoretical flops (since it can dual-issue the memory loads/stores alongside the FFMAs). Kepler, by contrast, can really only make effective use of 2/3 of its CUDA cores, so a lot of the chip is wasted. Maxwell has 128 cores per SM versus Kepler’s 192 for very good reason.

If you’re going to be using the cublasXt API, I can have a peek at that SASS code to see if there’s Maxwell code for that yet…

Thanks!

Wikipedia says the peak SP performance for GTX 750 Ti is 1306 GFLOPS, which is lower than your actual performance.

Do you use 2 N^3 as your definition of the number of operations in a matrix product?

Is your card overclocked?

What do you use as your theoretical limit estimate?

I’m using the default EVGA clocks, which are overclocked out of the box. These scores are from running at a boost of 1320 MHz. The newest NVIDIA drivers seem to ignore factory settings and dynamically run the card based on their own metrics.

The given number in that table is calculated from the 1020 MHz base clock:

(flops per ffma) * (cuda cores) * (GHz) = Gflops

2 * 640 * 1.020 = 1306

So with my clock:

2 * 640 * 1.320 = 1690

So at N=4096 I’ve seen 1605 GFLOPS, which is 95% efficient. And it can stay at that boost clock for over a minute while calculating these matrices. I haven’t tested it for longer, but I wouldn’t be surprised if these newest drivers let it stay at those levels indefinitely. The temp remains steady at 55 C (and the fan remains inaudible).

Oh, and yes, I’m calculating actual flops as 2*N^3 and using cudaEventRecord before and after 100 kernel calls, albeit all using the same arrays, but I don’t think that would affect anything. Looking at the Nsight numbers, this kernel isn’t remotely starved of device memory bandwidth.

Thanks for the clarification!

I’m wondering, can you install GTX 750 Ti’s right next to each other?

It’s dual-width – what if these cards are adjacent to each other with no additional spacing? Will the airflow be obstructed, or is the card’s primary intake on the shortest side, opposite the HDMI port?

I have the dual-fan version of the card (which is probably overkill). Air is pulled in by the fans onto the heatsink. The card is so low power it’s hard to imagine slightly obstructing airflow would affect its ability to stay cool very much.

Oh and to be more specific about device memory usage: At N=4096 it’s using 32 GB/s of the available 82.4 GB/s of device bandwidth or 39%. So you could likely run two concurrent streams without suffering any overall slowdown.

Another motivation for getting Maxwells is that NVIDIA will likely soon be releasing newer versions of these cards with substantially more SMs on board. You’ll feel substantially less guilty about upgrading if you’re only spending $140 apiece now. Plus you get to write and optimize your code for the newest architecture, which is why I’m using this card for development.

Actually, for each stream you add, total bandwidth probably wouldn’t be much affected, as each kernel would only be pulling data in proportion to the compute resources it has allocated. So never mind that comment about only two streams.

Thanks for the advice. Rumor websites mention Q4 of this year, and also say that GM1xx will be dropped. Perhaps there are problems with it? Or maybe they found a way to improve the architecture.

Latest rumors I’m seeing are that the GM204 will be ready in the Oct/Nov time frame. It should basically be a GM107 chip with 3-4 times as many cores and maybe some additional L2 cache. It will be a 28 nm chip (same as Kepler). It looks like NVIDIA will be skipping the 20 nm process and going right to 16 nm in 2015. With those specs it should hit 4.8 - 6.4 Gflops in SGEMM.


Sorry, Tflops not Gflops.