SGEMM performance of current Kepler GPUs?

I’m trying to find some SGEMM benchmark results for current Kepler GPUs and without the PCIe overhead, kind of like this one:

but I guess I don’t know the right keywords to Google. Any suggestions?

This may be of interest, not sure it has exactly what you’re looking for:


That’s precisely what I wanted. Thanks!

Another interesting data point is with the newly optimized sgemm cublass lib for Maxwell in cuda 6.5. I’m getting >1500 Gflops on a 750 Ti (N=1280). That’s 1/5 the price for half the performance of a 780 Ti Kepler (and substantially less power).

That is interesting. My guess is that the performance gap will widen though, if you either increase N or do several matrix multiplications at the same time?

Also, if you price the whole system: 1 CPU + 2 GPUs, the price ratio will be about 2:1 (more if you don’t need server-grade RAM, less if you insist on Xeon CPUs)

… I’d be curious to see some test results though.

Make that 1600 Gflops at N=4096 and that’s with queuing up 100 of these kernels.

512: 1200
768: 1250
896: 1370
1024: 1458
1152: 1502
1280: 1587
4096: 1600

The new kernel is entirely compute bound, provided you feed it enough data so the inner loop gets enough time to run to wash out initialization and finalizing costs. The inner loop has 90% FFMA instructions of which almost 0% have any execution dependency (not even register bank conflicts). So it basically runs at >90% of the theoretical flops (as it can dual issue the memory loads/stores along with the ffmas). Kepler by contrast can really only make effective use of 2/3 of it’s cuda cores so a lot of the chip is wasted. Maxwell has 128 cores per SM verses Kepler’s 192 for very good reason.

If you’re going to be using the cublasXt API I can have a peek at that sass code to see if there’s Maxwell code for that yet…


Wikipedia says the peak SP performance for GTX 750 Ti is 1306 GFLOPS, which is lower than your actual performance.

Do you use 2 N^3 as your definition of the number of operations in a matrix product?

Is your card overclocked?

What do you use as your theoretical limit estimate?

I’m using the default EVGA clocks which are overclocked out of the box. These scores are from running at a boost of 1320 MHz. It seems the newest nVidia drivers seem to ignore factory settings and dynamically run the card based on its own metrics.

The given number in that table is calculated from the 1020 base clock:

(flops per ffma) * (cuda cores) * (GHz) = Gflops

2 * 640 * 1.020 = 1306

So with my clock:

2 * 640 * 1.320 = 1670

So at N=4069 I’ve seen 1605 GFlops which is 95% efficient. And it can stay at that boost clock for over a minute calculating these matrices. Haven’t tested it for longer but I wouldn’t be surprised if these newest drivers let it stay at those levels indefinitely. The temp remains steady at 55C (and the fan remains inaudible).

Oh and yes, I’m calculating actual flops with 2*N^3 and using cudaEventRecord before and after 100 kernels are called, although all using the same arrays, but I don’t think that would effect anything. Looking at the the Nsight numbers this kernel isn’t remotely starved of device memory bandwidth.

Thanks for the clarification!

I’m wondering, can you install GTX 750 Ti’s right next to each other?

It’s dual-width – what if these cards are adjacent to each other with no additional spacing? Will the airflow be obstructed, or is the card’s primary intake on the shortest side, opposite the HDMI port?

I have the dual fan version of the card (which is probably overkill). Intake is pulled with the fans onto the heatsink. The card is so low power it’s hard to imagine slightly obstructing airflow would effect it’s ability to stay cool very much.

Oh and to be more specific about device memory usage: At N=4096 it’s using 32 GB/s of the available 82.4 GB/s of device bandwidth or 39%. So you could likely run two concurrent streams without suffering any overall slowdown.

Another motivation for getting Maxwells is that nVidia will likely soon be releasing newer versions of these cards with substantially more SM’s on board. You’ll feel substantially less guilty about upgrading if only spending 140 a piece now. Plus you get to write and optimize your code for the newest architecture, which is why I’m using this for development.

Actually, for each stream you add, total bandwidth probably wouldn’t be much effected as each kernel would only be pulling the proportion of data for the compute resources it has allocated. So never mind that comment about only two streams.

Thanks for the advice. Rumor web sites mention Q4 of this year, and also that GM1xx will be dropped. Perhaps there are problems with it? Or maybe they found a way to improve the architecture.

Latest rumors I’m seeing are that the GM204 will be ready in Oct/Nov time frame. It should basically be a GM107 chip with 3-4 times as many cores and maybe some additional L2 cache. It will be a 28nm chip (same as Kepler). Looks like nVidia will be be skipping the 20nm process and going right to 16nm in 2015. With those specs it should hit 4.8 - 6.4 Gflops in sgemm.

Sorry, Tflops not Gflops.