my speedy SGEMM

I’ve tried implementing a matrix multiply that’s as fast as I could make it. On my 8600GT OC, it gets 38 Gflop/s vs 23 with CUBLAS. Could somebody who has an 8800GTX get some numbers?

I tried very hard to optimize it, and experimented with many techniques. I’m afraid, however, that further improvement is unlikely without direct access to cubin. Minor changes give rather chaotic swings in performance (up-down by 30-200%), even with ptxas run at -O0. Turning on ptxas optimizations usually hurts performance.

Nice, Alex!


I get 137.9 Gflop/s on 8800 GTX. CUBLAS runs at 120.

I get 171.68 Gflop/s, using an 8800 GTX Ultra.

Nice work!

wow! how can such a difference be explained?

See, something about the compiling/optimizing is just so chaotic.

I doubt he even recompiled, probably used your binary.

This is a nice example of how your excellent program scaled, without recompiling, across the 8-series family with different numbers of processors and different clocks.

The Ultra has faster shader, memory clocks and bandwidth than the GTX, so the 137 to 171 is just straight clock scaling.

172/138 ≈ 1.25, i.e. 25% faster, while the Ultra has only 6.5% faster shaders and 20% faster memory. Something else must be going on.

btw, the code gets recompiled automatically because the executable only has cm_10 ptx and sm_11 cubin embedded (it’s an artifact of my optimizing).

I’ve also got the Ultra.

When I don’t recompile the project (just launch the binary in the Release folder) I get 155 GFlops. Recompiling the Project in Release Mode gives me the same number.

What’s going on, why is my card slower?
I use it not only for the computation but it’s also my display adapter so it also renders my desktop. Might this impact performance?

Btw, nice work Alex!

So, this code is 65% faster than CUBLAS when run on the 8600GT OC, but only about 15% faster on the 8800 GTX. I wonder how it compares to CUBLAS on the 8800 GTX Ultra. Any info?
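For anyone checking these percentages, here is a minimal C sketch of the arithmetic. The function names are made up for illustration. The flop count of 2*m*n*k (one multiply and one add per inner-product term) is the usual convention for matrix multiply; I am assuming, not confirming, that the posted benchmark counts flops the same way.

```c
/* Gflop/s for an (m x k) * (k x n) matrix multiply that took `seconds`.
 * Assumes the conventional count of 2*m*n*k floating-point operations. */
double gemm_gflops(int m, int n, int k, double seconds)
{
    double flops = 2.0 * (double)m * (double)n * (double)k;
    return flops / (seconds * 1e9);
}

/* Percentage by which `ours` exceeds `baseline` (same units for both). */
double speedup_percent(double ours, double baseline)
{
    return (ours / baseline - 1.0) * 100.0;
}
```

With the figures quoted in this thread, speedup_percent(38.0, 23.0) comes out at roughly 65 and speedup_percent(137.9, 120.0) at roughly 15.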

Also, I am not sure that the timing used in this code is fair. I’d call cudaThreadSynchronize() after each kernel invocation, inside the loop. If you invoke the kernel again before the previous pass is over, it may go wrong. This could be the cause of the 171 vs 155 case.

Calling cudaThreadSynchronize() in CUDA 1.0 after each kernel call isn’t necessary. Subsequent kernel calls get serialized by the driver. So, it suffices to call cudaThreadSynchronize() once, after the timing loop.


nice code optimization but what you have coded is not a real SGEMM.
SGEMM performs C = alpha*A*B + beta*C.
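To make the formula concrete, here is a minimal reference sketch in plain C for the no-transpose case only. It is not the code from this thread and not how CUBLAS does it; the name sgemm_ref is hypothetical, and column-major storage with leading dimensions is assumed because that is the BLAS convention. It is only meant as something to check a fast kernel's results against.

```c
#include <stddef.h>

/* Reference SGEMM, no-transpose case only: C = alpha*A*B + beta*C.
 * Column-major storage: element (i,j) of A is A[i + j*lda].
 * A is m x k, B is k x n, C is m x n. */
void sgemm_ref(int m, int n, int k, float alpha,
               const float *A, int lda,
               const float *B, int ldb,
               float beta, float *C, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            /* Inner product of row i of A with column j of B. */
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + (size_t)p * lda] * B[p + (size_t)j * ldb];

            /* Scale the product and blend in the old value of C. */
            float *c = &C[i + (size_t)j * ldc];
            *c = alpha * acc + beta * (*c);
        }
    }
}
```

Calling it with alpha = 1 and beta = 0 reduces it to the plain product C = A*B.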

CUBLAS achieves 120Gflops in CUDA 1.0 for SGEMM and it will improve in the upcoming release.

This contradicts my observations with CUDA 1.0.

To be more specific, I took the bandwidth benchmark posted in another thread and inserted cutGetTimerValue() in the timing loop to measure the time taken by each iteration. The first ~20 iterations take ~13 us each, which is just the kernel invocation overhead; later iterations take ~2.9 ms each, which is the expected time and is the same as when using cudaThreadSynchronize().

There seems to be a misunderstanding. When you want to time each kernel run you have to make sure the kernel finished. You do that by calling cudaThreadSynchronize().

What I think Paulius is saying is that you can call, say, 10 kernels without cudaThreadSynchronize() and CUDA will queue the calls and execute them one by one in FIFO order (I guess). You don’t have to call cudaThreadSynchronize() between kernel invocations to make sure each one is executed properly. You just have to call it once, right before you take the time, to make sure everything has finished. That applies when you time several kernels per run.

So you are right: basically I think one can say before taking the time there should be a call to cudaThreadSynchronize(). I think it would actually make sense to implicitly invoke cudaThreadSynchronize() functionality when calling cutGetTimerValue(). So no one would get confused. A lot of people have measured wrong times because of the obscure asynchronous behavior of CUDA, including myself.
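The effect on measured times can be illustrated with a toy model in plain C. This is only a sketch under stated assumptions, not a description of what the driver really does: it assumes a fixed-depth FIFO launch queue, a constant launch overhead, and a constant kernel run time. The function name and parameters are hypothetical.

```c
#include <stdlib.h>

/* Toy model of asynchronous kernel launches (an assumption for illustration,
 * not the real driver logic). A launch returns after `overhead_ms` unless
 * `queue_depth` kernels are already pending, in which case the host first
 * waits for the oldest pending kernel to finish.
 * host_ms[i] receives the host-side time consumed by launch i.
 * Returns the time at which the GPU finishes all work, i.e. when a final
 * synchronize would return, or -1.0 on bad input or allocation failure. */
double model_launch_times(int launches, int queue_depth,
                          double overhead_ms, double kernel_ms,
                          double *host_ms)
{
    if (launches < 1 || queue_depth < 1)
        return -1.0;

    double *finish = malloc((size_t)launches * sizeof *finish);
    if (finish == NULL)
        return -1.0;

    double host_now = 0.0;  /* host clock */
    double gpu_free = 0.0;  /* when the GPU has drained everything queued */

    for (int i = 0; i < launches; ++i) {
        double start = host_now;

        /* Queue full: wait until kernel (i - queue_depth) has completed. */
        if (i >= queue_depth && finish[i - queue_depth] > host_now)
            host_now = finish[i - queue_depth];
        host_now += overhead_ms;

        /* Kernels run one after another, never before they are submitted. */
        double begin = host_now > gpu_free ? host_now : gpu_free;
        gpu_free = begin + kernel_ms;
        finish[i] = gpu_free;

        host_ms[i] = host_now - start;
    }

    free(finish);
    return gpu_free;
}
```

Feeding in the numbers measured above (30 launches, depth 20, 0.013 ms overhead, 2.9 ms per kernel), the model shows the first 20 launches returning after about 0.013 ms each and later ones taking about 2.9 ms. The host-side times add up to roughly 29 ms, while the modelled GPU work only ends at about 87 ms, which is the gap a missing synchronize before reading the timer would hide.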

OK, I’ve got it. They are queued. And in my case I get 10 kernel calls in a queue.

I thought Paulius meant that there is an implicit call to cudaThreadSynchronize() before each kernel invocation.


You’re right. A real sgemm includes alpha and beta, and supports various transpose modes. Alphas and betas are trivial, but the transpose modes will require more work. In a proper implementation, however, the extra features will have little effect on performance (especially when they’re not used).

I’ll work on expanding my function to be a full sgemm.

I have an innocent question about SGEMM. You mention 120 Gflop/s. Is this sustained, or is there a performance hit when the constants are changed?

Obviously there will be a drop in performance for a short time but what I’m interested in is aggregate computation speed (ie over many operations) and I’m trying to ascertain how often I could change the constants on-the-fly without a major penalty (cache flushes etc).


This is sustained on the card for any alpha and beta.

We ran a CUBLAS/sgemm test on an 8500GT and got about 5 Gflop/s, which is less than the 8 Gflop/s that Intel MKL provides on the CPU (3 GHz dual core).

We got this rather low-end card just to try out CUBLAS and see whether it was feasible to use it instead of, or in addition to, MKL BLAS in our application. From these tests it would not be fruitful. Can we expect that stepping up to something like an 8800 will provide significant gains? Is it probable that the gap between CUBLAS performance and MKL BLAS will widen in future CUBLAS releases and future NVIDIA card generations?

Our users are not graphics types, so we can’t expect our prospects to typically have 8800-level boards and beyond for some time. We are therefore still trying to evaluate whether it is too soon to supplement routines such as sgemm/FFT in our applications with CUDA equivalents.

Beau Paisley

The 8500GT has 16 processors clocked at 900 MHz and a 128-bit memory interface. I am impressed that it is achieving 5 Gflop/s… :-)

If you are looking for a good entry-level card (around $250), try the 8800GT. It has 112 processors clocked at 1.5 GHz and a 256-bit memory interface.
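As a rough way to compare the two cards, here is a small C sketch of the theoretical peak. The function name is hypothetical, and the figure of 2 flops per processor per clock (one multiply-add) is my assumption rather than something stated in this thread. Measured SGEMM rates will be well below these numbers.

```c
/* Theoretical peak in Gflop/s:
 * processors * clock in GHz * flops issued per processor per clock.
 * flops_per_clock = 2 assumes one multiply-add per clock (an assumption). */
double peak_gflops(int processors, double clock_ghz, int flops_per_clock)
{
    return (double)processors * clock_ghz * (double)flops_per_clock;
}
```

Under that assumption, peak_gflops(16, 0.9, 2) gives about 28.8 for the 8500GT and peak_gflops(112, 1.5, 2) gives about 336 for the 8800GT.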

Could you tell us what was the size of the matrices?

I am trying to run the code on a Quadro FX 5600 under Linux, and I am not sure how to compile it. I used the standard Makefile that was provided for the matrixMul example, but the best performance I get is ~110 Gflop/s (with CUBLAS I get 120 Gflop/s). Can somebody help with this? Thanks,