Speed Up Calculation

Hi Guys!

How do I calculate speed up with CUDA? (I’m using a GeForce GTX 680 card with 1536 CUDA cores.)

In the old days, when only normal processors existed (no CUDA cores), the relation was:

Speed Up=(Time of the best sequential algorithm to solve problem X)/(Time for p processors to solve problem X in parallel)

So, if a problem was solved sequentially in 100 seconds, and the same problem was solved with 2 processors in 50 seconds, the speed up is 2. When the speed up equals the number of processors used, it is theoretically optimal (never the case in real life, because of many factors).

Speed Up with 2 processors = 100/50 = 2
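
The same arithmetic as a small Python sketch. The function names `speedup` and `efficiency` are made up for illustration, and efficiency (speed up divided by processor count) is a common companion figure that the post above does not define itself.

```python
def speedup(t_sequential, t_parallel):
    """Speed up = best sequential time / parallel time."""
    if t_parallel <= 0:
        raise ValueError("parallel time must be positive")
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, processors):
    """Fraction of the ideal (linear) speed up actually achieved."""
    return speedup(t_sequential, t_parallel) / processors


# The example from the post: 100 s sequential, 50 s on 2 processors.
print(speedup(100, 50))        # 2.0
print(efficiency(100, 50, 2))  # 1.0
```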

Now, with 1536 CUDA cores available on the GTX 680, my old professor expects a speed up of 1536x. The speed up for my algorithm is only 30x (a waste of resources in his eyes).

What would be your answer for an old supercomputer professor? Is it only (sequential algorithm / parallel algorithm)?

Thanks in advance!

For one, there is a difference in clock speed for those cores: more cores, reduced clock speed.

One commonly used metric for comparison is peak floating-point throughput (FLOPS).

So, for example, the GTX 680 is rated at 2.5 teraflops in single precision, while the GTX 780 is rated at 4.0 teraflops.



30 times is still something. What would have taken a month now takes only a day. So far the only physics application which has a 1000x speed up is the Monte Carlo algorithm for spins. You could also get a speed up of 2x or more if you can change from doubles to floats.
Even in the old days the speed up was not double, because of communication between CPUs. It was considered good scaling if you got 1.5x for doubling the CPUs.

On a very compute-bound DP algorithm, with sufficient data to keep the GPU busy and an overclocked GTX Titan, I have achieved 58x over an 8-threaded multicore OpenMP-based C code. If I use floats instead of doubles for a specific sincos calculation in a kernel (the loss of accuracy in the results was minimal), I have gotten up to a 72x speedup. As pasoleatis mentioned, a 30x speedup is pretty decent.

A variation of the same algorithm (with less data to keep the GPU completely busy) that I just submitted as a conference paper has a 46x speedup over a single-core version, or 6x over a 12-core multi-core implementation. You also have to keep in mind that cores on a modern CPU run about 3x faster than GPU cores, give or take (e.g. 3.4 GHz for an i7-4930K vs. 0.9 GHz for a Quadro K6000).

Also, if you’re running a lot of DP arithmetic on a GTX 680, it’s going to be painfully slow, because that card does DP arithmetic at 1/24th the speed of SP arithmetic. In that case, you should be benchmarking on a GTX Titan or a Tesla C2070/K20/K40 to get the full DP capabilities of those cards (i.e. 1/2 or 1/3 of SP FLOPS).
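
To put a rough number on that, here is a minimal sketch. It assumes the 3.09 TFLOPS single-precision figure that is quoted for the GTX 680 elsewhere in this thread, and the helper name `dp_peak_tflops` is made up for illustration. The 1/3 case is a hypothetical card with the same SP peak, included only to show how much the ratio matters.

```python
def dp_peak_tflops(sp_peak_tflops, dp_to_sp_ratio):
    """Peak double-precision throughput from the SP peak and the DP:SP ratio."""
    return sp_peak_tflops * dp_to_sp_ratio


# Assumption: SP peak for the GTX 680 as quoted in this thread.
SP_PEAK_TFLOPS = 3.09

slow_dp = dp_peak_tflops(SP_PEAK_TFLOPS, 1 / 24)  # GTX 680 ratio from the post
fast_dp = dp_peak_tflops(SP_PEAK_TFLOPS, 1 / 3)   # hypothetical card, same SP peak

print(f"DP peak at 1/24 rate: {slow_dp:.3f} TFLOPS")
print(f"DP peak at 1/3 rate:  {fast_dp:.3f} TFLOPS")
```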

If your problem is memory bound, then the maximum speed-up you can achieve is the ratio of the GPU's memory bandwidth to the CPU's. This is typically on the order of 10x to 20x. All of this can be found in the data sheets for your mainboard/CPU and for your particular GPU.

If your problem is entirely compute bound, then your maximum speed-up can be as much as the ratio of the peak GFLOPS of the GPU to the peak GFLOPS of the CPU. These numbers can be looked up in data sheets and on web pages.

In reality the speed-up is somewhere in between, and sometimes even lower due to additional overhead that you incur (irregular memory access patterns intrinsic to the algorithm, or heavy branching in the algorithm, which causes divergence on the GPU).
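
The two limits above can be sketched as follows. All four hardware figures are hypothetical placeholders, not data-sheet values for any particular CPU or GPU, and `max_speedup` is a made-up helper name. Substitute the numbers for your own hardware.

```python
def max_speedup(gpu_bandwidth, cpu_bandwidth, gpu_flops, cpu_flops):
    """Return (memory_bound_limit, compute_bound_limit) as plain ratios.

    Bandwidths must share one unit (e.g. GB/s), and so must the FLOPS figures.
    """
    return gpu_bandwidth / cpu_bandwidth, gpu_flops / cpu_flops


# Hypothetical placeholder figures; look up the real ones in the data sheets.
GPU_BW_GBS = 192.0
CPU_BW_GBS = 16.0
GPU_GFLOPS = 3000.0
CPU_GFLOPS = 100.0

mem_limit, compute_limit = max_speedup(GPU_BW_GBS, CPU_BW_GBS,
                                       GPU_GFLOPS, CPU_GFLOPS)
print(f"memory-bound limit:  {mem_limit:.0f}x")      # 12x
print(f"compute-bound limit: {compute_limit:.0f}x")  # 30x
```

Neither limit involves the core count, and a real kernel usually lands below both once overhead and divergence are counted.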

Forget about the core count ratio. Your professor is uninformed.

Thank you all for the info!

So, if I understand correctly, generally speaking, GFLOPS is the figure to look at, BUT it is not as simple as that: it depends heavily on the algorithm and the resources it requires (memory, cores, a good design for the architecture, etc.).

Also, single-precision (SP) versus double-precision (DP) performance is a factor to consider, depending on the board.

Thanks for sharing the speed-ups from your own work as well! (I felt empathy in my heart :P)

Honorable mention to “cbuchner1” for his explanation


Huh, if the 770 is JUST an overclocked 680, then why is it listed as 3.2 Tflops everywhere I look?


I was HOPING to be able to figure out the performance in Tflops, since I have it overclocked to 1280 MHz on the core, with the VRAM running at 7.9 GHz effective.

Also I feel I should mention that my 770 is special - not kidding. It has the all metal Titan cooler! Again, from what I have read, there were only about 500 to 1000 of those sold in the USA!

Even the Anandtech article mentions this on their overclock section, saying that their number may not be a good representative as they have an engineering sample AND that “770s will not be sold with the Titan cooler” to paraphrase.

Ahh here we go, apparently the 680 at stock speeds is listed by TechPowerUp as 3.09 Tflops.
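
That figure can be reproduced from the usual peak-throughput estimate: cores times clock times 2, counting one fused multiply-add per core per cycle as two floating-point operations. A minimal sketch, assuming the GTX 680's published 1006 MHz base clock; the function name `peak_sp_tflops` is made up for illustration, and any other core count and clock can be passed in the same way.

```python
def peak_sp_tflops(cuda_cores, clock_mhz, flops_per_core_per_cycle=2):
    """Theoretical single-precision peak in TFLOPS.

    Assumes each core retires one fused multiply-add (2 FLOPs) per cycle.
    """
    return cuda_cores * clock_mhz * 1e6 * flops_per_core_per_cycle / 1e12


# GTX 680: 1536 CUDA cores; 1006 MHz base clock is assumed here.
print(f"{peak_sp_tflops(1536, 1006):.2f} TFLOPS")  # 3.09 TFLOPS
```

This is a theoretical ceiling only; it says nothing about what a given kernel will sustain.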

Still wanna know what my 770 at 1280 is doing!


I found an aftermarket card that is ALMOST at the level mine is!

So I am peaking just a tad over 3.7 Tflops! Not to mention that JUICY 254 GB/sec of VRAM bandwidth I have!

So seriously, at 1080p I really have no need to upgrade. I WILL be waiting to see where both Pascal and Polaris land first. If I were to bet, my money would probably be on Nvidia though!