Speed Up Calculation

Hi Guys!

How do I calculate speedup with CUDA? (I’m using a GeForce GTX 680 card with 1536 CUDA cores.)

In the old days, when only normal processors existed (no CUDA cores), the relation was:

Speed Up = (time of the best sequential algorithm to solve problem X) / (time for p processors to solve problem X in parallel)

So, if a problem was solved sequentially in 100 seconds, and the same problem was solved with 2 processors in 50 seconds, the speedup is 2. When the speedup equals the number of processors used, it is theoretically optimal (never the case in real life, because of many factors).

Speed Up with 2 processors = 100/50 = 2
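In practice you just measure the two times. Here is a minimal sketch of how that is commonly done with CUDA; the saxpy kernel, problem size, and launch configuration are placeholders rather than anyone's real workload (CPU side timed with a wall clock, GPU side with CUDA events):

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Placeholder kernel standing in for "problem X": y = a*x + y.
__global__ void saxpy(float a, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 24;
    float *hx = new float[n], *hy = new float[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    // Time the sequential reference with a wall-clock timer.
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; ++i) hy[i] = 2.0f * hx[i] + hy[i];
    auto t1 = std::chrono::high_resolution_clock::now();
    double cpuMs = std::chrono::duration<double, std::milli>(t1 - t0).count();

    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    // Time the kernel with CUDA events (include the memcpys in the timed
    // region instead if transfer time matters for your comparison).
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    saxpy<<<(n + 255) / 256, 256>>>(2.0f, dx, dy, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);

    printf("CPU: %.2f ms, GPU: %.2f ms, speedup = %.1fx\n",
           cpuMs, gpuMs, cpuMs / gpuMs);

    cudaFree(dx); cudaFree(dy);
    delete[] hx; delete[] hy;
    return 0;
}
```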

Now, with 1536 CUDA cores available on the GTX 680, my old professor expects a speedup of 1536x. The speedup for my algorithm is only 30x (a waste of resources in his eyes).

What would be your answer to an old supercomputer professor? Is speedup really just (sequential time / parallel time)?

Thanks in advance!

For one, there is a difference in clock speed between those cores… more cores, reduced clock speed.

One metric of comparison which is commonly used is:

http://en.wikipedia.org/wiki/FLOPS

So, for example, the GTX 680 does about 2.5 teraflops single precision, while the GTX 780 has about 4.0 teraflops.

http://lanoc.org/review/video-cards/6473-nvidia-gtx-780
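For reference, the usual back-of-envelope behind those teraflop figures counts one fused multiply-add (FMA) as 2 FLOPs per CUDA core per clock:

Peak SP FLOPS = CUDA cores × core clock × 2

e.g. 1536 cores × 1.006 GHz × 2 ≈ 3.09 Tflops for a GTX 680 at its base clock. Quoted figures vary because some sources use the base clock and others the boost clock.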

Hello,

30 times is still something. What would have taken a month now takes only a day. So far the only physics application I know of with a 1000x speedup is the Monte Carlo algorithm for spins. Also, you could get a 2x speedup or more if you can change from doubles to floats.
Even in the old times the speedup did not double, because of the communication between CPUs. It was considered good scaling if you got 1.5x when doubling the number of CPUs.

On a very compute-bound DP algorithm, with sufficient data to keep the GPU busy and an overclocked GTX Titan, I have achieved 58x over an 8-threaded multicore OpenMP-based C code. If I use float instead of double computations for a specific sincos calculation in a kernel (the loss of accuracy in the results was very minimal), I have gotten up to a 72x speedup. As pasoleatis mentioned, a 30x speedup is pretty decent.
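To illustrate the kind of substitution meant here, a minimal sketch with a made-up kernel (sincos() and sincosf() are CUDA's double- and single-precision device functions):

```cuda
// Two versions of a hypothetical kernel; only the precision differs.
__global__ void phaseDP(const double* phase, double* s, double* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        sincos(phase[i], &s[i], &c[i]);   // double precision: 1/24 rate on a GTX 680
}

__global__ void phaseSP(const float* phase, float* s, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        sincosf(phase[i], &s[i], &c[i]);  // single precision: much faster, small accuracy loss
}
```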

A variation of the same algorithm (with less data, so the GPU is not kept completely busy), which I just submitted in a conference paper, has a 46x speedup over a single-core version, or 6x over a 12-core multicore implementation. You also have to keep in mind that cores on a modern CPU run roughly 3x faster than GPU cores, give or take (e.g. 3.4 GHz for an i7-4930K vs. 0.9 GHz for a Quadro K6000).

Also, if you’re running a lot of DP arithmetic on a GTX 680, it’s going to be painfully slow, because that card does DP arithmetic at 1/24th the speed of SP arithmetic. In that case, you should be benchmarking on a GTX Titan or a Tesla C2070/K20/K40 to get the benefit of the full DP capabilities of those cards (i.e. 1/2 or 1/3 of the SP FLOPS).

If your problem is memory bound, then the maximum speedup you can achieve is the ratio of the memory bandwidth of the GPU to that of the CPU. This is typically on the order of 10x to 20x. All of this can be found in the data sheets for your mainboard/CPU and for your particular GPU.

If your problem is entirely compute bound, then your maximum speedup can be as much as the ratio of the peak GFLOPS of the GPU to the peak GFLOPS of the CPU. These numbers can be looked up in data sheets and on web pages.

In reality the speedup is somewhere in between, and sometimes even lower, due to additional overhead that you incur (irregularities in memory access patterns intrinsic to the algorithm, or heavy branching in the algorithm, which causes divergence on the GPU). A sketch of how to estimate the bandwidth bound is below.

Forget about the core count ratio. Your professor is uninformed.
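For what it's worth, a minimal sketch of estimating the memory-bandwidth bound: the GPU side comes from the device properties, while the CPU-side number below is an assumed placeholder (dual-channel DDR3-1600) that you should replace with the figure from your own data sheet:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Theoretical GPU bandwidth: memory clock (kHz -> Hz) * 2 for DDR
    // * bus width in bytes. For a GTX 680 this gives ~192 GB/s.
    double gpuGBs = 2.0 * prop.memoryClockRate * 1e3
                  * (prop.memoryBusWidth / 8.0) / 1e9;

    // Placeholder CPU figure: dual-channel DDR3-1600 (assumed),
    // 2 channels * 8 bytes * 1600 MT/s = 25.6 GB/s. Use your data sheet.
    double cpuGBs = 25.6;

    printf("GPU memory bandwidth: %.1f GB/s\n", gpuGBs);
    printf("Memory-bound speedup limit: %.1fx\n", gpuGBs / cpuGBs);
    return 0;
}
```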

Thank you all for the info!

So, if I understand correctly, generally speaking, GFLOPS is the metric to look for, BUT it is not as simple as that; it depends deeply on the algorithm and the resources it requires (memory, cores, a good design for the architecture, etc.).

Also, checking for single-precision (SP) and double-precision (DP) operations is a factor to consider, depending on the board.

Thanks for sharing the speedups from your own work as well! (I felt empathy in my heart :P)

Honorable mention to “cbuchner1” for his explanation.

Cheers!

Huh, if the 770 is JUST an overclocked 680, then why is it listed at 3.2 Tflops everywhere I look?

https://www.techpowerup.com/gpudb/1856/geforce-gtx-770.html

I was HOPING to be able to figure out the performance in Tflops, since I have it overclocked to a 1280 MHz core clock with the VRAM running at 7.9 GHz effective.

Also, I feel I should mention that my 770 is special - not kidding. It has the all-metal Titan cooler! Again, from what I have read, only about 500 to 1000 of those were sold in the USA!

Even the AnandTech article mentions this in their overclocking section, saying that their numbers may not be representative, since they have an engineering sample, and that “770s will not be sold with the Titan cooler”, to paraphrase.

Ahh, here we go: apparently the 680 at stock speeds is listed by TechPowerUp at 3.09 Tflops.

Still wanna know what my 770 at 1280 is doing!
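(Back-of-envelope, using the same peak formula as above and assuming the card actually sustains the 1280 MHz core clock: 1536 cores × 1.28 GHz × 2 ≈ 3.93 Tflops peak single precision.)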

HAHAHA!

I found an aftermarket card that is ALMOST at the level mine is!

So I am peaking just a tad over 3.7 Tflops! Not to mention that JUICY 254 GB/sec of VRAM bandwidth I have!
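(That bandwidth figure roughly checks out: 7.9 GHz effective × 256-bit bus / 8 bits per byte ≈ 253 GB/sec.)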

So seriously, at 1080p I really have no need to upgrade. I WILL be waiting to see where both Pascal and Polaris land first. If I were to bet, my money would probably be on Nvidia though!