Characterization of the Speedup on a GPGPU: 400x Speedup on a Molecular Dynamics Application

Hey Folks,

I have ported an open-source MD application called GEM to the GPU and have been able to achieve an almost 400-fold speedup on a GTX 285.
Now I really want to understand where such a huge speedup is coming from. The device has 240 cores, so I would have expected the speedup to be somewhere around 240x at most; how do I explain anything beyond that?

I have not used any special optimizations other than those mentioned in the Best Practices Guide: constant memory, shared memory, fast_math, and coalesced accesses (the GTX 280 largely takes care of this).

The structure of the application is such that it maps very well onto the GPU, but I want to understand what is really happening “behind the scenes”.
The application is memory bound, if it helps.

Where should I start? What should I look at? Please provide some insight.

I would have thought that (expecting the speedup to track the core count) is a pretty bad assumption to make - it would imply, amongst other things, that a single CPU core is the same speed as a single GPU core, which clearly isn't the case.

I guess the first question is: are you really sure your measurements are correct? If both the GPU and CPU versions of your application are memory-bandwidth bound, then it would seem logical that the upper bound on the speedup between the two versions should be the ratio of memory bandwidth between the benchmark CPU and GPU. For the hardware I am familiar with, that number is closer to 10 times than 100 times. So if you are getting a 400x speedup, I would assume one of two things must be happening: your code is really compute bound (and your CPU version is single-threaded, very suboptimal, and running on a slow CPU), or your performance measurements are in error.
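To put rough numbers on that (these are spec-sheet peak figures, so treat them as assumptions rather than measurements): a GTX 285 has roughly 159 GB/s of theoretical memory bandwidth, while a dual-channel DDR2-800 desktop platform has around 12.8 GB/s, so a genuinely bandwidth-bound code could not be expected to speed up by much more than about 12x:

```cpp
// Back-of-the-envelope ceiling for a memory-bound speedup.
// The two bandwidth figures below are assumed peaks, not measurements.
#include <cstdio>

int main()
{
    const double gpu_bw_gbs = 159.0; // GTX 285 theoretical peak (GB/s)
    const double cpu_bw_gbs = 12.8;  // e.g. dual-channel DDR2-800 (GB/s)

    // If both versions are truly bandwidth bound, the speedup cannot
    // exceed the ratio of the two bandwidths.
    printf("bandwidth-bound speedup ceiling: ~%.1fx\n",
           gpu_bw_gbs / cpu_bw_gbs);
    return 0;
}
```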

Thanks, Avidday, for the input. I will probe further into the issue. All I can say at this point is that my performance results are correct.

And yes, the CPU version is single-threaded. Should I multithread it for the purposes of the comparison? Also, how do I determine whether the code is compute bound on the CPU, which would make such results possible?
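If I do multithread it, something like this minimal OpenMP sketch is what I would try (the function and array names here are placeholders, not the actual GEM code):

```cpp
// Hypothetical shape of the CPU hot loop with OpenMP added; the function
// and array names are placeholders, not the real GEM code.
#include <omp.h>

// Placeholder for the real per-vertex potential calculation.
static float compute_potential_at(const float *vertex)
{
    return vertex[0] + vertex[1] + vertex[2];
}

void compute_all(const float *vertices, float *potential, int n_vertices)
{
    // Each output element is independent, so a parallel-for spreads the
    // work across all CPU cores for a fairer CPU-vs-GPU comparison.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n_vertices; ++i)
        potential[i] = compute_potential_at(&vertices[3 * i]);
}
```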

Generally a speedup number that large means you are comparing against crappy CPU code. Does it use SSE? Cache blocking? How fast is it compared to other MD codes? I can’t find any reference to this code with Google…

Interesting… we are also looking at a similar speedup in our case… but that is due to BAD BAD BAD CPU CODING… That’s all…

To make good use of CPU power, one has to do the following:

  1. Avoid C++ operator overloading in the critical path. The code looks innocent but can silently eat performance.
  2. The most performance-sensitive parts need to be written in assembly.
  3. Ideally, one should target a particular micro-architecture for the BEST performance. But yeah… if one follows the rules below, one can get reasonably good performance on all architectures.
  4. Register blocking - Try to use only register operands in the sensitive code, so that the core does NOT stall accessing memory.
  5. SSE - Borrowing this one from eelsen above.
  6. Instruction scheduling - Reorder your code to make out-of-order execution easy. Keep RAW dependencies away from each other.
  7. Make sure that all the functional units in the CPU are used effectively. For example, an FP MUL and an FP ADD operation can go simultaneously on some Intel architectures…
  8. Cache blocking - Make sure data access is localized and the data already in cache is used effectively before proceeding to other regions of memory (see the sketch after this list).
  9. Prefetch - Use prefetch effectively. Note that recent Intel architectures fetch two cache lines (128 bytes) for every cache miss and prefetch. Make sure your prefetches don’t overlap with each other; otherwise it is just a waste of time.
  10. TLB blocking - Block for the DTLB and ITLB so that you don’t run into VA-to-PA translation latencies frequently. If you have optimized for localized cache access, this should be taken care of automatically. If you are following linked lists randomly, you might run into this problem.*
  11. Use multi-threading to make use of all the CPU cores in your system. OpenMP can be tricky. Try out Intel Parallel Studio and check whether your application really scales… Intel has recently announced Parallel Universe, where you can upload your application and they will give you a detailed report on multi-core scaling ability and an analysis of your application that you can view in Intel Parallel Studio. The Parallel Universe portal is FREE, but Intel Parallel Studio is commercial. FYI.
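As a concrete illustration of the cache-blocking point, here is a generic sketch (not GEM code, just a made-up pairwise loop; the tile size would need tuning for a particular cache):

```cpp
// Generic cache-blocked loop nest: process the j range in tiles so each
// tile of b[] stays resident in cache while every i iteration reuses it.
#include <algorithm>

void blocked_pairwise(const float *a, const float *b, float *out, int n, int m)
{
    const int BLOCK = 1024; // tune so BLOCK * sizeof(float) fits comfortably in L1

    for (int jb = 0; jb < m; jb += BLOCK) {
        const int jend = std::min(jb + BLOCK, m);
        for (int i = 0; i < n; ++i) {
            float acc = out[i];
            // b[jb..jend) is reused for every i in this tile, so after the
            // first pass it is served from cache instead of main memory.
            for (int j = jb; j < jend; ++j)
                acc += 1.0f / (a[i] + b[j] + 1.0f);
            out[i] = acc;
        }
    }
}
```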

Note that on CUDA, register blocking and cache blocking HAPPEN by the nature of the programming model (registers and shared memory are explicit), which is a BIG PLUS. It happens at the C level. On Intel, you need to venture into assembly for that kind of performance… So that’s why even badly written CUDA apps generate good performance.
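For contrast, here is roughly what the same blocking looks like in CUDA: the tile is staged explicitly in shared memory and per-thread values sit in registers, so the blocking is part of the programming model rather than something the hardware cache has to discover. Again a generic sketch, not the actual GEM kernel:

```cuda
// Generic shared-memory tiling in CUDA (not the actual GEM kernel).
// Launch with blockDim.x == TILE so the cooperative load covers the tile.
#define TILE 256

__global__ void pairwise_kernel(const float *a, const float *b,
                                float *out, int n, int m)
{
    __shared__ float tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float ai  = (i < n) ? a[i] : 0.0f;  // "register blocking": a[i] lives in a register
    float acc = 0.0f;

    for (int jb = 0; jb < m; jb += TILE) {
        // Cooperative, coalesced load of one tile of b[] into shared memory.
        if (jb + threadIdx.x < m)
            tile[threadIdx.x] = b[jb + threadIdx.x];
        __syncthreads();

        int jend = min(TILE, m - jb);
        for (int j = 0; j < jend; ++j)
            acc += 1.0f / (ai + tile[j] + 1.0f);  // tile[] is reused by every thread
        __syncthreads();
    }

    if (i < n)
        out[i] = acc;
}
```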

    * My own line of thinking…

Best Regards,
Sarnath

What would be the proper way to find out how much memory bandwidth my program is actually using, on both the CPU and the GPU? I am allocating around 250 MB of memory. For the GPU, cudaProf gives me a global memory bandwidth figure; is that the actual value the program is attaining?
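Would computing the effective bandwidth like this be the right approach? (A minimal sketch: the trivial copy kernel just stands in for my real kernel, and the byte count would have to reflect what the real kernel actually reads and writes.)

```cpp
// Effective bandwidth = (bytes read + bytes written) / kernel time.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main()
{
    const int n = 1 << 24;                       // ~64 MB per array
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    copy_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // One read plus one write of n floats.
    double bytes = 2.0 * n * sizeof(float);
    printf("effective bandwidth: %.1f GB/s\n", bytes / (ms / 1000.0) / 1e9);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```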