Characterization of the Speedup on a GPGPU: 400x Speedup on a Molecular Dynamics Application

Hey Folks,

I have ported an open-source MD application called GEM to the GPU and have been able to achieve an almost 400-fold speedup on a GTX 285.
Now I really want to understand where such a huge speedup is coming from. The device has 240 cores, so I would have expected the speedup to be somewhere around 240x at most; how do I explain anything beyond that?

I have not used any special optimizations other than those mentioned in the Best Practices Guide: constant memory, shared memory, fast_math, and coalesced accesses (the GTX 280 largely takes care of this).

The structure of the application is such that it maps very well onto the GPU, but I want to understand what is really happening “behind the scenes”.
The application is memory bound, if it helps.

Where should I start? What should I look at? Please provide some insight.

I would have thought that (expecting the speedup to track the core count) is a pretty bad assumption to make - it would imply, amongst other things, that a single CPU core is the same speed as a single GPU core, which clearly isn't the case.

I guess the first question is: are you really sure your measurements are correct? If both the GPU and CPU versions of your application are memory-bandwidth bound, then it would seem logical that the upper bound on the speedup between the two versions should be the ratio of memory bandwidth between the benchmark CPU and GPU. For the hardware I am familiar with, that number is closer to 10 times than 100 times. So if you are getting a 400x speedup, I would assume one of two things must be happening: your code is really compute bound (and your CPU version is single-threaded, very suboptimal, and running on a slow CPU), or your performance measurements are in error.
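To put rough numbers on that (these are spec-sheet peak figures, so treat them as assumptions rather than measurements): a GTX 285 has roughly 159 GB/s of theoretical memory bandwidth, while a dual-channel DDR2-800 desktop platform has around 12.8 GB/s, so a genuinely bandwidth-bound code could not be expected to speed up by much more than about 12x:

```cpp
// Back-of-the-envelope ceiling for a memory-bound speedup.
// The two bandwidth figures below are assumed peaks, not measurements.
#include <cstdio>

int main()
{
    const double gpu_bw_gbs = 159.0; // GTX 285 theoretical peak (GB/s)
    const double cpu_bw_gbs = 12.8;  // e.g. dual-channel DDR2-800 (GB/s)

    // If both versions are truly bandwidth bound, the speedup cannot
    // exceed the ratio of the two bandwidths.
    printf("bandwidth-bound speedup ceiling: ~%.1fx\n",
           gpu_bw_gbs / cpu_bw_gbs);
    return 0;
}
```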

Thanks, Avidday, for the input. I will probe further into the issue. All I can say at this point is that my performance results are correct.

And yes, the CPU version is single-threaded. Should I multithread it for the purposes of the comparison? Also, how do I determine whether the code is compute bound on the CPU, which would make such results possible?
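If I do multithread it, something like this minimal OpenMP sketch is what I would try (the function and array names here are placeholders, not the actual GEM code):

```cpp
// Hypothetical shape of the CPU hot loop with OpenMP added; the function
// and array names are placeholders, not the real GEM code.
#include <omp.h>

// Placeholder for the real per-vertex potential calculation.
static float compute_potential_at(const float *vertex)
{
    return vertex[0] + vertex[1] + vertex[2];
}

void compute_all(const float *vertices, float *potential, int n_vertices)
{
    // Each output element is independent, so a parallel-for spreads the
    // work across all CPU cores for a fairer CPU-vs-GPU comparison.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n_vertices; ++i)
        potential[i] = compute_potential_at(&vertices[3 * i]);
}
```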

Generally a speedup number that large means you are comparing against crappy CPU code. Does it use SSE? Cache blocking? How fast is it compared to other MD codes? I can’t find any reference to this code with Google…

Interesting… we are also looking at a similar speedup in our case… but that is due to BAD BAD BAD CPU CODING… That’s all…

To make good use of CPU power, one has to do the following:

  1. Avoid C++ operator overloading in the critical path. The code looks innocent but can silently eat performance.
  2. The most performance-sensitive parts need to be written in assembly.
  3. Ideally, one should target a particular micro-architecture for the BEST performance. But yeah… if one follows the rules below, one can get reasonably good performance on all architectures.
  4. Register blocking - Try to use only register operands in the sensitive code, so that the core does NOT stall accessing memory.
  5. SSE - Borrowing this one from eelsen above.
  6. Instruction scheduling - Reorder your code to make out-of-order execution easy. Keep RAW dependencies away from each other.
  7. Make sure that all the functional units in the CPU are used effectively. For example, an FP MUL and an FP ADD operation can go simultaneously on some Intel architectures…
  8. Cache blocking - Make sure data access is localized and the data already in cache is used effectively before proceeding to other regions of memory (see the sketch after this list).
  9. Prefetch - Use prefetch effectively. Note that recent Intel architectures fetch two cache lines (128 bytes) for every cache miss and prefetch. Make sure your prefetches don’t overlap with each other; otherwise it is just a waste of time.
  10. TLB blocking - Block for the DTLB and ITLB so that you don’t run into VA-to-PA translation latencies frequently. If you have optimized for localized cache access, this should be taken care of automatically. If you are following linked lists randomly, you might run into this problem.*
  11. Use multi-threading to make use of all the CPU cores in your system. OpenMP can be tricky. Try out Intel Parallel Studio and check whether your application really scales… Intel has recently announced Parallel Universe, where you can upload your application and they will give you a detailed report on multi-core scaling ability and an analysis of your application that you can view in Intel Parallel Studio. The Parallel Universe portal is FREE, but Intel Parallel Studio is commercial. FYI.
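As a concrete illustration of the cache-blocking point, here is a generic sketch (not GEM code, just a made-up pairwise loop; the tile size would need tuning for a particular cache):

```cpp
// Generic cache-blocked loop nest: process the j range in tiles so each
// tile of b[] stays resident in cache while every i iteration reuses it.
#include <algorithm>

void blocked_pairwise(const float *a, const float *b, float *out, int n, int m)
{
    const int BLOCK = 1024; // tune so BLOCK * sizeof(float) fits comfortably in L1

    for (int jb = 0; jb < m; jb += BLOCK) {
        const int jend = std::min(jb + BLOCK, m);
        for (int i = 0; i < n; ++i) {
            float acc = out[i];
            // b[jb..jend) is reused for every i in this tile, so after the
            // first pass it is served from cache instead of main memory.
            for (int j = jb; j < jend; ++j)
                acc += 1.0f / (a[i] + b[j] + 1.0f);
            out[i] = acc;
        }
    }
}
```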

Note that on CUDA, register blocking and cache blocking HAPPEN by the nature of the programming model (registers and shared memory are explicit), which is a BIG PLUS. It happens at the C level. On Intel, you need to venture into assembly for that kind of performance… So that’s why even badly written CUDA apps generate good performance.
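For contrast, here is roughly what the same blocking looks like in CUDA: the tile is staged explicitly in shared memory and per-thread values sit in registers, so the blocking is part of the programming model rather than something the hardware cache has to discover. Again a generic sketch, not the actual GEM kernel:

```cuda
// Generic shared-memory tiling in CUDA (not the actual GEM kernel).
// Launch with blockDim.x == TILE so the cooperative load covers the tile.
#define TILE 256

__global__ void pairwise_kernel(const float *a, const float *b,
                                float *out, int n, int m)
{
    __shared__ float tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float ai  = (i < n) ? a[i] : 0.0f;  // "register blocking": a[i] lives in a register
    float acc = 0.0f;

    for (int jb = 0; jb < m; jb += TILE) {
        // Cooperative, coalesced load of one tile of b[] into shared memory.
        if (jb + threadIdx.x < m)
            tile[threadIdx.x] = b[jb + threadIdx.x];
        __syncthreads();

        int jend = min(TILE, m - jb);
        for (int j = 0; j < jend; ++j)
            acc += 1.0f / (ai + tile[j] + 1.0f);  // tile[] is reused by every thread
        __syncthreads();
    }

    if (i < n)
        out[i] = acc;
}
```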

    * My own line of thinking…

Best Regards,
Sarnath

What would be the proper way to find out how much memory bandwidth my program is actually using, on both the CPU and the GPU? I am allocating around 250 MB of memory. For the GPU, cudaProf gives me a global memory bandwidth figure; is that the actual value the program is attaining?
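Would computing the effective bandwidth like this be the right approach? (A minimal sketch: the trivial copy kernel just stands in for my real kernel, and the byte count would have to reflect what the real kernel actually reads and writes.)

```cpp
// Effective bandwidth = (bytes read + bytes written) / kernel time.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main()
{
    const int n = 1 << 24;                       // ~64 MB per array
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    copy_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // One read plus one write of n floats.
    double bytes = 2.0 * n * sizeof(float);
    printf("effective bandwidth: %.1f GB/s\n", bytes / (ms / 1000.0) / 1e9);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```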