I have ported an open source MD application called GEM to the GPU and have achieved an almost 400-fold speedup on a GTX 285.
Now I really want to understand where such a humongous speedup is coming from. The device has 240 cores, so naively the speedup should top out around 240 times, yet I'm seeing considerably more than that. How can I explain the extra factor?
I have not used any special optimizations other than those mentioned in the Best Practices Guide: constant memory, shared memory, fast math, and coalesced accesses (the GTX 280 largely takes care of coalescing in hardware).
The structure of the application is such that it maps very well onto the GPU but I want to understand what is really happening “behind the scenes”.
The application is memory bound, if it helps.
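For context, here is the back-of-the-envelope calculation that makes me suspicious of the number. It only compares peak memory bandwidths, which should bound a memory-bound kernel; the GPU figure is the published GTX 285 peak, and the CPU figure is my own assumption for a typical single-socket host, not a measurement:

```python
# Rough speedup ceiling for a memory-bound kernel, from peak
# memory bandwidth alone. Both figures are assumptions, not
# measurements on my actual hardware.

gpu_bandwidth_gbs = 159.0  # GTX 285 published peak, GB/s
cpu_bandwidth_gbs = 10.0   # assumed host memory bandwidth, GB/s

bandwidth_ratio = gpu_bandwidth_gbs / cpu_bandwidth_gbs
print(f"bandwidth-only speedup ceiling: ~{bandwidth_ratio:.0f}x")
```

By this reasoning a memory-bound code should gain far less than 240x, let alone 400x, which is why I suspect something else (CPU code quality, caching effects, or my timing methodology) is involved.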
Where should I start? What should I look at? Please provide some insight.