Optimising Lennard-Jones


I have implemented a very naive calculation of the Lennard-Jones potential for a molecular system, and would like your help in optimising it. I have read the “Fast N-body Simulation” article several times, but its approach is difficult to adopt here, primarily because I cannot store the position or force vectors in float4 structures: I need to do some rather involved matrix manipulation (a Cholesky square root via the MAGMA library, plus various multiply and linear-solve operations).

At present I am using the most inefficient calculation method possible - a single thread executing on a single multiprocessor, loading from and writing to global memory. Clearly I’m taking no advantage of parallelism in this implementation, which is something I need to resolve.

I have attached my kernel file to this message. If you are able to give me some advice based on your experience, I would be eternally grateful.
lj.cu (1.03 KB)

Just for information, I’m using CUDA 4.0, with a GTX 570 on my desktop machine and the latest Tesla on my university’s server farm.

Previously the LJ calculation accounted for 66% of my total GPU runtime. It now accounts for 0.98%. The steps I took were as follows:

  1. Removed the outer loop from the code, and ran one thread per atom i, calculating its interactions with all other atoms j != i.
  2. Stopped calculating the interactions i->j and j->i in the same loop iteration, and instead calculated each interaction explicitly. This boiled down to removing the lines where a force was subtracted from the d_F_j variable.
  3. Since the variable float* potential was contended amongst threads, I created an array of N floats, into which the potential due to each atom i could be stored. Once the LJ kernel had completed, I quickly summed up these individual potentials to give the final potential of the system.
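For anyone following along, the three steps above can be sketched roughly as below. This is not the attached lj.cu, only a minimal illustration under some assumptions: positions and forces are stored as flat arrays of 3*N floats (x, y, z interleaved), the potential is the standard 12-6 form with a single epsilon and sigma for all atoms, and there is no cutoff or periodic boundary. The names lj_per_atom, lj_total_potential, d_pos, d_F and d_pot are hypothetical, and error checking is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Step 1: one thread per atom i, looping over all j != i.
// Assumed layout: d_pos and d_F hold 3*N floats as x0,y0,z0,x1,y1,z1,...
__global__ void lj_per_atom(const float* d_pos, float* d_F, float* d_pot,
                            int N, float epsilon, float sigma)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;

    float xi = d_pos[3 * i], yi = d_pos[3 * i + 1], zi = d_pos[3 * i + 2];
    float fx = 0.0f, fy = 0.0f, fz = 0.0f, pot = 0.0f;
    float sigma2 = sigma * sigma;

    for (int j = 0; j < N; ++j) {
        if (j == i) continue;
        float dx = xi - d_pos[3 * j];
        float dy = yi - d_pos[3 * j + 1];
        float dz = zi - d_pos[3 * j + 2];
        float r2 = dx * dx + dy * dy + dz * dz;
        float sr2 = sigma2 / r2;
        float sr6 = sr2 * sr2 * sr2;
        float sr12 = sr6 * sr6;
        // V(r) = 4*eps*[(sigma/r)^12 - (sigma/r)^6]
        pot += 4.0f * epsilon * (sr12 - sr6);
        // Force on i from j: 24*eps*[2(sigma/r)^12 - (sigma/r)^6]/r^2 * (r_i - r_j)
        float scale = 24.0f * epsilon * (2.0f * sr12 - sr6) / r2;
        fx += scale * dx;
        fy += scale * dy;
        fz += scale * dz;
    }

    // Step 2: each thread writes only its own force entry; nothing is
    // subtracted from atom j's force, so no two threads write the same address.
    d_F[3 * i] = fx;
    d_F[3 * i + 1] = fy;
    d_F[3 * i + 2] = fz;

    // Step 3: per-atom potential slot instead of one contended float.
    // The 0.5 accounts for every pair being visited twice (i->j and j->i).
    d_pot[i] = 0.5f * pot;
}

// Host wrapper: runs the kernel, fills h_F, and returns the summed potential.
float lj_total_potential(const float* h_pos, float* h_F, int N,
                         float epsilon, float sigma)
{
    float *d_pos, *d_F, *d_pot;
    size_t vecBytes = 3 * N * sizeof(float);
    cudaMalloc((void**)&d_pos, vecBytes);
    cudaMalloc((void**)&d_F, vecBytes);
    cudaMalloc((void**)&d_pot, N * sizeof(float));
    cudaMemcpy(d_pos, h_pos, vecBytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (N + threads - 1) / threads;
    lj_per_atom<<<blocks, threads>>>(d_pos, d_F, d_pot, N, epsilon, sigma);

    std::vector<float> h_pot(N);
    cudaMemcpy(h_F, d_F, vecBytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_pot[0], d_pot, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Sum the per-atom contributions on the host.
    float total = 0.0f;
    for (int i = 0; i < N; ++i) total += h_pot[i];

    cudaFree(d_pos);
    cudaFree(d_F);
    cudaFree(d_pot);
    return total;
}
```

Each pair is evaluated twice in this scheme, which doubles the arithmetic but removes all write conflicts between threads. The final sum is done on the host here purely for simplicity; a device-side reduction would serve equally well if the copy back becomes noticeable.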

All in all, rather a good result, especially since I didn’t have to get involved in shared memory, float4 hacks and the like.