I have implemented a very naive calculation of the Lennard-Jones potential for a molecular system and would like your help in optimising it. I have read the “Fast N-body Simulation” article several times, but I have difficulty adopting its approach, primarily because I cannot store the position or force vectors in float4 structures: I need to do some rather involved matrix manipulation on them (Cholesky square root via the MAGMA library and various multiply and linear-solve operations).
At present I am using the most inefficient calculation method possible - a single thread executing on a single multiprocessor, loading from and writing to global memory. Clearly I’m taking no advantage of parallelism in this implementation, which is something I need to resolve.
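To make the question concrete, here is a rough sketch of the kind of restructuring I imagine is needed: one thread per particle, with the other particles' positions staged through shared memory in tiles, but using plain flat arrays instead of float4. Everything in it is an assumption for illustration rather than what is in lj.cu: the names `lj_forces`, `lj_forces_host` and `TILE`, the x0,y0,z0,x1,... layout, single precision, and the absence of a cutoff or periodic boundaries.

```cuda
#include <cuda_runtime.h>

#define TILE 128  // threads per block and tile width (assumed value, tune per GPU)

// One thread per particle i. The block walks over all particles j in tiles
// that are staged through shared memory. pos and force are flat device arrays
// of length 3*n laid out as x0,y0,z0,x1,... (assumed layout).
// Must be launched with blockDim.x == TILE.
__global__ void lj_forces(const float *pos, float *force, int n,
                          float epsilon, float sigma)
{
    __shared__ float tile[3 * TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float xi = 0.f, yi = 0.f, zi = 0.f;
    if (i < n) { xi = pos[3 * i]; yi = pos[3 * i + 1]; zi = pos[3 * i + 2]; }

    float fx = 0.f, fy = 0.f, fz = 0.f;
    const float s2 = sigma * sigma;

    for (int base = 0; base < n; base += TILE) {
        // Each thread loads one particle of the current tile.
        int j = base + threadIdx.x;
        if (j < n) {
            tile[3 * threadIdx.x]     = pos[3 * j];
            tile[3 * threadIdx.x + 1] = pos[3 * j + 1];
            tile[3 * threadIdx.x + 2] = pos[3 * j + 2];
        }
        __syncthreads();

        int count = min(TILE, n - base);
        if (i < n) {
            for (int k = 0; k < count; ++k) {
                if (base + k == i) continue;  // skip self-interaction
                float dx = xi - tile[3 * k];
                float dy = yi - tile[3 * k + 1];
                float dz = zi - tile[3 * k + 2];
                float r2  = dx * dx + dy * dy + dz * dz;
                float sr2 = s2 / r2;
                float sr6 = sr2 * sr2 * sr2;
                // F_i = 24 eps [2 (sigma/r)^12 - (sigma/r)^6] / r^2 * (r_i - r_j)
                float scale = 24.f * epsilon * sr6 * (2.f * sr6 - 1.f) / r2;
                fx += scale * dx;
                fy += scale * dy;
                fz += scale * dz;
            }
        }
        __syncthreads();  // finish with this tile before it is overwritten
    }

    if (i < n) {
        force[3 * i]     = fx;
        force[3 * i + 1] = fy;
        force[3 * i + 2] = fz;
    }
}

// Hypothetical host helper for testing: copies positions to the device,
// launches the kernel, and copies the forces back. Error checks omitted.
void lj_forces_host(const float *h_pos, float *h_force, int n,
                    float epsilon, float sigma)
{
    float *d_pos = nullptr, *d_force = nullptr;
    size_t bytes = (size_t)3 * n * sizeof(float);
    cudaMalloc(&d_pos, bytes);
    cudaMalloc(&d_force, bytes);
    cudaMemcpy(d_pos, h_pos, bytes, cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;
    lj_forces<<<blocks, TILE>>>(d_pos, d_force, n, epsilon, sigma);

    cudaMemcpy(h_force, d_force, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_pos);
    cudaFree(d_force);
}
```

Since the kernel only takes plain device pointers, my hope is that the same buffers could be passed on to the matrix routines without any conversion, and that the host copies in the helper would disappear once everything stays on the device. I would appreciate knowing whether this is a sensible direction or whether the flat layout throws away too much of the benefit of the float4 approach.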
I have attached my kernel file to this message. If you are able to give me some advice based on your experience, I would be eternally grateful.
lj.cu (1.03 KB)