Optimization of Molecular Dynamics Code

Hey Folks,

I am trying to port an existing Molecular Dynamics Application to the GPU. I am done coding and am trying to optimize my code.

I have tried these approaches:

Using fast math: __fmul_rn() for multiplication of two floats, __fdividef() for division, and __mul24() for multiplication of integers.

I have also tried putting #pragma unroll above all the inner for loops that I have in my kernel.

But all this has been to no avail: I cannot get any speedup. Can anyone explain why I don't get any speedup using these? What else can I do?

I know shared memory is one option, but I am not using it right now.

Please help.

Hi, it’s likely that your computations are memory-bandwidth limited. That is, the time it takes to load data from global memory and to write the results back may be an order of magnitude greater than the time spent computing. In that case, optimizing the computation will not give any speedup. I suggest you read the slides on reduction:

reduction.ppt

In our airflow dynamics GPGPU code we also had a number of conditional branches (carried over from the original C++ code), so branching was another bottleneck, since, as you know, threads have to execute every line of the untaken branch unless all threads in a warp take the same path (if I recall correctly).

How can I check whether my application is memory bound or not?

Also, I am using a GTX 280, and I have heard that it takes care of making coalesced accesses to global memory. Is that true, or do I need to take care of it myself?

You can compute (number of float operations) / (number of global reads/writes); if this number is small, then your application is memory-bound.
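For example, if computing one distance loads 6 floats from global memory and does about nine floating-point operations (three subtractions, three multiplications, two additions, and a square root), the ratio is only about 1.5 flops per float loaded, which is far too low to hide global memory latency.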

Maybe you can post part of your kernel code and we can check it.

It is less restrictive, but it is not cached, so you still need to take some care. The most important issue, however, is that you are not using shared memory. You are reusing the particle positions many times to evaluate forces, so they need to be cached in shared memory. Then you need to take care of coalescing and bank conflicts to avoid slowness. But the first point counts for 90%, the second for 10%.

The point is that porting to the GPU is not just about translating the code; you need to reorganize it to respect the memory structure of this architecture. From what I have heard, the next NVIDIA architecture (FERMI) will have a cache and so will probably be less restrictive in this regard.

Coming back to your problem, in your place I would investigate how others have approached it: I know LAMMPS is now CUDA-enabled… I do not know about the others… GROMACS, NAMD,…

Well, it is not possible for me to post the whole kernel, because it’s quite big and I won’t be able to make people understand what is happening.

I can definitely check the number of float operations and the number of global memory reads/writes. Do you mean I should get these numbers from the CUDA Profiler, or is there another way to do it?

The reason I am not able to use shared memory is that it’s not just atom-to-atom computation for me. I am actually using an approximation method called Hierarchical Charge Partitioning, which I have worked on myself. In this scheme, I have data about complexes, strands, residues, and atoms.

What it does is compute the distance between an atom and the approximate point charge of a complex; if this distance is greater than the set threshold value, I just use the approximate charge and skip looping over all the atoms of the complex, and likewise for strands and residues.

So I do not know in advance which data will be required when, and thus cannot know what to put into shared memory, since my data accesses are not uniform, if you see what I mean.

You don’t need the CUDA Profiler to compute this ratio; just estimate it from your code.

You can describe pseudocode of the sequential version and pseudocode of the CUDA version.

You need to reorganize the order of execution such that you can reuse data (put data into shared memory; sigismondo has mentioned this, and matrix multiplication is a good example). Second, you can check the coalescing pattern of your global reads/writes.

Here it goes. The serial version is like this:

Hierarchically, a molecule can be broken into complexes, strands within complexes, residues within strands, and atoms within residues.

I have a separate list for each level which stores the coordinates and the charge of each subunit. So there are 4 lists, and they are arrays of structures, roughly like the sketch below.
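(The type and array names here are made up for illustration, not my actual code:)

/* Four arrays of structures, one per level of the hierarchy. */
typedef struct {
    float x, y, z;   /* coordinates of the subunit            */
    float charge;    /* (approximate) point charge of the unit */
} Subunit;

Subunit complexes[NUM_COMPLEXES];
Subunit strands[NUM_STRANDS];
Subunit residues[NUM_RESIDUES];
Subunit atoms[NUM_ATOMS];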

Next, I iterate through each atom and check whether its distance from the complex is greater than a certain threshold. If it is, I just compute the potential using the charge of the complex and skip all the atoms in it. If it is not greater than the threshold, I compute the distance between the atom and the strand and do the same thing.

For CUDA version:

I have so many arrays to pass that if I pass them as they are, they exceed the 256-byte limit on kernel parameters. Thus, I pack them all into a structure and pass a pointer to that structure to the kernel, and each thread computes the potential for one atom. Therefore, number of threads = number of atoms.
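Roughly like this (a simplified sketch; the struct members and names are illustrative, not my exact code):

/* Pack all device pointers and sizes into one struct, then pass a
   single pointer to stay under the 256-byte parameter limit. */
typedef struct {
    float *atoms, *complexes, *strands, *residues; /* device pointers */
    float *potential;                              /* output          */
    int nAtoms, nComplexes, nStrands, nResidues;
    float threshold;
} KernelArgs;

KernelArgs h_args = { /* ... fill in device pointers and sizes ... */ };
KernelArgs *d_args;
cudaMalloc((void **)&d_args, sizeof(KernelArgs));
cudaMemcpy(d_args, &h_args, sizeof(KernelArgs), cudaMemcpyHostToDevice);
computePotential<<<nBlocks, threadsPerBlock>>>(d_args);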

In the kernel, I loop over each atom and do what I describe for the serial version.

As far as coalescing is concerned, my accesses are only partially coalesced, and I am still wondering how I can improve that, because each thread accesses data from various locations, i.e., from all the arrays of structures I just described. For example, consider computing the distance between an atom and a subunit: I access three locations in the atom array, which holds the x, y, z coordinates of the atoms, and three locations in the subunit’s array. That is six locations per thread, and I have to do this for all four levels, so 24 different locations in total.

Phew. This has been a bit long, but your suggestions and help are more than welcome.

Thanks.

I have no experience with molecular dynamics, so let me try to understand what you do:

for each atom
	for each complex
		if ( distance(atom, complex) is greater or equal to threshold ) then
			compute potential taking the charge of the complex
		else
			compute the distance between the atom and the strand
		endif
	endfor
endfor

This is an O(N^2) method, and you need to load the (x,y,z) coordinates of both the atom and the complex for each potential computation.

If this is the picture, then I would suggest you use shared memory to store a block of atoms and a part of the complexes:

__shared__ point atom[BLOCKS];

__shared__ point partOfComplexes[BLOCKS];

for each sub-block of the complex list
	step 1: load the sub-block into "partOfComplexes"
	synch.

	step 2: compute the potential for each complex in "partOfComplexes"
	for each complex in "partOfComplexes"
		if ( distance(atom, complex) is greater or equal to threshold ) then
			compute potential taking the charge of the complex
		else
			compute the distance between the atom and the strand
		endif
	endfor
	synch.
endfor

This would save a large amount of global memory reads/writes.
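As a concrete illustration, here is a minimal CUDA sketch of this tiling pattern (assumptions on my part: coordinates and charge packed into a float4, a simple q/r potential, a made-up BLOCK_SIZE, and threshold2 being the squared threshold; the descent to the strand level is omitted):

#define BLOCK_SIZE 128

/* One thread per atom; complexes are streamed through shared memory
   in tiles so every thread in the block reuses each loaded complex. */
__global__ void potentialKernel(const float4 *atoms, const float4 *complexes,
                                int nAtoms, int nComplexes,
                                float threshold2, float *potential)
{
    __shared__ float4 tile[BLOCK_SIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 a = (i < nAtoms) ? atoms[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float phi = 0.0f;

    for (int base = 0; base < nComplexes; base += BLOCK_SIZE) {
        /* step 1: each thread loads one complex of the current tile */
        if (base + threadIdx.x < nComplexes)
            tile[threadIdx.x] = complexes[base + threadIdx.x];
        __syncthreads();

        /* step 2: every thread in the block reuses the whole tile */
        int n = min(BLOCK_SIZE, nComplexes - base);
        for (int j = 0; j < n; ++j) {
            float dx = a.x - tile[j].x;
            float dy = a.y - tile[j].y;
            float dz = a.z - tile[j].z;
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 >= threshold2)
                phi += tile[j].w * rsqrtf(r2); /* far: approximate charge */
            /* else: descend to the strand level (omitted in this sketch) */
        }
        __syncthreads(); /* before the tile is overwritten */
    }

    if (i < nAtoms)
        potential[i] = phi;
}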

Regarding the coalescing pattern, I think it is better to store the coordinates in separate x, y, and z arrays (not as an array of structures (x,y,z)); then you would get coalesced reads for both atoms and complexes.
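Schematically (illustrative names):

/* Array of structures: thread i reads atoms_aos[i].x, so consecutive
   threads read with a stride and a half-warp spans several memory
   segments - poor coalescing on this hardware. */
struct Point { float x, y, z, q; };
struct Point atoms_aos[N];

/* Structure of arrays: thread i reads atom_x[i], so consecutive
   threads read consecutive floats - fully coalesced. */
float atom_x[N], atom_y[N], atom_z[N], atom_q[N];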

As for branching, it is unavoidable in your application.

In order to avoid the problem of partition camping, I would suggest that each thread block deal with one partition of global memory.

A very crude but easy way… change your graphics card’s memory speed. If you can make memory slower (say 80% speed) and yet your kernel speed is unchanged, then you’re likely not global memory throughput bound.

A more scientific measurement is to make a graph of memory speed versus application speed and see if you get a flat spot (memory speed doesn’t matter) or a constant slope (memory bound), and perhaps even see a soft kink joining the two regions (showing where the transition from compute to memory bottleneck occurs.)

This isn’t super-useful for telling WHY you’re memory bound but it is really fast to see at least if you ARE memory bound.

There are several apps for changing your shader and memory clocks, including NV’s own nTune.

Note you can keep memory constant and reduce shader clocks instead to see “the other half” of the graph.

And some apps aren’t limited by either clock… for example if you’re PCIe bandwidth bound.

I was also thinking along the same lines but could not find any application that would allow me to change the memory frequency of my GTX 280 on Linux. If you know of any, please let me know.

I did manage to change the clock frequency of the GTX 280 using the Coolbits option in xorg.conf. However, I find there are two clock frequencies, GPU (core) and memory, and I would like to know which is used for what purpose. Which one do I need to change to affect the speed of data transfers to/from global memory?
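For reference, this is the sort of entry I mean (the exact Coolbits value that unlocks the clock controls depends on the driver version):

Section "Device"
    Identifier "Device0"
    Driver     "nvidia"
    Option     "Coolbits" "1"   # enables clock controls in nvidia-settings
EndSection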

I am doing this to find whether my application is memory bound or compute bound.

I see. I understand what you mean; the substance does not change: the same information needs to be reused many times per step, so it is mandatory to cache it in shared memory to optimize its reuse.

I would subdivide your task for the evaluation of the forces into subloops:

atoms-complex

atoms-strands

strands-complex

atoms-atoms

then optimize each in the usual way, as for atoms (load blocks of atoms/complexes/strands into shared memory - typically they won’t all fit in the 16 KB - evaluate all the force contributions due to them, and iterate over all the blocks until all the forces are evaluated, with the cooperation of all available threads).

I would start from the most time-consuming one; some profiling will help you with this. I guess it is going to be atoms-atoms, unless you have long-range electrostatics: in that case the FFT time (Ewald) dominates all the short-range interactions, and you could just start by parallelizing that.

The trickiest part is the code that prepares the data for the loops and guarantees that you are looping over the correct stuff. Some binning will help you avoid evaluating all the distances and keep this part O(N). Obviously I am thinking of large systems!

The loops would be the compute intensive part.

Again, you can toss this post in the basket if Ewald sums are involved: in that case they account for most of the computing effort.

If it helps, I would start by porting and optimizing an easy loop - a Lennard-Jones argon MD - just to gain confidence with this architecture. Then parallelize within a block, and then across more blocks: as I was saying, porting to the GPU is not just a matter of translating. With an easier problem to manage, you will start mastering CUDA much more easily.
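For example, the inner loop of such a toy Lennard-Jones code might look like this sketch (one thread per particle, brute-force O(N^2), no tiling yet; all names are illustrative):

/* Brute-force Lennard-Jones forces: a toy to practice the tiling,
   coalescing, and shared-memory techniques discussed above. */
__global__ void ljForces(const float4 *pos, float4 *force,
                         int n, float eps, float sigma2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 pi = pos[i];
    float fx = 0.f, fy = 0.f, fz = 0.f;

    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pi.x - pos[j].x;
        float dy = pi.y - pos[j].y;
        float dz = pi.z - pos[j].z;
        float r2 = dx * dx + dy * dy + dz * dz;
        float sr2 = sigma2 / r2;          /* (sigma/r)^2 */
        float sr6 = sr2 * sr2 * sr2;      /* (sigma/r)^6 */
        /* |F|/r = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6)/r^2 */
        float f = 24.f * eps * sr6 * (2.f * sr6 - 1.f) / r2;
        fx += f * dx; fy += f * dy; fz += f * dz;
    }
    force[i] = make_float4(fx, fy, fz, 0.f);
}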

The alternative: wait for NVIDIA to commercialize their FERMI GPUs, which incorporate a cache, and hope it works properly!!! :rolleyes:

Hi,

have a look at HooMD; it uses CUDA.

As mentioned earlier, the biggest problem is probably how to speed up your O(N^2) operation. As a first step, you should try to use shared memory in your neighbor-list update. That could lead to a significant speedup; however, it does not reduce the complexity. Then you might want to look at cell lists (instead of Verlet neighbor lists), which become mandatory if your system is large (>1000 particles). The core of a cell list is just binning particles by position, roughly like the sketch below.
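(A minimal sketch with illustrative names; it assumes coordinates already shifted into [0, L) and a cell edge at least as large as the interaction cutoff, so each particle only has to search the 27 neighboring cells:)

/* Map a position to a cell index on a uniform grid. With a cell edge
   >= the cutoff, the neighbor search becomes O(N). */
__host__ __device__ int cellIndex(float x, float y, float z,
                                  float cellSize, int nx, int ny)
{
    int cx = (int)(x / cellSize);
    int cy = (int)(y / cellSize);
    int cz = (int)(z / cellSize);
    return (cz * ny + cy) * nx + cx;   /* linearized 3D cell index */
}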

kind regards,
Jens.