I have no experience with molecular dynamics, so let me first check that I understand what you do:
for each atom
    for each complex
        if ( distance(atom, complex) is greater than or equal to threshold ) then
            compute potential taking the charge of the complex
        else
            compute the distance between the atom and the strand
        endif
    endfor
endfor
This is an O(N^2) method, and you need to load the (x,y,z) coordinates of the atom and the complex for each potential computation.
If this is the picture, then I would suggest using shared memory to store a block of atoms and a part of the complexes:
__shared__ point atom[BLOCKS];
__shared__ point partOfComplexes[BLOCKS];
for each sub-block of complexes
    step 1: load the sub-block into "partOfComplexes"
            sync (__syncthreads)
    step 2: compute the potential for each complex in "partOfComplexes"
            for each complex in "partOfComplexes"
                if ( distance(atom, complex) is greater than or equal to threshold ) then
                    compute potential taking the charge of the complex
                else
                    compute the distance between the atom and the strand
                endif
            endfor
            sync again before the next sub-block overwrites "partOfComplexes"
endfor
This would reduce the amount of global memory traffic, because each complex loaded into shared memory is reused by every thread in the block.
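The tiling loop above could look roughly like the following in CUDA. This is only a minimal sketch under several assumptions that are not from your post: coordinates and charges are stored in separate float arrays, the far-range term is a simple charge/distance sum, and the near-range (atom vs. strand) branch is a placeholder named nearTerm that contributes nothing. The names potentialTiled, computePotential, upload and TILE are made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cmath>
#include <vector>

#define TILE 128  // assumed: threads per block == complexes per shared-memory tile

// Hypothetical placeholder for the near-range (atom vs. strand) branch.
// The real formula is not given in this thread, so it contributes nothing here.
__device__ float nearTerm(float /*r*/) { return 0.0f; }

__global__ void potentialTiled(const float* ax, const float* ay, const float* az, int nAtoms,
                               const float* cx, const float* cy, const float* cz,
                               const float* cq, int nComplexes,
                               float threshold, float* potential)
{
    __shared__ float sx[TILE], sy[TILE], sz[TILE], sq[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one atom per thread
    bool active = (i < nAtoms);                     // idle threads still help load tiles
    float x = active ? ax[i] : 0.0f;
    float y = active ? ay[i] : 0.0f;
    float z = active ? az[i] : 0.0f;
    float sum = 0.0f;

    for (int base = 0; base < nComplexes; base += TILE) {
        // Step 1: each thread loads one complex of the current tile.
        int j = base + threadIdx.x;
        if (j < nComplexes) {
            sx[threadIdx.x] = cx[j];
            sy[threadIdx.x] = cy[j];
            sz[threadIdx.x] = cz[j];
            sq[threadIdx.x] = cq[j];
        }
        __syncthreads();

        // Step 2: every atom in the block reuses the tile from shared memory.
        int count = min(TILE, nComplexes - base);
        for (int k = 0; active && k < count; ++k) {
            float dx = x - sx[k], dy = y - sy[k], dz = z - sz[k];
            float r = sqrtf(dx * dx + dy * dy + dz * dz);
            if (r >= threshold) sum += sq[k] / r;   // assumed far-range term
            else                sum += nearTerm(r); // placeholder near-range term
        }
        __syncthreads();  // do not overwrite the tile while others still read it
    }
    if (active) potential[i] = sum;
}

// Copies a host vector to a freshly allocated device buffer (error checks omitted).
static float* upload(const std::vector<float>& v) {
    float* d = nullptr;
    cudaMalloc(&d, v.size() * sizeof(float));
    cudaMemcpy(d, v.data(), v.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

// Hypothetical host wrapper: assumes at least one atom and threshold > 0.
std::vector<float> computePotential(const std::vector<float>& ax, const std::vector<float>& ay,
                                    const std::vector<float>& az,
                                    const std::vector<float>& cx, const std::vector<float>& cy,
                                    const std::vector<float>& cz, const std::vector<float>& cq,
                                    float threshold)
{
    int nAtoms = (int)ax.size(), nComplexes = (int)cx.size();
    assert(nAtoms > 0 && threshold > 0.0f);  // far branch divides by r >= threshold
    float *dax = upload(ax), *day = upload(ay), *daz = upload(az);
    float *dcx = upload(cx), *dcy = upload(cy), *dcz = upload(cz), *dcq = upload(cq);
    float* dPot = nullptr;
    cudaMalloc(&dPot, nAtoms * sizeof(float));

    int blocks = (nAtoms + TILE - 1) / TILE;
    potentialTiled<<<blocks, TILE>>>(dax, day, daz, nAtoms, dcx, dcy, dcz, dcq,
                                     nComplexes, threshold, dPot);

    std::vector<float> out(nAtoms);
    cudaMemcpy(out.data(), dPot, nAtoms * sizeof(float), cudaMemcpyDeviceToHost);
    for (float* p : {dax, day, daz, dcx, dcy, dcz, dcq, dPot}) cudaFree(p);
    return out;
}
```

Each thread keeps its own atom in registers here, so only the complexes need shared memory; whether that beats also staging atoms in shared memory would have to be measured on your data.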
As for a coalesced access pattern, I think it is better to store the coordinates in separate x, y, and z arrays (a structure of arrays) rather than in an array of (x,y,z) structures; then the reads for atoms and complexes would be coalesced.
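To make the layout difference concrete, here is a small sketch of converting an array of (x,y,z) structures into separate arrays and reading them in a kernel. The names PointAoS, PointsSoA, toSoA, distanceToPoint and runDistanceToPoint are hypothetical, and the kernel only computes the distance from each point to one fixed position as a stand-in for the real work.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cmath>
#include <vector>

// Array-of-structures layout: thread i reads fields 12 bytes apart from thread i+1,
// so a warp's loads of .x are strided rather than contiguous.
struct PointAoS { float x, y, z; };

// Structure-of-arrays layout: consecutive threads read consecutive floats.
struct PointsSoA { std::vector<float> x, y, z; };

// Host-side conversion from AoS to SoA.
PointsSoA toSoA(const std::vector<PointAoS>& pts) {
    PointsSoA s;
    for (const PointAoS& p : pts) {
        s.x.push_back(p.x);
        s.y.push_back(p.y);
        s.z.push_back(p.z);
    }
    return s;
}

// Thread i reads x[i], y[i], z[i]; within a warp each of the three loads
// touches one contiguous run of addresses.
__global__ void distanceToPoint(const float* x, const float* y, const float* z, int n,
                                float px, float py, float pz, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float dx = x[i] - px, dy = y[i] - py, dz = z[i] - pz;
        out[i] = sqrtf(dx * dx + dy * dy + dz * dz);
    }
}

// Hypothetical host wrapper (error checks omitted; assumes at least one point).
std::vector<float> runDistanceToPoint(const PointsSoA& s, float px, float py, float pz) {
    int n = (int)s.x.size();
    assert(n > 0);
    size_t bytes = n * sizeof(float);
    float *dx = nullptr, *dy = nullptr, *dz = nullptr, *dOut = nullptr;
    cudaMalloc(&dx, bytes); cudaMalloc(&dy, bytes);
    cudaMalloc(&dz, bytes); cudaMalloc(&dOut, bytes);
    cudaMemcpy(dx, s.x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, s.y.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dz, s.z.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 128, blocks = (n + threads - 1) / threads;
    distanceToPoint<<<blocks, threads>>>(dx, dy, dz, n, px, py, pz, dOut);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx); cudaFree(dy); cudaFree(dz); cudaFree(dOut);
    return out;
}
```

If the data already lives in an array of structures on the host, the conversion is a one-time cost before the upload; the kernel itself never sees the interleaved layout.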
As for the branch, it is unavoidable in your application.
To avoid the problem of partition camping, I would suggest that each thread block deal with one partition of global memory.