i’m doing a sum-of-forces problem, i.e. X’i = f(Xi, sum(j, g(Xi,Xj))), where the sum runs over j.

obviously a very data parallel problem.

my question is: what is the best way to slice up and distribute the problem across threads and blocks so that global memory traffic is minimized?

should the Xj values be read from global memory, texture memory, or shared memory, or should i just rely on the cache (compute capability 2.1)?

should each thread get a different i and each block a subset of j?

or what?
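
to make the first option concrete, here is a rough sketch of one layout i am considering: one thread per i, with each block staging the Xj values in shared memory one tile at a time. this is only an illustration, not something i have tuned. the pairwise term g below is a made-up softened inverse-square force so the sketch is complete, and the names sum_forces, sum_forces_host and TILE are my own placeholders. f would then be applied per i to the resulting sum.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define TILE 128  // threads per block; also the number of Xj staged per tile

// hypothetical g(Xi, Xj): softened inverse-square attraction.
// xyz = position, w = mass. a stand-in only; substitute the real g.
__device__ float3 g(float4 xi, float4 xj)
{
    float dx = xj.x - xi.x, dy = xj.y - xi.y, dz = xj.z - xi.z;
    float r2 = dx * dx + dy * dy + dz * dz + 1e-6f;  // softening keeps i == j finite
    float inv = rsqrtf(r2);
    float s = xj.w * inv * inv * inv;
    return make_float3(s * dx, s * dy, s * dz);
}

// one thread per i; the block walks over all j in tiles held in shared memory.
// must be launched with blockDim.x == TILE.
__global__ void sum_forces(const float4* x, float3* sum, int n)
{
    __shared__ float4 tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 xi = (i < n) ? x[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float3 acc = make_float3(0.f, 0.f, 0.f);

    for (int base = 0; base < n; base += TILE) {
        int j = base + threadIdx.x;
        // each thread loads one Xj; out-of-range slots get zero mass,
        // so they add nothing to the sum
        tile[threadIdx.x] = (j < n) ? x[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();  // tile fully loaded before anyone reads it
        for (int k = 0; k < TILE; ++k) {
            float3 f = g(xi, tile[k]);
            acc.x += f.x; acc.y += f.y; acc.z += f.z;
        }
        __syncthreads();  // everyone done before the tile is overwritten
    }
    if (i < n) sum[i] = acc;
}

// host helper (hypothetical): copy in, launch, copy out. error checks omitted.
std::vector<float3> sum_forces_host(const std::vector<float4>& x)
{
    int n = (int)x.size();
    std::vector<float3> out(n);
    if (n == 0) return out;
    float4* d_x; float3* d_sum;
    cudaMalloc(&d_x, n * sizeof(float4));
    cudaMalloc(&d_sum, n * sizeof(float3));
    cudaMemcpy(d_x, x.data(), n * sizeof(float4), cudaMemcpyHostToDevice);
    sum_forces<<<(n + TILE - 1) / TILE, TILE>>>(d_x, d_sum, n);
    cudaMemcpy(out.data(), d_sum, n * sizeof(float3), cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_sum);
    return out;
}
```

with this layout each Xj is read from global memory once per block instead of once per thread. what i don't know is whether that beats texture fetches or plain cached global loads on this hardware.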

so many possibilities. i’d much rather not try them all to find the best.

any ideas/opinions would be greatly appreciated.

thanks.

-kevin