i’m doing a sum of forces problem, i.e. X’i = f(Xi, sum(j, g(Xi,Xj))), where the sum runs over all j.
obviously a very data parallel problem.
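for concreteness, here is a serial reference version of what i mean, as a minimal sketch. g and f below are placeholders made up for illustration (a linear pair term and a pass-through), and X is 1-D; none of that is the real problem, only the loop structure matters.

```cpp
#include <cstddef>
#include <vector>

// placeholder pair term g(Xi, Xj): linear pull of j on i (assumption, not the real force law)
inline float g(float xi, float xj) { return xj - xi; }

// placeholder f(Xi, S): just returns the summed force (assumption)
inline float f(float /*xi*/, float s) { return s; }

// serial reference: dX[i] = f(X[i], sum over j of g(X[i], X[j]))
std::vector<float> derivative(const std::vector<float>& X) {
    std::vector<float> dX(X.size());
    for (std::size_t i = 0; i < X.size(); ++i) {      // each i is independent of the others
        float s = 0.0f;
        for (std::size_t j = 0; j < X.size(); ++j) {  // every i reads all of X
            s += g(X[i], X[j]);
        }
        dX[i] = f(X[i], s);
    }
    return dX;
}
```

each i needs every Xj, so the work is O(N^2) while the data is only O(N). that reuse is why where the Xj's live matters.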
my question is: what is the best way to slice up and distribute the problem to minimize memory bandwidth usage, etc.?
should i use global memory, texture memory, shared memory, or rely on the L1/L2 cache (the card is compute capability 2.1)?
should each thread get a different i and each block a subset of j?
so many possibilities. i’d much rather not try them all to find the best.
any ideas/opinions would be greatly appreciated.