best memory access pattern for O(N^2) problem?

i’m doing a sum of forces problem, i.e. X’i = f(Xi, sum(j, g(Xi,Xj))).

obviously a very data parallel problem.

my question is what is the best way to slice up and distribute the problem to minimize memory bandwidth, etc.

should i use global memory, texture memory, shared memory, or cache (compute capability 2.1)?
should each thread get a different i and each block a subset of j?
or what?

so many possibilities. i’d much rather not try them all to find the best.

any ideas/opinions would be greatly appreciated.
thanks.
-kevin

Check out the ‘doc’ folder in the n-body example in the CUDA SDK. It’s got a copy of a chapter from GPU Gems 3, where they discuss some of the algorithms used to get good performance out of the GPU (the problem is similar, it’s also O(n^2)). Also, have a look at the Wikipedia article on n-body simulations, which has some detailed information on the algorithms used in the example.
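
To answer the "each thread a different i, each block a subset of j" question directly: the approach in that chapter is roughly that each thread owns one i, and the block walks over all j in tiles staged through shared memory, so each position is read from global memory once per block instead of once per thread. Below is a minimal sketch of that idea, not the SDK code. The names `pairTerm`, `sumForces`, `runSumForces` and `TILE` are made up for illustration, and the softened inverse-square term is only an assumed stand-in for your g(Xi,Xj).

```cuda
#include <vector>
#include <cmath>
#include <cuda_runtime.h>

#define TILE 128  // threads per block, and number of j bodies staged per tile

// Hypothetical pairwise term g(Xi, Xj): softened inverse-square attraction.
// The w component is assumed to hold the mass of body j.
__host__ __device__ inline float3 pairTerm(float4 xi, float4 xj, float3 acc) {
    float dx = xj.x - xi.x, dy = xj.y - xi.y, dz = xj.z - xi.z;
    float r2 = dx * dx + dy * dy + dz * dz + 0.01f;  // softening keeps i == j finite
    float inv = 1.0f / sqrtf(r2);
    float s = xj.w * inv * inv * inv;
    acc.x += dx * s; acc.y += dy * s; acc.z += dz * s;
    return acc;
}

// One thread per i. The block must be launched with exactly TILE threads.
__global__ void sumForces(const float4* pos, float3* out, int n) {
    __shared__ float4 tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads stay alive so they can still help load tiles.
    float4 xi = (i < n) ? pos[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float3 acc = make_float3(0.f, 0.f, 0.f);
    for (int base = 0; base < n; base += TILE) {
        int j = base + threadIdx.x;
        // Pad the last tile with zero-mass bodies, which contribute nothing.
        tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();                 // tile fully loaded
        for (int k = 0; k < TILE; ++k)
            acc = pairTerm(xi, tile[k], acc);
        __syncthreads();                 // everyone done before the next overwrite
    }
    if (i < n) out[i] = acc;
}

// Host wrapper: copy in, launch, copy out. Error checking omitted for brevity.
std::vector<float3> runSumForces(const std::vector<float4>& h) {
    int n = (int)h.size();
    std::vector<float3> result(n);
    if (n == 0) return result;
    float4* dPos = nullptr;
    float3* dOut = nullptr;
    cudaMalloc(&dPos, n * sizeof(float4));
    cudaMalloc(&dOut, n * sizeof(float3));
    cudaMemcpy(dPos, h.data(), n * sizeof(float4), cudaMemcpyHostToDevice);
    sumForces<<<(n + TILE - 1) / TILE, TILE>>>(dPos, dOut, n);
    cudaMemcpy(result.data(), dOut, n * sizeof(float3), cudaMemcpyDeviceToHost);
    cudaFree(dPos);
    cudaFree(dOut);
    return result;
}
```

Fill a `std::vector<float4>` with positions (mass in `w`), call `runSumForces`, and apply your f to each returned sum. With this layout the inner loop reads only registers and shared memory, so global memory traffic drops by about a factor of TILE compared with every thread reading every Xj itself. The best TILE value depends on the card, so treat 128 as a starting guess.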

thanks. yeah, i spoke too soon. i found the nbody example in the sdk, and it’s perfect for my purposes. after i do a little stitching and try it out i might consider modifying it for the faster algorithms.

-kevin
