i’m doing a sum of forces problem, i.e. X’i = f(Xi, sum(j, g(Xi,Xj))).
obviously a very data parallel problem.
my question is what is the best way to slice up and distribute the problem to minimize memory bandwidth, etc.
should i use global memory, texture memory, shared memory, or cache (compute capability 2.1)?
should each thread get a different i and each block a subset of j?
or what?
so many possibilities. i’d much rather not try them all to find the best.
any ideas/opinions would be greatly appreciated.
thanks.
-kevin
Check out the ‘doc’ folder in the n-body example in the CUDA SDK. It’s got a copy of a chapter from GPU Gems 3, where they discuss some of the algorithms used to get good performance out of the GPU (the problem is similar, it’s also O(n^2)). Also, have a look at the Wikipedia article on n-body simulations, which has some detailed information on the algorithms used in the example.
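To give a rough idea of the tiling scheme that chapter describes, here is a minimal sketch, not the SDK's actual code. Each thread owns one i, and each block stages the Xj values through shared memory one tile at a time, so every global read is reused by all threads in the block. The names `pair_term`, `sum_forces`, `compute_forces` and the `TILE` value of 128 are my own assumptions, and the softened gravity term is only a stand-in for your g; you would apply your f to the returned sums afterwards.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 128  // assumed: threads per block == bodies per shared-memory tile

// Hypothetical g(Xi, Xj): softened inverse-square attraction, mass stored in .w.
// Swap in your own pair term here.
__device__ float3 pair_term(float4 xi, float4 xj, float3 acc) {
    float dx = xj.x - xi.x, dy = xj.y - xi.y, dz = xj.z - xi.z;
    float r2 = dx * dx + dy * dy + dz * dz + 1e-4f;  // softening keeps i == j finite
    float inv = rsqrtf(r2);
    float s = xj.w * inv * inv * inv;
    acc.x += dx * s;
    acc.y += dy * s;
    acc.z += dz * s;
    return acc;
}

// One thread per i; the block walks over all j in tiles of TILE bodies.
// Must be launched with blockDim.x == TILE.
__global__ void sum_forces(const float4* x, float3* out, int n) {
    __shared__ float4 tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 xi = (i < n) ? x[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float3 acc = make_float3(0.f, 0.f, 0.f);

    for (int base = 0; base < n; base += TILE) {
        int j = base + threadIdx.x;
        // Each thread loads one Xj; out-of-range slots get zero mass,
        // so they contribute nothing to the sum.
        tile[threadIdx.x] = (j < n) ? x[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();  // tile fully loaded before anyone reads it
        for (int k = 0; k < TILE; ++k)
            acc = pair_term(xi, tile[k], acc);
        __syncthreads();  // everyone done before the tile is overwritten
    }
    if (i < n) out[i] = acc;
}

// Host helper (hypothetical): copies positions up, runs the kernel, copies sums back.
// Error checking is left out to keep the sketch short.
std::vector<float3> compute_forces(const std::vector<float4>& h_x) {
    int n = static_cast<int>(h_x.size());
    std::vector<float3> h_out(n);
    if (n == 0) return h_out;

    float4* d_x = nullptr;
    float3* d_out = nullptr;
    cudaMalloc(&d_x, n * sizeof(float4));
    cudaMalloc(&d_out, n * sizeof(float3));
    cudaMemcpy(d_x, h_x.data(), n * sizeof(float4), cudaMemcpyHostToDevice);

    sum_forces<<<(n + TILE - 1) / TILE, TILE>>>(d_x, d_out, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float3), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_out);
    return h_out;
}
```

With this layout each block reads every Xj from global memory once per tile instead of once per thread, which is where most of the bandwidth saving comes from. The tile size is something to tune for your card rather than a recommendation.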
thanks. yeah, i spoke too soon. i found the nbody example in the sdk, and it’s perfect for my purposes. after i do a little stitching and try it out i might consider modifying it for the faster algorithms.