Particle vs Particle optimization

Hi,

I’m just wondering if someone can explain to me some odd results I got with a particle system on the GPU. The particles do particle vs particle interactions.
e.g.:
  for i = 0 … all particles
    for j = 0 … all particles
      if (i != j)
        interact(i, j)

I tried parallelizing this in two different ways:

  1. A 2D approach where each particle interaction runs in its own thread,
    i.e. i comes from threadIdx.x and j comes from threadIdx.y

  2. A 1D approach where each particle runs in its own thread, and the inner loop over the other particles runs inside the kernel,
    i.e. i comes from threadIdx.x and j comes from a for loop in the kernel
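For concreteness, the two approaches might look roughly like the CUDA sketch below. This is not the original code: the interact function, the 1D positions, the kernel names, and the atomicAdd used to combine per-pair results in the 2D version are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical pair interaction: softened inverse-square pull of j on i (1D for brevity).
__device__ float interact(float xi, float xj) {
    float d = xj - xi;
    float r2 = d * d + 1e-4f;          // softening term avoids division by zero
    return d / (r2 * sqrtf(r2));
}

// Approach 1: one thread per (i, j) pair.
__global__ void pairPerThread(const float* x, float* f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < n && j < n && i != j) {
        // Two global loads per pair, plus an atomic to combine the results for i.
        atomicAdd(&f[i], interact(x[i], x[j]));
    }
}

// Approach 2: one thread per particle i, looping over j inside the kernel.
__global__ void particlePerThread(const float* x, float* f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float xi = x[i];                   // loaded once, then kept in a register
    float acc = 0.0f;                  // running sum also stays in a register
    for (int j = 0; j < n; ++j)
        if (i != j) acc += interact(xi, x[j]);
    f[i] = acc;                        // single write per particle
}

// Host helper (hypothetical): runs both kernels on the same input and
// returns the largest absolute difference between their outputs.
float maxDifference(int n) {
    std::vector<float> hx(n), h1(n), h2(n);
    for (int i = 0; i < n; ++i) hx[i] = 0.5f * i;   // evenly spaced test positions

    float *x, *f1, *f2;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&f1, n * sizeof(float));
    cudaMalloc(&f2, n * sizeof(float));
    cudaMemcpy(x, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(f1, 0, n * sizeof(float));           // approach 1 accumulates, so start at zero

    dim3 block2d(16, 16), grid2d((n + 15) / 16, (n + 15) / 16);
    pairPerThread<<<grid2d, block2d>>>(x, f1, n);
    particlePerThread<<<(n + 255) / 256, 256>>>(x, f2, n);

    cudaMemcpy(h1.data(), f1, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h2.data(), f2, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(x); cudaFree(f1); cudaFree(f2);

    float worst = 0.0f;
    for (int i = 0; i < n; ++i) worst = fmaxf(worst, fabsf(h1[i] - h2[i]));
    return worst;
}
```

Both kernels should produce the same per-particle sums up to floating-point rounding, since the atomic version adds the terms in a different order.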

I found the second approach to be significantly faster. Does anyone know why this might be? I assumed the first approach would be faster since it would use the GPU’s threads & scheduler/latency hiding more efficiently?

Thanks.

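The first approach loads TWO particles for each of the n^2 pairs, i.e. 2n^2 loads.
The second approach loads particle i once per thread and then one particle per loop iteration, i.e. about n^2 loads.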

I’m guessing the particle positions come from global memory. With the first approach you’re therefore doing twice as many memory loads for the same result.
Check out the tile algorithm in http://developer.download.nvidia.com/compu…_gems3_ch31.pdf. It should be of great help with your all-pairs problem.
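A minimal sketch of the tiling idea, not the code from that chapter: each block stages a tile of positions in shared memory, and every thread in the block then reads the tile from there instead of from global memory. The interact function, the 1D positions, the TILE size, and the host helper are assumptions for illustration.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

#define TILE 128  // assumed block size; launch with blockDim.x == TILE

// Hypothetical pair interaction: softened inverse-square pull of j on i (1D for brevity).
__host__ __device__ inline float interact(float xi, float xj) {
    float d = xj - xi;
    float r2 = d * d + 1e-4f;          // softening term avoids division by zero
    return d / (r2 * sqrtf(r2));
}

__global__ void tiledInteract(const float* x, float* f, int n) {
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float xi = (i < n) ? x[i] : 0.0f;  // out-of-range threads still help load tiles
    float acc = 0.0f;

    for (int base = 0; base < n; base += TILE) {
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? x[j] : 0.0f;  // each thread stages one position
        __syncthreads();                            // tile is complete

        int count = min(TILE, n - base);            // last tile may be partial
        for (int k = 0; k < count; ++k)
            if (base + k != i) acc += interact(xi, tile[k]);
        __syncthreads();                            // done reading before the next overwrite
    }
    if (i < n) f[i] = acc;
}

// Host helper (hypothetical): returns the largest absolute difference between
// the kernel output and a plain CPU double loop, or -1 if no CUDA device is found.
float tiledError(int n) {
    int devices = 0;
    if (cudaGetDeviceCount(&devices) != cudaSuccess || devices == 0) return -1.0f;

    std::vector<float> hx(n), hf(n);
    for (int i = 0; i < n; ++i) hx[i] = 0.5f * i;   // evenly spaced test positions

    float *x, *f;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&f, n * sizeof(float));
    cudaMemcpy(x, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    tiledInteract<<<(n + TILE - 1) / TILE, TILE>>>(x, f, n);
    cudaMemcpy(hf.data(), f, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(x); cudaFree(f);

    float worst = 0.0f;
    for (int i = 0; i < n; ++i) {
        float ref = 0.0f;
        for (int j = 0; j < n; ++j)
            if (i != j) ref += interact(hx[i], hx[j]);
        worst = fmaxf(worst, fabsf(hf[i] - ref));
    }
    return worst;
}
```

With this layout each position is read from global memory once per block rather than once per thread, so global loads drop to roughly n^2 / TILE.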