Particle vs Particle optimization


I’m just wondering if someone can explain to me some odd results I got with a particle system on the GPU. The particles do particle vs particle interactions.
    for i = 0 … all particles
        for j = 0 … all particles
            if (i != j) interact particle i with particle j

I tried parallelizing this in two different ways:

  1. A 2D approach where each particle-particle interaction runs in its own thread
    i.e. i comes from threadIdx.x and j comes from threadIdx.y

  2. A 1D approach where each particle runs in its own thread, and the inner loop over the other particles runs inside the kernel
    i.e. i comes from threadIdx.x and j comes from a for loop in the kernel
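
For reference, the two variants could look roughly like the sketch below. This is only an illustration, not the original poster's code: the `pairForce` term, the `float3` position and force arrays, and the kernel names are all assumptions, and the indices use `blockIdx` as well as `threadIdx` so that more than one block of particles can be handled.

```cuda
#include <cuda_runtime.h>

// Hypothetical pair term (not from the original post): softened
// inverse-square pull of particle j on particle i.
__device__ float3 pairForce(float3 pi, float3 pj)
{
    float3 d = make_float3(pj.x - pi.x, pj.y - pi.y, pj.z - pi.z);
    float r2 = d.x * d.x + d.y * d.y + d.z * d.z + 1e-6f;  // softening avoids divide by zero
    float inv = rsqrtf(r2);
    float s = inv * inv * inv;                              // 1 / r^3
    return make_float3(d.x * s, d.y * s, d.z * s);
}

// Approach 1: one thread per (i, j) pair. Every thread loads both particles
// and has to add its contribution into force[i] atomically.
// force[] must be zeroed before the launch.
__global__ void interact2D(const float3* pos, float3* force, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= n || j >= n || i == j) return;

    float3 f = pairForce(pos[i], pos[j]);   // two global loads per thread
    atomicAdd(&force[i].x, f.x);
    atomicAdd(&force[i].y, f.y);
    atomicAdd(&force[i].z, f.z);
}

// Approach 2: one thread per particle i, looping over j inside the kernel.
__global__ void interact1D(const float3* pos, float3* force, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float3 pi = pos[i];                      // loaded once, then kept in registers
    float3 acc = make_float3(0.f, 0.f, 0.f);
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float3 f = pairForce(pi, pos[j]);    // one global load per iteration
        acc.x += f.x; acc.y += f.y; acc.z += f.z;
    }
    force[i] = acc;                          // single plain store, no atomics
}
```

With these assumptions the 2D kernel would be launched with a 2D grid, e.g. `interact2D<<<dim3((n + 15) / 16, (n + 15) / 16), dim3(16, 16)>>>(pos, force, n)`, and the 1D kernel with `interact1D<<<(n + 127) / 128, 128>>>(pos, force, n)`.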

I found the second approach to be significantly faster. Does anyone know why this might be? I assumed the first approach would be faster, since it should make better use of the GPU’s threads and the scheduler’s latency hiding.


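The first approach loads TWO particles in every one of its n^2 threads, so 2 * n^2 particle loads.
The second approach loads only about n^2 particles in total: each thread reads its own particle i once and then loads just particle j inside the loop.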

I’m guessing particle positions come from global memory. You’re therefore doing twice as many memory loads for the same result.
Check out the tile algorithm in…_gems3_ch31.pdf; it should be of great help with your all-pairs problem.
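
The general idea behind tiling can be sketched as follows. This is a minimal illustration of shared-memory tiling under my own assumptions, not the code from that chapter: the kernel name, the `TILE` size, the `float3` layout and the softened inverse-square term are all hypothetical. Each block copies `TILE` particles into shared memory once, and all of its threads then reuse them, so global loads drop from roughly n^2 to roughly n^2 / TILE.

```cuda
#include <cuda_runtime.h>

#define TILE 128  // assumed block size; launch with blockDim.x == TILE

__global__ void interactTiled(const float3* pos, float3* force, int n)
{
    __shared__ float3 tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads still help load tiles, so clamp the index
    // instead of returning early (an early return would break __syncthreads).
    float3 pi = pos[min(i, n - 1)];
    float3 acc = make_float3(0.f, 0.f, 0.f);

    for (int base = 0; base < n; base += TILE) {
        int j = base + threadIdx.x;
        if (j < n) tile[threadIdx.x] = pos[j];  // each thread loads one particle of the tile
        __syncthreads();                         // wait until the tile is fully loaded

        int count = min(TILE, n - base);         // last tile may be partial
        for (int k = 0; k < count; ++k) {
            if (base + k == i) continue;         // skip self-interaction
            float3 pj = tile[k];                 // shared-memory read, not global
            float3 d = make_float3(pj.x - pi.x, pj.y - pi.y, pj.z - pi.z);
            float r2 = d.x * d.x + d.y * d.y + d.z * d.z + 1e-6f;  // hypothetical softening
            float inv = rsqrtf(r2);
            float s = inv * inv * inv;
            acc.x += d.x * s; acc.y += d.y * s; acc.z += d.z * s;
        }
        __syncthreads();                         // finish with this tile before overwriting it
    }
    if (i < n) force[i] = acc;
}
```

Under these assumptions it would be launched as `interactTiled<<<(n + TILE - 1) / TILE, TILE>>>(pos, force, n)`. It keeps the one-thread-per-particle structure of the second approach and only changes where the j particles are read from.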