Hi,
I’m just wondering if someone can explain to me some odd results I got with a particle system on the GPU. The particles do particle vs particle interactions.
e.g.:
for i = 0 … all particles
    for j = 0 … all particles
        if (i != j)
            interact(i, j)
I tried parallelizing this in two different ways:

1. A 2D approach where each particle-particle interaction runs in its own thread,
   i.e. i comes from threadIdx.x and j comes from threadIdx.y.
2. A 1D approach where each particle runs in its own thread and the interactions with all other particles are handled by a loop inside the kernel,
   i.e. i comes from threadIdx.x and j comes from a for loop in the kernel.
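To make the two mappings concrete, here is a minimal CUDA sketch of what they might look like. It is not the actual code behind the timings. Everything beyond the loop structure is an assumption made for illustration: the kernel names `interact2D` and `interact1D`, the 1D positions, the placeholder body of `interact`, the use of `blockIdx` so the grid can cover more particles than fit in one block, and the use of `atomicAdd` to combine the per-pair results for particle i in the 2D version.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for interact(i, j): softened pull of particle j on
// particle i, using 1D positions to keep the sketch short.
__device__ float interact(float xi, float xj) {
    float d = xj - xi;
    return d / (d * d + 0.01f);
}

// 2D mapping: one thread per (i, j) pair. Every thread with the same i has to
// add into the same output slot, so this sketch assumes an atomicAdd there.
__global__ void interact2D(const float* x, float* f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < n && j < n && i != j)
        atomicAdd(&f[i], interact(x[i], x[j]));
}

// 1D mapping: one thread per particle i, with the j loop inside the kernel.
// The running sum stays in a register and is written out once.
__global__ void interact1D(const float* x, float* f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float xi = x[i];
    float sum = 0.0f;
    for (int j = 0; j < n; ++j)
        if (i != j) sum += interact(xi, x[j]);
    f[i] = sum;
}

// Host helper (hypothetical): runs both kernels on the same made-up positions
// and returns the largest absolute difference between their results.
float max_abs_diff_between_mappings(int n) {
    assert(n > 0);
    std::vector<float> hx(n), h1(n), h2(n);
    for (int k = 0; k < n; ++k) hx[k] = 0.37f * k;  // arbitrary test positions

    size_t bytes = n * sizeof(float);
    float *x, *f1, *f2;
    cudaMalloc(&x, bytes);
    cudaMalloc(&f1, bytes);
    cudaMalloc(&f2, bytes);
    cudaMemcpy(x, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(f2, 0, bytes);  // the 2D kernel accumulates, so start from zero

    interact1D<<<(n + 127) / 128, 128>>>(x, f1, n);
    dim3 block(16, 16);
    dim3 grid((n + 15) / 16, (n + 15) / 16);
    interact2D<<<grid, block>>>(x, f2, n);

    // cudaMemcpy on the default stream waits for both kernels to finish.
    cudaMemcpy(h1.data(), f1, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h2.data(), f2, bytes, cudaMemcpyDeviceToHost);
    cudaFree(x);
    cudaFree(f1);
    cudaFree(f2);

    float worst = 0.0f;
    for (int k = 0; k < n; ++k)
        worst = fmaxf(worst, fabsf(h1[k] - h2[k]));
    return worst;
}
```

Calling `max_abs_diff_between_mappings(64)` from a `main` should return a value close to zero, since both kernels compute the same sums and differ only in the order of the floating-point additions.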
I found the second approach to be significantly faster. Does anyone know why this might be? I assumed the first approach would be faster, since it would make better use of the GPU’s threads and the scheduler’s latency hiding.
Thanks.