Sorting particles into cells

ChrisTC · January 31, 2011, 11:13pm

Hi there,

I’m fairly new to CUDA but gained some experience by rewriting a particle-based simulation model in CUDA. However, the performance gain
is quite slow (at about 10x compared to the CPU) - basic problem is the particle sorting into a uniform grid which is the major time-consumer at MY simulation as the collision part is quite simple (as opposed to other particle simulations, where the interaction part is the performance-gainer at the GPU).

I tried both sorting algorithms from Simon Greens particles simulation (sdk examples) but
they are quite slow due to the scattered read/write (the atomic algorithm) or the sorting (the sorting based algorithm). Additionally,
I will try and use texture fetches but I don’t expect a huge
performance gain by that.

Any ideas for improvement?

cheers, Chris

SPWorley · January 31, 2011, 11:21pm

You might look at HOOMD, both the paper and the code. Mr. Anderson does a lot of this kind of voxel distribution.

ChrisTC · January 31, 2011, 11:32pm

Nice pointer, thanks - at the first glance, they didn’t speed up the particle ‘binning’ more than I did - but in molecule dynamics they just do the binning every 10th timestep while I need it

twice every timestep…

DrAnderson42 · February 1, 2011, 1:46pm

Hey, you didn’t have to cross-post to the hoomd-users mailing list - I post in this forum regularly!

The original paper is way out of date. GPUs are so fast now, that the binning on the host was actually becoming a 20% bottleneck!

Fermi and its awesome L2 cache comes to the rescue.

My current binning algorithm, which gets about a 10x speedup over the host, is as simple as:

for all particles in parallel

b = computeBin(x,y,z)

n = atomicInc(&d_bin_size[b])

d_ bins[b][n] = index;

Done!

Since hoomd is open source, you can check out the full source code for yourself. Look for CellList.cu in the hoomd svn repository. There you will also find a method tuned for Compute 1.2/1.3 devices - but it is a lot more complicated to explain and not as simple as the version for compute 2.x. If you are really curious about it, search for an older forum post on binning by MisterAnderson42 (my former identity) on these forums. It does some sorting per block to reduce the number of atomicAdds that need to be made.

ChrisTC · February 1, 2011, 3:06pm

Hey, you didn’t have to cross-post to the hoomd-users mailing list - I post in this forum regularly!

The original paper is way out of date. GPUs are so fast now, that the binning on the host was actually becoming a 20% bottleneck!

Fermi and its awesome L2 cache comes to the rescue.

My current binning algorithm, which gets about a 10x speedup over the host, is as simple as:

for all particles in parallel

b = computeBin(x,y,z)

n = atomicInc(&d_bin_size[b])

d_ bins[b][n] = index;

Done!

Since hoomd is open source, you can check out the full source code for yourself. Look for CellList.cu in the hoomd svn repository. There you will also find a method tuned for Compute 1.2/1.3 devices - but it is a lot more complicated to explain and not as simple as the version for compute 2.x. If you are really curious about it, search for an older forum post on binning by MisterAnderson42 (my former identity) on these forums. It does some sorting per block to reduce the number of atomicAdds that need to be made.

Ah ok, thanks for your support! I will try out the sorting method to reduce the atomic operations and see what the performance gain is…