How to define particles in a CUDA program?

Hi all,

I am new to CUDA, and my current task is to implement a Particle Swarm Optimization algorithm in CUDA.

The algorithm defines a large number of particles, and each particle computes the result of a function. Every particle has an initial coordinate and a velocity, which is updated after each round of computation across all particles.

My idea is to use one thread per particle; the number of particles can be whatever I choose. My problem is how to define the particles. Since the particles are 2-dimensional in the simplest case, should I have a 2*n matrix and use cudaMemcpy to copy each row to the device's shared memory? I am somewhat confused by global, constant, and shared memory on the device.

Another problem is that the velocity of each particle must be modified after every computation, so I would need a __syncthreads() after each computation. I think that would be a little tricky in the program; could somebody offer advice?

I would greatly appreciate any help.

Thanks in advance.


Well, the implementation highly depends on a few aspects of your algorithm. Does this “function” depend on the coordinates of any particles other than the one it is modifying? In other words, does the force on particle i depend on any particle in the system other than particle i?

If not, then life is easy. There is no need for any complicated synchronization. __syncthreads() is only needed if you have multiple threads in the block reading and writing the same shared memory. Constant memory is cached, but it is tiny and not useful for tens of thousands to millions of particles. It is also slow when multiple threads of a warp access different values in constant memory (which this particle code will do).

The best (and simplest) method is to put all particle positions, velocities, and accelerations into global memory and read them in a coalesced manner. Non-coalesced reads will drop performance by a factor of 20 or more.

Here is a simple code example implementing velocity Verlet for this problem.

__global__ void advance_particles(float *d_px, float *d_py, float *d_vx, float *d_vy, float *d_ax, float *d_ay, int N, float dt, int nsteps)
{
   // each thread handles a single particle;
   // calculate that particle's index so that reads are coalesced
   int pidx = blockDim.x * blockIdx.x + threadIdx.x;

   // read in the current position, velocity, and acceleration
   float px = d_px[pidx];
   float py = d_py[pidx];
   float vx = d_vx[pidx];
   float vy = d_vy[pidx];
   float ax = d_ax[pidx];
   float ay = d_ay[pidx];

   for (int i = 0; i < nsteps; i++)
   {
      // calculate r(t + dt)
      px += vx * dt + 0.5f * ax * dt * dt;
      py += vy * dt + 0.5f * ay * dt * dt;

      // calculate v(t + dt/2)
      vx += 0.5f * ax * dt;
      vy += 0.5f * ay * dt;

      // calculate a(t + dt) from your problem-specific force law, e.g.:
      // ax = fx(px, py);  ay = fy(px, py);

      // calculate v(t + dt)
      vx += 0.5f * ax * dt;
      vy += 0.5f * ay * dt;
   }

   // px, py, vx, vy, ax, ay now hold the values at the initial t + nsteps*dt
   d_px[pidx] = px;
   d_py[pidx] = py;
   d_vx[pidx] = vx;
   d_vy[pidx] = vy;
   d_ax[pidx] = ax;
   d_ay[pidx] = ay;
}

A few caveats:

1) Either N must be a multiple of the block size, or d_px, d_py, etc. must be padded to a multiple of the block size so that memory reads/writes past the end of the arrays do not occur.

2) I haven’t benchmarked this myself, but you should try it as an exercise: very short kernels (i.e., the above one with nsteps=1) have a lot of overhead, and if you count the memory accesses and GFLOPS actually achieved compared to the device limits, you will be disappointed. I would expect that increasing nsteps would only increase the running time slightly, raising the effective GFLOPS and GB/s measured.

3) In regards to point 2: what if you need another kernel to operate on px, py, etc. every timestep, or you want to display them? Well, I leave it as an exercise to the reader to modify the loop to have each thread handle M particles. This means fewer blocks launched on the device, which will slow performance, but it also means there is less “small kernel” overhead, which will increase performance. There has to be an optimum somewhere in between.

4) Since I used only blockIdx.x for indexing, you are limited to 65535*blockDim.x particles. That may or may not limit you. It would be easy to also use blockIdx.y to expand that limit significantly.

I hope I gave you a few things to chew on. I’d be happy to answer any other questions you have. My advice to you would be to try implementing the simple kernel I gave you and play around with it to get a feel for CUDA programming.