Thread Divergence in Particle Simulations with Reflecting Boundary conditions

Hey guys,

I am writing a particle simulation in CUDA C and in my simulation the particles move about within a bounding primitive such as a cylinder or sphere. The trajectories are independent of each other and therefore particle collisions are not an issue. However, when a particles trajectory intersects the bounding primitive, the particle is reflected (specularly), sometimes more than once.

The pseudo-code for the boundary intersection and reflection would look something like this:

while ( boundaryIntersection ( particlePosition ) == true ){

    place particle at boundary;
    reflect particle;


Depending on how complicated the bounding object is and where the particle is located, the while loop may be executed lots of times for a thread or not at all. To battle the thread divergence, one idea I had was to sort the particles according to their distance to the bounding object. Particles close to the boundary would be grouped together and therefore more likely to execute in unison. For this type of simulation, would there be any better strategies for minimizing the divergence?


Do You calculate the whole particle path within the boundary in the loop You provided above? It is a bit unclear for my, I daresay.


You could probably resort to using ternary operators and let 20-50% of the warp wotk along doing “meaningless” work rather than letting it diverge. It’s hard to say how this would perform as it’s a data dependant “solution”.

On another note: You can do “reflections” in texture memory by setting it to MirrorMode, then you might be able to perform reflections directly in hardware!

In other words (as I would do it):

  1. Every particle has a position p and a velocity vector v (direction of movement).
  2. In every simulation frame an intersection of a particle and boundary is computed (according to the particle position) - function 'intersect()', that returns 1 for hit and 0 for no hit.
  3. Finally particle new direction and position are computed as follows:
v = v * (!intr) + reflect(v,n) * intr;
p = p + v;

where intr is outcome of ‘intersect()’ function and ‘n’ is a normal vector for the hit point.


This is a fairly common type of divergence. You can reduce the overhead by periodically compacting the warp.

For example

bool done = false;

    for(unsigned int i = 0; i < N; ++i)
        if(! boundaryIntersection(particlePosition))
            done = true;
        place particle at boundary;
        reflect particle;

    done = compactWarp(done);

The compactWarp function performs stream compaction on the threads that are still active.

The other strategy that I have seen is to process multiple particles with the same thread. The goal is typically to reduce the length of time when only some threads are active. Be careful about this though. It works best when the variance in iteration counts of particles is small.

Thanks for the suggestions so far! I will try them out and report back.

cmaster.matso - Yep, I calculate the entire path within the while loop. If the boundaryIntersection method returns true, it sends back some useful information to the while loop for calculation of the trajectory.