best way to perform in-block reduction

Hello, I’m trying to figure out a good way to reduce thread divergence in my GPU-based ODE integration code.

Each thread integrates an ODE using an adaptive algorithm that chooses its time step based on error estimates. Since different threads have different initial conditions, the adaptive step size differs from thread to thread, which causes divergence: some threads need more time steps than others, and there is also potential divergence in the portion of the code that calculates the next step size.

I thought one way to reduce this would be to use shared memory so that every thread in a block (or warp, really) uses the same time step size. In other words, calculate the minimum step size in the block, then have all threads use that.

This is the way I tried to do that:

__shared__ double cache[BLOCK_SIZE];

...

// Each thread publishes its locally chosen step size,
// then scans the whole block for the minimum.
cache[threadIdx.x] = stepSize;
__syncthreads();
for (int i = 0; i < BLOCK_SIZE; ++i) {
    stepSize = fmin(stepSize, cache[i]);
}

However, the performance seems to be worse than just letting the threads diverge, and I’m worried that I’m not doing this in the best way.

Will accessing shared memory this way create bank conflicts that reduce performance? And if so, is there a better way to do this?

I’ve looked a lot at global reduction operations and thought about using the local (per-block) portion of those. Would that be better?

Thanks!

I do realize that doing this per warp would probably be better, but I think the shared memory conflicts might be costing more than the divergence itself.
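
For reference, the per-warp version I have in mind is something along these lines (an untested sketch using warp shuffles instead of shared memory; it needs CUDA 9+, assumes the whole warp is active, and warpMinStepSize is just a name I made up):

__device__ double warpMinStepSize(double stepSize)
{
    // Butterfly reduction: after log2(warpSize) rounds every lane holds the minimum.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        stepSize = fmin(stepSize, __shfl_xor_sync(0xffffffff, stepSize, offset));
    return stepSize;
}

Each thread would then just call stepSize = warpMinStepSize(stepSize); so the whole warp steps with the same minimum.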

You don’t gain anything by avoiding divergence here. Divergence only costs time when threads of a warp execute different instructions, and instructions that are executed by only some threads of a warp cost no more than if all threads executed them. In other words, the performance is the same as if every thread executed every instruction but the extra instructions did no useful work.
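
To make that concrete, here is a toy comparison (made-up code, not from your kernel; needsRefinement and the iteration inside expensiveRefinement are just placeholders). Whenever at least one lane of a warp takes the branch, the whole warp steps through the body anyway, with the inactive lanes masked off, so the second version gains nothing over the first:

__device__ double expensiveRefinement(double y)
{
    // Placeholder for some expensive per-thread work.
    for (int i = 0; i < 100; ++i)
        y = 0.5 * (y + 2.0 / y);
    return y;
}

// (a) Divergent: only the lanes that need it enter the branch.
__global__ void divergentKernel(double *y, const bool *needsRefinement)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (needsRefinement[tid])
        y[tid] = expensiveRefinement(y[tid]);
}

// (b) Uniform control flow: every lane does the work, most of it discarded.
__global__ void uniformKernel(double *y, const bool *needsRefinement)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double refined = expensiveRefinement(y[tid]);
    if (needsRefinement[tid])
        y[tid] = refined;
}

Per warp, (a) costs the same as (b) whenever at least one lane needs the refinement, and (a) is strictly cheaper whenever no lane does.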


Ok, that might explain why the original code seems to be faster. In this time-stepping algorithm, there isn’t anywhere that some threads in a warp will be doing one thing while others do something else.

Instead, there are a couple of loops that some threads may exit before others (so some are doing more work). If I understand correctly, the threads will converge again after these points?

Yes, and exiting early comes at no penalty (other than that these threads don’t do anything useful).

Having these threads do useless work in order to achieve a uniform control flow will not speed up anything. Indeed it might even slow down execution if a whole warp now performs useless work instead of just sleeping until the synchronization point.

If you really want to speed up your kernel you would need to migrate work to adjacent threads so that whole warps can go to sleep.
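
One coarse-grained way to approach that (just a rough sketch, not taken from your code; integrateOneSystem stands in for your adaptive integration of a single system, and *nextSystem must be zeroed before launch) is to stop binding thread i to system i for the whole kernel and instead let each thread claim the next unprocessed system from a global counter. Threads that finish a cheap system immediately pick up new work, and once the counter runs out, whole warps retire together:

__device__ void integrateOneSystem(double *y);   // hypothetical: adaptive integration of one system

__global__ void integrateAll(double *y, int numSystems, int *nextSystem)
{
    while (true) {
        int i = atomicAdd(nextSystem, 1);   // claim the index of the next unprocessed system
        if (i >= numSystems)
            break;                          // nothing left to do: this thread retires
        integrateOneSystem(&y[i]);          // however many adaptive steps it takes
    }
}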

Thanks, you saved me some unnecessary analysis of why this wasn’t faster!