Hello, I’m trying to figure out a good way to reduce thread divergence in my GPU-based ODE integration code.
Each thread is integrating an ODE using an adaptive algorithm that chooses the best time step based on error estimates. Since different threads have different initial conditions, the adaptive step size will differ between threads, resulting in divergence: some threads will need more time steps than others, and there is also potential divergence in the portion of the code that actually calculates the next step size.
I thought one way to reduce this would be to use shared memory to use the same time step size in each block (or warp, really). In other words, calculate the minimum step size in the block, then use that for all threads.
This is the way I tried to do that:
__shared__ double cache[BLOCK_SIZE];
...
cache[threadIdx.x] = stepSize;
__syncthreads();
for (int i = 0; i < BLOCK_SIZE; ++i) {
    stepSize = fmin(stepSize, cache[i]);
}
However, the performance seems to be worse than just letting the threads diverge, and I’m worried that I’m not doing this in the best way.
Will accessing shared memory this way create bank conflicts that reduce performance? And if so, is there a better way to do this?
I’ve looked a lot at global reduction operations, and thought about using the local (per-block) portions of those—would that be better?
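For example, I was considering something like this warp-level minimum using shuffles instead of shared memory. This is just a sketch of what I had in mind—it assumes a full 32-thread warp and the CUDA 9+ `_sync` shuffle intrinsics:

```cuda
// Sketch: warp-level minimum via shuffle, so every thread in the warp
// ends up with the same step size without touching shared memory.
// Assumes all 32 lanes are active and compute capability >= 3.0.
__device__ double warpMinStep(double stepSize) {
    const unsigned FULL_MASK = 0xffffffffu;
    // Tree reduction: each iteration halves the distance between lanes.
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        double other = __shfl_down_sync(FULL_MASK, stepSize, offset);
        stepSize = fmin(stepSize, other);
    }
    // Lane 0 now holds the warp minimum; broadcast it to all lanes.
    return __shfl_sync(FULL_MASK, stepSize, 0);
}
```

That avoids the O(BLOCK_SIZE) serial loop over shared memory entirely, but I don't know if it addresses the underlying performance issue.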
You don’t gain anything by avoiding divergence here. Divergence is costly only when threads of a warp execute different instructions, so both paths must be serialized.
Instructions that are executed by only some threads of a warp do not reduce performance compared to the situation where all threads execute all instructions but the additionally executed instructions perform no useful work.
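To make the distinction concrete, here is a hypothetical sketch (the function names are placeholders, not from your code):

```cuda
// Case 1: true divergence within a warp. The warp serializes both
// branches, so every thread effectively pays for expensive_path()
// AND cheap_path().
if (threadIdx.x % 2 == 0)
    expensive_path();
else
    cheap_path();

// Case 2: some threads simply finish a loop earlier. Threads that
// have exited are masked off and cost nothing extra; the warp only
// runs as long as its slowest thread needs.
while (t < t_end) {
    t += stepSize;   // per-thread adaptive step
    // ... integrate one step ...
}
```

Your adaptive stepping is case 2: forcing a uniform step size doesn’t remove any serialized work, it just makes the fast threads do useless iterations.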
Ok, that might explain why the original code seems to be faster. In this time-stepping algorithm, there isn’t anywhere that some threads in a warp will be doing one thing while others do something else.
Instead, there are a couple of loops that some threads may exit before others (so some are doing more work). If I understand correctly, the threads will converge again after these points?
Yes, and exiting early comes at no penalty (other than that those threads don’t do anything useful while they wait).
Having these threads do useless work in order to achieve a uniform control flow will not speed up anything. Indeed it might even slow down execution if a whole warp now performs useless work instead of just sleeping until the synchronization point.