In implementing an O(N^3) matrix decomposition algorithm, I would like to synchronize operations between several blocks, which are presumably running on distinct multiprocessors. For the blocks to interact properly, I need to synchronize the exchange of control and data between them.
I assumed I could implement a simple spin-wait. The kernel below is launched as a grid of several blocks, with the intent that each block wait until all blocks with lower IDs have finished.
The unfortunate surprise is that this kernel deadlocks.
__global__ void StateMachine(volatile int *state) {
    // one thread in each block sets the state to zero.
    // One block’s write is guaranteed to succeed.
    if (threadIdx.y == 0) {
        *state = 0;
    }

    // wait until the state index equals the block ID
    int waiting = 1;
    do {
        if (threadIdx.y == 0) {
            waiting = (*state != blockIdx.y);
        }
        __syncthreads(); // delay
    } while (waiting);

    // … do relevant processing …

    // increment *state to trigger the subsequent block
    if (threadIdx.y == 0) {
        *state = blockIdx.y + 1;
    }
}
Compute Capability 1.1 devices include several useful atomic operations, but it seems like what I have attempted should work without them. Has anyone successfully implemented something like this before (with or without the atomic memory accesses)? Am I missing something obvious?
Is there an alternative to implementing fine-grain synchronization between multiprocessors on the G80?
Blocks run in an undefined order. You cannot, even on a compute 1.1 card, make one block wait on another without risking deadlock. The only exception to this rule is when ALL of the blocks in the grid are executing concurrently. See this thread on the official NVIDIA forums for a more detailed discussion.
I’ve chosen eight blocks in the test scenario. On an 8800 GTX that should be fewer than the number of multiprocessors, so they should all run concurrently, right? Is there a way to ensure this? That said, if I reduce the number of blocks, it should eventually start ‘working’ without a deadlock before I get down to a grid of exactly one block?
Technically, you can implement spin-wait between blocks by using gmem. (I’ve played with this myself, but gave up after a while). However, that’s asking for trouble in many ways and will hang your system quite often. The problem is that block scheduling order is not specified, so you cannot rely on a particular sequence. Therefore it’s possible (and happens) that some block never gets a chance to set a flag, while others are waiting for it.
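To make the hazard concrete, here is a minimal sketch of the kind of global-memory spin-wait being described (names and structure are hypothetical, and this is not a recommended pattern). If the producing block never gets scheduled onto a multiprocessor while the consuming block occupies one, the kernel hangs:

    // Sketch only: block 1 spins on a flag that block 0 publishes.
    // If block 0 is never resident while block 1 spins, this hangs,
    // because block scheduling order is not specified. (Ordering of
    // the two global writes is itself not guaranteed on G80 either;
    // this sketch only illustrates the scheduling hazard.)
    __global__ void SpinWaitDemo(volatile int *flag, int *data) {
        if (blockIdx.x == 0) {
            if (threadIdx.x == 0) {
                data[0] = 42;   // produce a value
                *flag = 1;      // then publish it
            }
        } else {
            if (threadIdx.x == 0) {
                while (*flag == 0)
                    ;           // spin until block 0 publishes -- may never happen
            }
            __syncthreads();    // rest of block 1 waits on thread 0 here
            // ... consume data[0] ...
        }
    }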
Why all the threadIdx.y checks? If you have 2D blocks, won’t there be a lot of threads with threadIdx.y == 0?
Ignoring that, what are you trying to accomplish? It looks like you are trying to force your blocks to run sequentially. If that is the case, why not just do one block and a for loop, as in the sketch below? Though you will be wasting most of the device resources; the programming guide suggests hundreds of blocks to get good interleaving.
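Something like this, assuming the per-step work fits in one block (the step body here is just a placeholder):

    // Sketch: run the steps sequentially inside one block instead of
    // synchronizing several blocks against each other.
    __global__ void SequentialSteps(float *data, int numSteps) {
        for (int step = 0; step < numSteps; ++step) {
            // ... per-step work on data, split across the block's threads ...
            __syncthreads(); // all threads finish this step before the next
        }
    }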
Anyway, ignoring that, let’s see if we can fix your code.
You probably need a __syncthreads() after

    if (threadIdx.y == 0) {
        *state = 0;
    }

to avoid a race condition where some warps read *state before it is written.
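That is, something along these lines:

    if (threadIdx.y == 0) {
        *state = 0;
    }
    __syncthreads(); // make sure the store happens before any thread in the block polls *state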
Here is your real culprit for the deadlock, though.
    int waiting = 1;
    do {
        if (threadIdx.y == 0) {
            waiting = (*state != blockIdx.y);
        }
        __syncthreads(); // delay
    } while (waiting);
waiting is a per-THREAD variable. You only set waiting to 0 on a single thread in the block. All other threads in the block will still read waiting = 1, so not all threads will call __syncthreads(), causing the deadlock.
If you want to implement that loop correctly, waiting needs to be in the shared memory space. It probably needs to be declared volatile, too, so that the compiler won’t optimize the reads away.
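A corrected version of the loop might look something like this sketch. The shared flag is the key change; the extra leading barrier keeps one iteration’s write from racing with the previous iteration’s read of the flag:

    __shared__ volatile int waiting; // shared so the whole block sees thread 0's result

    do {
        __syncthreads(); // keep thread 0's next write from racing the previous read
        if (threadIdx.y == 0) {
            waiting = (*state != blockIdx.y); // one thread polls the global flag
        }
        __syncthreads(); // everyone sees the updated flag before testing it
    } while (waiting);

Note that this only fixes the intra-block deadlock; the block-scheduling problem described above still applies.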
I am indeed trying to force blocks to run sequentially. Each block stores part of a matrix in shared memory, and one block computes a series of transforms the other blocks may apply to their respective submatrices. If I were to launch a kernel for each step of that procedure, I would have to perform a bulk read and bulk write for each kernel invocation. With a single launch and synchronized blocks, I believe I can reduce this to one read and one write per run of the algorithm.
With synchronization between blocks, there are other instances in which I can overlap large memory transfers from global to shared while another block processes the data it has already fetched.
Perhaps there is a better way. Is it possible to guarantee the persistence of shared memory across kernel invocations?