Synchronization problem How can we synchronize blocks?

__syncthreads() allows to synchronize threads in a block.

Is there any function to synchronize blocks (in the same grid) while we are inside the kernel?

This has been discussed here many times.

The answer to your question is no. There is no function. Actually the CUDA execution model prohibits grid-wide synchronization of blocks.

One solution that works for some people is to split up the kernel into multiple parts and start them sequentially. Whenever one kernel returns and the next one is started the blocks are synchronized.

Another solution that worked for other people is to program a mutex. This however only works within very limited parameters (the size of the grid) and also requires compute capability 1.1 (atomic operations). Others have discussed this for compute capability 1.0 but I’m not sure how far they got (

Anyway just use the forum search and look for “synchronization”. You’ll get plenty of hits where your question is discussed.


Actually, it is possible to do block synchronization using just reads and writes. This is based on some fairly old results in the literature on scheduling.

Imagine N+1 thread blocks, with N workers and 1 scheduler. Each worker has 1 “comm” slot, which we will assume is atomically read or written. For simple applications the comm slot can be an int.

Startup is not too bad. Block 0 is the scheduler. The choice of block number for the scheduler is arbitrary, but the scheduler has to start. The initial value in the comm slot for all workers is FREE, and the very first thing a work does is to change it to AVAIL.

The scheduler has a pool of work units, and assigns the index for the work unit to some arbitrarily chosen ready worker by writing the index (plus a REQUEST bit) to the comm slot. The worker performs the work, and writes the index of the result (plus a REPLY bit) to the comm slot. The scheduler polls the comm slots for replies and acts on the replies, which can determine further work unit assignments. The scheduler can post work units to any ready worker, which is marked as AVAIL or that has a valid REPLY (AVAIL can be a special case of REPLY).

Shutdown initiates when the scheduler runs out of work units to initiate. The scheduler waits for all busy workers to become ready (with AVAIL or REPLY), then sets the scheduler comm slot to FINISH, and terminates. Each worker polls the scheduler comm slot in addition to polling its own comm slot, so the workers should terminate when they observe the global FINISH.

I’ve written test code for this method, and it works, although I don’t have permission yet to post the code (sorry). There are many complications, especially if one needs to communicate large amounts of mutable data between workers, but the outline remains the same.

This code will break if we cannot reliably find a block index to give to the scheduler (block 0 works in practice so far, but nVidia could change things). It will also break if reads and writes to integers are not atomic, or if the thread blocks are not allowed to run to completion once started. As far as I can tell this code does not depend on the order of memory requests being preserved.

I don’t see the ingenious idea behind this method. This is just a simple scheduling algorithm.

The limitations still seem to be: atomic reads and writes are required. And very crucial: the number of blocks that can be started is limited to the number of blocks that can run concurrently on the card. The last point introduces more problems:

  1. this number varies from card to card.
  2. this enforces a very low number of blocks on your kernel which could result in poor performance/occupancy.

Maybe I got your idea wrong, would you like to clarify?

Actually, I did not claim that this is especially new, I just said that I can’t post the code yet. It is a fairly basic scheduler, adapted to the CUDA environment. In principle one can use the scheduler to provide crude mutexes by a suitable request/reply protocol, but only if the mutex handling occurs at work unit boundaries. It’s better to avoid mutexes altogether by scheduling the work as a series of actions that don’t need mutexes.

Yes, atomic reads and writes are required, but for this simple case I think that I can rely on atomic reads and writes for 32-bit integers, provided that they are only read and written as words. Atomic read/modify/write operations are NOT required. I’ve not yet determined if it is safe to use int2 or int4 as atomic. Cascading up from one-word atomic to multi-word atomic is tricky, and I’ve not yet tested that either. It would be nice to know more about memory access reordering for CUDA, but I expect silence on this because nVidia would like to be able to change things around in the future.

Although you could read the configuration and only launch enough blocks to cover the active blocks, this approach allows one to launch more blocks than would be active. The schedule never sees the non-active blocks because they never transition from the FREE to AVAIL state until all of the work is done, at which point they all exit quickly because the scheduler has a FINISH state. The protocol is quite forgiving: you can have more or less workers than the number of active blocks, although having too many workers (within reason) carries little penalty.

Also, you get to launch at least as many blocks as the maximum active number of blocks, which means that you are using all of the resources that the card has. Blocks that are non-active don’t get run on the hardware scheduler, either.

I hope that this answers a few questions. I have no practical application for this kind of block scheduler myself, but I thought that it was an interesting programming challenge.

It should be. This was confirmed in a post a ways back, I don’t remember enough to find it again.

Now it is clearer to me how you designed your algorithm. Thanks for clarifying. I now agree that this algorithm can work.
It’s a nice solution. I can’t imagine a practical application either but I agree - it is an interesting challenge.

I should point out that there is one non-trivial problem using this scheduler scheme for mutexes on shared memory. The problem is as follows:

1: Worker X writes value V to location A in global memory

2: Worker X writes the comm slot X, granting access to A

3: Scheduler reads comm slot X and comm slot Y

4: Scheduler writes comm slot Y, giving a work unit to worker Y

5: Worker Y reads comm slot Y

6: Worker Y reads location A

Assume that there are sufficient syncthread calls to eliminate divergence within a block. The hard part is determining what can be inferred bout the state of some arbitrary global location given knowledge of the comm slot.

The question is: in step 6, will worker Y always read the value V from location A? Or can that write (from step 1) be delayed such that worker Y will read a stale value?

So far I don’t know the answer. Assuming int2 and int4 writes are atomic (see earlier post from MA42) it would be best to declare the comm slots to be quadwords wide and only rely on the data being passed through the comm slots.

Or am I asking for the kimono to be excessively parted?

When reading your idea I imagined the worker code to look something like this: (I’m referring to one worker block as worker):

  1. worker block starts
  2. worker writes ready signal to workerspecific com slot
  3. worker waits to get released by scheduler (busy waiting loop?) - locks itself on com slot
  4. worker gets released by scheduler
  5. worker works, e.g. all threads calculate something and write it back to dedicated global memory area (scheduler ensures that no other block is released on this specific global memory space)
  6. worker synchronizes itself (__syncthreads()) to ensure all threads are done wit their work
  7. worker writes ready signal to com slot

I think point 5 should ensure that the worker (all threads of the worker block) is done and no memory writes are pending.
I don’t know if nvcc reorders instructions. If yes this could be a problem because nvcc might realize that it doesn’t matter when the ready signal is written to global memory because for nvcc it’s all the same.

seb, you are basically correct. I’m not concerned about the syncthreads, although they are definitely needed in spots. I’m concerned with the reordering of writes done in one block as “seen” by reads in another block. It does not really matter if the reordering is done by nvcc or the memory subsystem.

It is a good idea to declare the comm slots and any memory that may be shared as volatile to encourage reads and writes to take place as they are requested, and not optimized away. However, this only addresses one part of the reordering dilemma.

I’m spending far too much time worrying about aspects of CUDA I don’t use. On the other hand, it is interesting in theory.