There is a limit on the number of threads per block.

Only threads within a single block can be synchronized.

Given a problem that decomposes into more than 512 pieces, where all of these sub-problems need to be synchronized before continuing, what is the recommended approach?

For example, here is a meaningless computation:

Given a list of 1000 integers, we want to do the following:

If the element at index i is greater than the element at index i - 1, increment it by 1; otherwise decrement it by 1.

We want to do this 10 times.

This means that all of the threads need to be synchronized before the next iteration can begin.

However, the number of threads needed to do all this in parallel would be 1000.

That doesn’t fit into a single block, but I still need to synchronize across all of the threads.
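To pin down what I mean, here is a rough serial sketch of the computation in C++. It is only an illustration: the name `run_passes` is made up, and I am assuming that element 0 (which has no left neighbour) is left unchanged and that every element in a pass reads the values from the previous pass. The swap at the end of each pass marks the point where the parallel version would need all threads to be synchronized.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Serial reference for the toy computation (hypothetical helper name).
// Reads come from `cur` and writes go to `next`, so each pass only sees
// values produced by the previous pass.
// Assumption: element 0 has no left neighbour and is copied unchanged.
std::vector<int> run_passes(std::vector<int> cur, int passes) {
    std::vector<int> next(cur.size());
    for (int p = 0; p < passes; ++p) {
        if (!cur.empty()) {
            next[0] = cur[0];
        }
        for (std::size_t i = 1; i < cur.size(); ++i) {
            // Compare against the left neighbour from the previous pass.
            next[i] = (cur[i] > cur[i - 1]) ? cur[i] + 1 : cur[i] - 1;
        }
        // In the parallel version, every thread must reach this point
        // before any thread starts the next pass.
        std::swap(cur, next);
    }
    return cur;
}
```

For the example above this would be called as `run_passes(data, 10)` on the list of 1000 integers, with one thread per element replacing the inner loop.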

The example above is arbitrary and only intended to explain my question more clearly. If you have a solution that avoids the synchronization or reduces the block size, that’s not what I’m looking for. The actual problem I have decomposes into more than 512 threads that need to be synchronized.

What is the recommended approach for dealing with this kind of problem?

Thanks in advance for any help! Greatly appreciated! :)