Overcoming the 512 Thread Per Block Limit And Synchronization Question on how to deal with the 512 T

There is a limit on the number of threads per block.
Only threads within a single block can be synchronized.

Given:
A problem that is decomposed into a number of pieces that exceed 512, but each of these sub-problems need to be synchronized before continuing, what is the recommended approach for such a problem?

For example, here is a meaningless computation:

Given a list of 1000 integers, we want to do the following:
If the element at index i is greater than element i - 1, increment by 1, else decrement by 1.
We want to do this 10 times.

So, this means that the threads need to be synchronized before the next iteration can occur.
However, the number of threads needed to do all this in parallel would be 1000.
It doesn’t fit into a block, but I want to synchronize.

The example above is an arbitrary example intended to explain my question more clearly, if you have a solution for not requiring synchronization, or reducing the block size, that’s not what i’m trying to do. The actual problem I have has a decomposition that consists of > 512 threads that need to be synchronized.

What is the recommended approach for dealing with this kind of problem?

Thanks in advance for any help! Greatly appreciated! :)

You can’t, it’s a hardware limitation, and incredibly unlikely to be changed (at least in any significant way).

Use separate input and output arrays, and tile it with some tile boundary. example, assuming the left edge doesn’t change

0 5 2 3 4 2

0 6 1 4 5

0 7 0 5
0 5 2 3 4 2 2 3 1 1

	  4 5 1 2 4 0

		6 0 3 5

If you need synchronization, then the recommended approach is to launch separate kernels.

However, I find that the only use for synchronization is to solve dependency problems, where some threads depend on the actions of other threads which may or may not have occurred yet. But synchronization is not the only way to solve dependency problems. For example, in your sample problem, you could have each block load a window that overlaps slightly, and then discards the edge values. This is the solution that gatoatigrado suggested.

Reorganizing the algorithm to avoid needing synchronization may or may not be better than running kernels sequentially, depending on your particular problem.