Using global memory for broadcasting... Is there a better solution?

Many parallel algorithms require some sort of broadcasting among all processors. What would you do if a thread requires the output of another thread in another block? I've been using a flag in global memory and a loop that polls it until the value changes. It may look something like this:

// g must be declared volatile (or read with atomics), or the compiler may cache the load
while (g == 0); // spin as long as g == 0
foo(g);         // g is ready to be used
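
Spelled out a bit more, it's roughly the pattern below. This is only a sketch (the kernel and variable names are made up), and it relies on the flag and value being volatile, plus a __threadfence() on the producer side so the data is visible before the flag flips:

__device__ volatile int g = 0;      // flag in global memory, zero-initialized
__device__ volatile float value;    // the value being broadcast (made-up name)

__global__ void broadcast(float *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        value = 42.0f;       // stand-in for the real computation
        __threadfence();     // make the write visible device-wide before raising the flag
        g = 1;               // signal the other blocks
    } else {
        while (g == 0);      // volatile keeps this load inside the loop
        out[blockIdx.x] = value;
    }
}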

Is there any other solution for achieving the same effect?

I run into this very same problem every time I want to parallelize a dependent nested loop like this one:

for (int k = 0; k < n; k++) {
    for (int i = k; i < m; i++) {
        // do something with a[k][i], assuming a[k][i] depends on a[k-1][i]
    }
}

You can think of this code as an algorithm with n stages, where each stage depends only on the one before it. So the first row (a[0][i]) can be computed directly, but all the other rows have to wait until the first row is ready before they can move to their next stage. The second row depends only on the first. Once the second row is ready, rows 3, 4, and so on can each go one step further (for row 3, that would be its last step).

So to parallelize such an algorithm, some sort of broadcasting is needed: each thread has to wait until the row it depends on is ready to be used.

What is your solution?

The simplest way is to run each stage in a separate kernel call. Global synchronization is implicit between kernel calls.
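
As a rough sketch (stage_kernel and the launch configuration are placeholders for whatever your real stage computation is), the host side would look something like this:

// One launch per stage: kernels issued to the same stream run in order,
// so stage k always sees the completed results of stage k-1.
for (int k = 0; k < n; k++)
    stage_kernel<<<numBlocks, threadsPerBlock>>>(a, k, n, m);
cudaDeviceSynchronize();  // wait for the last stage before reading a back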

Thanks for your suggestion. I thought of separate kernel calls too, but wouldn't that be slower than my waiting-loop solution?

The problem here is in your algorithm rather than your choice of synchronization mechanism.

If a[k][i] really depends on a[k-1][i], then what you have is a chain of back-to-back true dependencies. In that case you usually want to change the order of iteration and the assignment of work to threads. For example, if your calculation took this form:

T0 T1 T2 T3
XX T0 T1 T2
XX XX T0 T1
XX XX XX T0

You would want to change it to something like this:

T0 T1 T2 T3
XX T1 T2 T3
XX XX T2 T3
XX XX XX T3

In the first case, the longest chain of dependencies is T0->T1->T2->T3. In the second case, the longest chain is T3->T3->T3->T3, and passing values within a thread is much faster than writing them to memory and signaling with barriers.
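
A sketch of the second mapping could look like the kernel below. The update f() and the flattened layout of a are assumptions; the point is only that the whole dependency chain for column i stays inside thread i, in a register:

// Stand-in for the real per-element update.
__device__ float f(float x) { return 0.5f * x + 1.0f; }

// Each thread owns one column i and walks the chain itself, carrying
// a[k-1][i] in a register instead of signaling through global memory.
// The bound k <= i mirrors the triangular iteration space (i starts at k).
__global__ void wavefront(float *a, int n, int m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;

    float prev = a[0 * m + i];           // stage 0 needs no predecessor
    for (int k = 1; k < n && k <= i; k++) {
        float cur = f(prev);             // uses a[k-1][i] held in a register
        a[k * m + i] = cur;
        prev = cur;
    }
}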

It depends. If your waiting-loop solution happens to work because you are lucky, all blocks execute concurrently, and the stars align, then yes, it's slightly faster (on the scale of microseconds).

More often, though, the separate kernel invocations will be faster, because your waiting loop will turn out to be an infinite loop: if you launch more blocks than can be resident on the GPU at once, the block that would set the flag may never get scheduled while the resident blocks spin forever.